NEWSFLASH! In September I will join The Conversation as its Business and Economy Editor. I have been honoured to work at The Age for the past ten years, originally alongside the legendary Tim Colebatch, and for the past four years as economics editor in my own right.

At The Conversation, my job will be to make the best thinking from Australia's 40 universities accessible to the widest possible audience. That means you. From the new year I will also write a weekly column.

On this site are most of the important things I have written for Fairfax and the ABC over the past few decades. I recommend the Search function. The site is a record for you, as well as me.

I'll continue to post great things from The Conversation and other places here, and also on Twitter and Facebook. Enjoy.

Sunday, December 13, 2015

Creepy and foolish. Why most research findings are wrong

Perhaps the creepiest science experiment ever conducted was called Personal Space Invasions in the Lavatory: Suggestive Evidence for Arousal.

It examined the behaviour of men at urinals. (Yes, I know I discussed urinals in a previous column; I'm not going to make a habit of it.)

The researchers hid in a toilet stall and used a periscope to observe the behaviour of men standing up attempting to urinate. When the men had another man standing beside them, they took longer to start: 6.2 seconds after unzipping their flies, compared with 4.9 seconds. When another man (someone helping out with the experiment) stood behind them, they took even longer – 8.4 seconds.

And they finished up more quickly too. It's not comfortable when someone's invading your personal space.

I don't doubt for a moment that it takes men longer to start when someone is standing next to them. But the experiment had two major flaws. One is that the researchers knew what they were looking for. The other is that they timed only 60 men.

The Australian Bureau of Statistics employment survey is conducted on about 25,000 men and 25,000 women each month, and even it gets things spectacularly wrong. National employment didn't grow by 71,375 in November as reported; it probably grew by less than 6000. And it didn't soar by 121,000 last August and then sink by 172,000 last September either. The spikes and troughs were artefacts, brought about by the way the survey was conducted. Little things such as the order in which questions are asked make an enormous difference, as the Bureau has discovered to its cost.

Yet we are repeatedly asked to trust the results of studies conducted only once on tiny numbers of people.

One of my favourites is the gourmet grocery store jam study.

The Californian researchers set up a jam tasting booth, which at times had only six varieties on it, and at other times 24. When the booth offered 24 varieties, customers were less likely to end up picking one to buy at a discount. Too much choice made it hard to choose. But the survey involved only 242 people. Would it stand up if it had been conducted again in a different location by different researchers?

There are reasons to think it might not.

In the mid-1990s some New York University psychologists performed scrambled sentence tests on 60 students. Half were asked to unscramble sentences containing ordinary words. The other half were given words specifically related to ageing, such as old, lonely, and grey. As the students walked out, a researcher with a stopwatch timed how long it took them to reach the lift. Those who had been "primed" took longer. It became an accepted psychological truth.

Except that when other researchers performed the experiment two decades later using infrared detection instead of a stopwatch, the effect disappeared. As with the stopwatch in the lavatory, the people doing the experiment had been able to tilt the results in the direction they wanted.

Three years ago, University of Virginia researcher Brian Nosek embarked on an epic "reproducibility project", an attempt to reproduce 100 of the results reported in leading psychology journals throughout 2008. His shocking finding, almost a decade after those results were published, is that fewer than one-third stand up.

And in most of those that do stand up, the effect is weaker than first reported. And not only in psychology. In all disciplines, from physics to economics to medicine, findings seem to get smaller each time an experiment is repeated. Drugs that seem effective when first tested appear to get weaker with each successive test.

Nosek thinks he knows why. Low sample sizes mean it's very hard to get a finding that is statistically significant. When it does happen, it's often the result of chance. But it gets written up as a result. There's usually no attempt to get a second result. If there is, the cards will almost always fall the other way, making it less impressive.
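For readers who like to see the cards fall, here is a rough sketch of that argument as a simulation. It is not Nosek's method or data. The numbers are my own assumptions, chosen only for illustration: a small real effect (0.2 standard deviations), 30 people per group, and a simple cutoff for "statistically significant" (a positive difference more than 1.96 standard errors from zero). Each imaginary study that clears the cutoff is "published" and then repeated once.

```python
import random
import statistics


def run_study(true_effect, n, rng):
    """Simulate one two-group study and return the observed difference in means."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    return statistics.fmean(treated) - statistics.fmean(control)


def simulate(true_effect=0.2, n=30, studies=5000, seed=1):
    """Run many small studies, 'publish' the significant ones, repeat each once.

    All defaults are illustrative assumptions, not figures from any real study.
    """
    rng = random.Random(seed)
    # Standard error of a difference in means when each group has sd = 1.
    standard_error = (2.0 / n) ** 0.5
    # Assumed significance rule: a positive difference beyond 1.96 standard errors.
    threshold = 1.96 * standard_error

    originals, replications = [], []
    for _ in range(studies):
        first = run_study(true_effect, n, rng)
        if first > threshold:  # only significant findings get written up
            originals.append(first)
            replications.append(run_study(true_effect, n, rng))

    return {
        "true_effect": true_effect,
        "threshold": threshold,
        "share_significant": len(originals) / studies,
        "mean_original": statistics.fmean(originals),
        "mean_replication": statistics.fmean(replications),
        "share_replicated": sum(r > threshold for r in replications) / len(replications),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions only a minority of the small studies reach significance, and every one that does has to report an effect bigger than the cutoff, which here is well above the true effect. The repeat studies are not filtered that way, so on average they come in lower, and most of them fail to clear the cutoff a second time.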

I have no doubt that many of the findings I have reported during the past few decades have been wrong. I've latched on to them because they've been written up in prestigious journals, and because they have seemed right. I shouldn't have. It's the outcomes that seem right that we should most distrust. Not because they are necessarily wrong, but because the researchers who found them wanted to find them. Next year I'll be less trusting.

In The Age and Sydney Morning Herald