Can we trust statistics in fMRI studies?
Functional MRI (fMRI) is one of the most celebrated tools in neuroscience. Because of its unique ability to peer into the living brain while an organism thinks, feels and behaves, fMRI studies often receive disproportionate media attention, replete with flashy headlines and grandiose claims. However, the technique has come under a fair amount of criticism from researchers questioning the validity of the statistical methods used to analyze fMRI data, and hence the reliability of fMRI findings. Can we trust those flashy headlines claiming that “scientists have discovered the <insert political affiliation, emotion or personality trait> area of the brain,” or are the masses of fMRI studies plagued by statistical shortcomings? To explore why these studies can be vulnerable to experimental failure, coauthors Henk Cremers, Tor Wager and Tal Yarkoni investigated common statistical issues encountered in typical fMRI studies in their new PLOS ONE paper, and proposed how to avert them moving forward.
fMRI simulation
The reliability of any experiment depends on having adequate statistical power to detect real effects while avoiding spurious ones. Power is influenced by several factors, including the sample size (the number of “subjects” in fMRI), the strength of the real effect (“effect size”), whether comparisons are made within or between subjects, and the statistical threshold used. To characterize the common statistical culprits in fMRI studies, Cremers and colleagues first simulated typical fMRI scenarios before validating these simulations on a real dataset. One scenario simulated weak but diffusely distributed brain activity; the other simulated strong but localized activity (Figure 1). The simulation revealed that effect sizes are generally inflated for weak diffuse activations compared to strong localized ones, especially when the sample size is small. In contrast, effect sizes can actually be underestimated for strong localized effects when the sample size is large. Thus, more isn’t always better when it comes to fMRI; the optimal sample size likely depends on the specific brain-behavior relationship under investigation.
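To make the inflation mechanism concrete, here is a minimal Python sketch (not the authors’ actual simulation code) of the “weak diffuse” scenario: every simulated voxel carries the same small true effect, but only voxels passing a typical uncorrected threshold are examined, so the surviving effect-size estimates overshoot the truth, dramatically so when the sample is small. The specific values (true d = 0.2, 20,000 voxels, p < 0.001) are illustrative assumptions.

```python
# Illustrative sketch of effect-size inflation under thresholding, not the
# paper's simulation. All parameter values below are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.2        # weak, diffuse true effect (assumed)
n_voxels = 20_000   # independent simulated "voxels" (assumed)
alpha = 0.001       # typical uncorrected voxel-wise threshold

for n_subjects in (15, 100):
    # subjects x voxels matrix of contrast values with mean true_d, sd 1
    data = rng.normal(loc=true_d, scale=1.0, size=(n_subjects, n_voxels))
    t, p = stats.ttest_1samp(data, popmean=0.0, axis=0, alternative='greater')
    sig = p < alpha                                        # surviving voxels
    observed_d = data.mean(axis=0) / data.std(axis=0, ddof=1)
    print(f"n={n_subjects:>3}: {sig.mean():.2%} of voxels significant; "
          f"mean d among significant voxels = {observed_d[sig].mean():.2f} "
          f"(true d = {true_d})")
```

Because only voxels with large sample estimates can clear the threshold when n is small, the average effect size among “significant” voxels ends up several times larger than the true value, mirroring the inflation the simulations describe.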
Real data application: Human Connectome Project
Next, Cremers and colleagues analyzed a real fMRI dataset of nearly 500 subjects performing a social cognition task, collected through the Human Connectome Project. To represent typical fMRI studies, random samples of 15 subjects were drawn to assess within-subject effects, and random samples of 30 subjects were used for between-subjects analyses. Results from these subsamples were compared to results from the full sample of nearly 500 subjects, which served as a rough proxy for the true effects. As in the simulation, effect size estimates shrank as the sample size grew. Because within-subject contrasts are by nature higher-powered, and because the brain-behavior effect analyzed was strong, real within-subject activations were readily detected even with the small sample of 15 individuals, resembling the simulated “strong localized” effects. In contrast, between-subject effects, even with twice the sample size, were unreliable and inflated, closely resembling the simulated “weak diffuse” scenario (Figure 2).
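The subsampling logic can be sketched in a few lines. The example below uses synthetic stand-in data rather than the actual HCP files: it assumes a weak true brain-behavior correlation (r = 0.2), treats a large “full sample” as the proxy for truth, and shows how correlation estimates from repeated samples of 30 subjects behave once only the “significant” ones are kept.

```python
# Rough sketch of the between-subject subsampling idea, using synthetic data.
# The full-sample size, true correlation, subsample size, and threshold are
# all assumed values for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_full, true_r = 500, 0.2
behavior = rng.normal(size=n_full)
brain = true_r * behavior + np.sqrt(1 - true_r**2) * rng.normal(size=n_full)
full_r, _ = stats.pearsonr(brain, behavior)   # full-sample proxy for the truth

sig_rs = []
for _ in range(1000):
    idx = rng.choice(n_full, size=30, replace=False)
    r, p = stats.pearsonr(brain[idx], behavior[idx])
    if p < 0.05:                               # keep only "publishable" subsamples
        sig_rs.append(r)

print(f"full-sample r = {full_r:.2f}; "
      f"mean |r| among significant n=30 subsamples = {np.mean(np.abs(sig_rs)):.2f}")
```

With only 30 subjects, a correlation must be roughly twice the assumed true value to reach significance, so the subsamples that would get reported are systematically inflated, echoing the “weak diffuse” pattern above.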
These findings offer a few key take-away lessons. Underpowered studies can produce grossly overestimated effects and inconsistent activations, making results difficult to replicate. Moreover, driven in part by a common fear of Type I errors (false positives), many studies adopt stringent thresholds that, combined with low power, cause them to miss legitimate effects (Type II errors), encouraging incomplete or inaccurate inferences about brain-behavior relationships. Wager underscores that these pitfalls aren’t entirely unique to fMRI:
“Having a healthy skepticism applies to most areas of science—genetics, animal models of disorders, psychology, and much more—not just fMRI. And I think skepticism should be healthy skepticism—we can get provisionally excited about published ideas, and more excited about them as they pass the test of time and begin to be replicated across more groups. Publishing an exploratory finding that may not be true is not wrong, if you’ve made a good faith attempt to get at the truth—we just have to have patience and not ‘over-believe’ until the evidence comes in.”
Avoiding statistical pitfalls
So what can a neuroscientist do to adequately power their fMRI experiment to detect true effects while minimizing false or unreliable ones? A power analysis using realistic parameter estimates to compute the necessary sample size is a critical first step (a sketch of such a calculation follows below). Unfortunately, the requisite sample size may be prohibitive in time and money for an individual researcher without access to large datasets. The growing trend toward “big data” may partially address this problem, but, as Wager explains, “it’s even more important to have ‘smart data.’ Large datasets can produce spurious results with high confidence in many cases where hidden causes are not properly accounted for. Smart data means careful sampling and analysis to improve inferences about what the real causes of effects are.” Once the data have been collected, a researcher can optimize their analysis by striking the right balance between Type I errors (false positives) and Type II errors (false negatives), which largely depends on the specific study scenario. Finally, the number of statistical comparisons can be limited by employing hypothesis-driven testing or by taking advantage of machine learning, graph theory, or data reduction approaches.
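As a rough illustration of that first step, the sketch below uses statsmodels to solve for the sample size needed to detect a within-subject contrast with a one-sample t-test at a voxel-wise threshold; the effect sizes, alpha, and target power are illustrative assumptions rather than values prescribed by the paper.

```python
# Minimal a priori power analysis sketch for a group-level one-sample t-test
# (a typical within-subject contrast). Effect sizes, alpha, and power are
# assumed for illustration.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
for d in (0.2, 0.5, 0.8):   # weak, medium, and strong assumed effects
    n = analysis.solve_power(effect_size=d, alpha=0.001, power=0.8,
                             alternative='two-sided')
    print(f"Cohen's d = {d}: ~{int(round(n))} subjects for 80% power at p < .001")
```

Running this makes the trade-off explicit: the required sample size climbs steeply as the assumed effect weakens, which is exactly why weak, diffuse between-subject effects demand far larger samples than strong within-subject contrasts.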
A movement to standardize fMRI methods has recently emerged, propelled by papers including the OHBM Committee on Best Practice in Data Analysis and Sharing (COBIDAS) report on MRI and the “Guidelines for reporting an fMRI study.” According to Cremers, such efforts are “indeed essential, since there are so many choices to make when it comes to data acquisition and analyses. Also, initiatives like the Coursera fMRI course of Lindquist and Tor, are great and very important—they make the principles of fMRI research accessible to a wide audience.”
References
Cremers HR, Wager TD, Yarkoni T (2017). The relation between statistical power and inference in fMRI. PLOS ONE 12(11): e0184923. doi: 10.1371/journal.pone.0184923
Nichols TE, et al. (2016). Best Practices in Data Analysis and Sharing in Neuroimaging using MRI. bioRxiv. doi: 10.1101/054262
Poldrack RA, et al. (2008). Guidelines for reporting an fMRI study. Neuroimage 40(2): 409-414. doi: 10.1016/j.neuroimage.2007.11.048
Any views expressed are those of the author, and do not necessarily reflect those of PLOS.
Emilie Reas received her PhD in Neuroscience from UC San Diego, where she used fMRI to study memory. As a postdoc at UCSD, she currently studies how the brain changes with aging and disease. In addition to her tweets for @PLOSNeuro she is @etreas.