Improving Meta-analyses in Sport and Exercise Science
Will G Hopkins
Sportscience 22, 11-17, 2018 (sportsci.org/2018/metaflaws.htm)

See also the article/slideshow on meta-analysis.

## Introduction
As a reviewer of meta-analyses for several journals, I have noticed a worrying trend: authors of submitted manuscripts often justify their flawed analyses by citing published articles with similar flaws. My aim in this article is to identify flaws in several recent meta-analyses, in the hope of raising the quality of submitted and published meta-analyses in our disciplines. Numerous flaws in a recent meta-analysis by Coquart et al. (2016a) provided the stimulus for writing the article. I performed a search for meta-analyses published since 2011 over a wide range of topics to find more examples of errors and omissions (Burden et al., 2015; Chu et al., 2016; Hume et al., 2015; Josefsson et al., 2014; Peterson et al., 2011; Salvador et al., 2016; Soomro et al., 2015; Tomlinson et al., 2015; Wright et al., 2015). One of the meta-analyses (Soomro et al., 2015) turned out to have only one minor flaw. This article represents a summary of the flaws and advice on how to avoid them.

I submitted the article to Sports Medicine in 2016, but the editor decided that he would prefer a meta-analysis of meta-analyses, something I did not have the time or inclination to write. Instead, I wrote a letter to the editor about the Coquart et al. study (Hopkins, 2016), to which the authors responded by showing that they did not understand the difference between confidence limits and limits of agreement, by continuing to cite flawed meta-analyses as justification, and by claiming incorrectly that heterogeneity statistics are not always needed (Coquart et al., 2016b). After setting this article aside for a year, I decided to submit it for publication in Sportscience.

I have identified the flaws under subheadings that reflect the order in which a meta-analysis is usually performed. My assertions about the wrong and right ways to do a meta-analysis are based on an article/slideshow I published in 2004, which I have updated regularly (Hopkins, 2004), and on a more generic guidelines article (Hopkins et al., 2009). The right ways are exemplified (I hope) in recent meta-analyses I have co-authored (Bonetti and Hopkins, 2009; Braakhuis and Hopkins, 2015; Carr et al., 2011; Cassar et al., 2017; McGrath et al., 2015; Snowling and Hopkins, 2006; Somerville et al., 2016; Vandenbogaerde and Hopkins, 2011; Weston et al., 2014). The Cochrane handbook (Higgins and Green, 2011) is also a useful source of wisdom and software for simple meta-analyses, but only a powerful mixed-modeling procedure (e.g., Proc Mixed in the Statistical Analysis System) can provide the appropriate random effects to deal with the repeated measurement represented by multiple study-estimates from the same or different subjects in each study.

## Expressing Effects in the Same Metric
A meta-analyzed effect is a weighted mean of the effect across all selected studies. All effects must therefore be expressed as the same kind of measure in the same units for the mean to be meaningful. Most of the meta-analyses I chose for this article used unsuitable approaches here. For differences or changes in means, a common approach is to standardize each effect by dividing by the appropriate between-subject standard deviation (SD), for example the baseline SD of all subjects in a controlled trial.
Although the magnitude of the resulting effects can be interpreted directly, differences in the SD between studies (reflecting different populations, different methods of sampling, or just sampling variation) will introduce heterogeneity that is unrelated to any real differences in the effect between studies. Josefsson et al. (2014) used standardization to cope with the various depression scales in different studies of the effect of exercise interventions on depression. In the six studies using the Beck depression inventory, the range in the SD of baseline depression amounted to a factor of 7.5, which translates directly into artefactual heterogeneity. (The lowest SD is actually a standard error in the original publication; Josefsson et al. apparently used it incorrectly to standardize.) The best approach for dealing with disparate psychometric scales, including visual-analogue and multi-level Likert scales, is to rescale all of them linearly to a score with a minimum possible of 0 and a maximum possible of 100. The resulting score then represents an easily analyzed and interpreted percent of the full range of the psychological construct.

Standardization probably contributed to heterogeneity in the effects of iron supplementation on serum iron concentration, blood hemoglobin concentration and VO2max in the meta-analysis of Burden et al. (2015). The authors provided no SDs from the various studies to allow assessment of this issue or to convert the meta-analyzed standardized effects back into more meaningful units. By checking data in some of the original publications, I determined that much of the heterogeneity must have arisen from incorrect use of standard errors rather than SDs to standardize some effects. Standardization would also have contributed to heterogeneity in the meta-analysis that Salvador et al. (2016) conducted on the effect of ischemic preconditioning on exercise performance, if they had used the baseline standard deviation to standardize. Unfortunately, through misuse of the analysis program they effectively used the standard error of the change scores, which made the resulting standardized magnitudes meaningless.

Log transformation of effects expressed as factors, followed by back-transformation to factors or percents, is usually the best way to deal with physical performance and many other physiological measures. The decision between analysis of raw vs log-transformed effects hinges on which approach produces less heterogeneity. Peterson et al. (2011) used raw units (kg) for their meta-analysis of the effect of resistance exercise on lean body mass in aging adults, but if they had used log transformation, the apparently smaller effect in older adults would likely have disappeared or even reversed following back-transformation to percent units. Strangely, they provided irrelevant detail of a method of standardization using the SD of change scores, an error promulgated by the software package Comprehensive Meta-Analysis that again would have made the magnitudes meaningless. Dependent variables in the meta-analyses of Burden et al. (2015) and Salvador et al. (2016) were likely candidates for log transformation, as was strength in the meta-analysis of Tomlinson et al. (2015) on the effects of supplementation with vitamin D. Chu et al. (2016) may have made the right choice of raw units of concentration in their meta-analysis of the effect of exercise on plasma zinc, but they did not show enough data from each study for me to assess whether the effects were more uniform as factors.
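To make the two recommended transformations concrete, here is a minimal sketch in Python; all study values are hypothetical, and the 0-63 range is that of the Beck depression inventory. It rescales a psychometric score to a percent of its full range, and back-transforms a mean log-transformed factor effect to a percent effect:

```python
import numpy as np

def rescale_to_percent(score, scale_min, scale_max):
    """Linearly rescale a psychometric score to percent of full range (0-100)."""
    return 100 * (score - scale_min) / (scale_max - scale_min)

# A change of -7.5 units on the 0-63 Beck depression inventory (hypothetical value)
change_pct = rescale_to_percent(-7.5, 0, 63) - rescale_to_percent(0, 0, 63)
print(f"change = {change_pct:.1f}% of full range")

# Performance effects expressed as factors: log-transform, average, back-transform
factors = np.array([1.03, 1.05, 1.02])   # hypothetical study effects
mean_log = np.log(factors).mean()        # unweighted mean, for brevity
print(f"mean effect = {100 * (np.exp(mean_log) - 1):.1f}%")
```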
Effects on time-dependent events such as injury incidence are reported as ratios of odds, risks, incidence rates (hazards), or other rates (counts per some measure of exposure). When the risks are low (that is, <10% of the sample experience the event during the period of observation), all these ratios effectively have the same value, so they can be meta-analyzed and interpreted as risk ratios. When risks are higher, the hazard and other rate ratios coincide and are appropriate for interpreting and meta-analyzing factors affecting risk, whereas the odds ratio and the usual risk ratio increasingly overestimate and underestimate effects, respectively. Use of odds ratios by Wright et al. (2015) to meta-analyze risk factors in prospective studies of stress fractures in runners was therefore misguided, and they did not provide injury counts and sample sizes for each study to allow assessment of the extent to which their use of odds ratios introduced upward bias and artefactual heterogeneity in the risk factors they analyzed. (Anyone planning an injury meta-analysis should note that odds ratios in case-control studies are effectively hazard ratios, and that prevalence proportions should be meta-analyzed as odds ratios and converted back to proportion ratios for interpretation of clinical importance.)
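The behavior of the three ratios can be verified numerically. A sketch with hypothetical event proportions, assuming constant hazards over the observation period:

```python
import numpy as np

def risk_odds_hazard_ratios(p1, p0):
    """Ratios for event proportions p1 (exposed) and p0 (reference)."""
    risk_ratio = p1 / p0
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    hazard_ratio = np.log(1 - p1) / np.log(1 - p0)  # constant-hazard assumption
    return risk_ratio, odds_ratio, hazard_ratio

# Low risks (<10%): all three ratios are ~2, so any can stand in for the risk ratio
print(risk_odds_hazard_ratios(0.04, 0.02))   # (2.00, 2.04, 2.02)

# High risks: the odds ratio overestimates and the risk ratio underestimates
# the effect relative to the hazard ratio
print(risk_odds_hazard_ratios(0.60, 0.30))   # (2.00, 3.50, 2.57)
```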
Validity studies in which a practical measure is compared with a criterion provide several candidate measures for meta-analysis: the
correlation coefficient, mean bias, bias at a predicted value, random error,
and limits of agreement. The correlation coefficient is sensitive to the
between-subject SD, so I would usually avoid it. If biases and errors are
more uniform across studies when expressed as percents, they should be
converted to factors and log-transformed before analysis. As an SD, the
random error suffers from small-sample downward bias, a problem that is
easily solved by expressing it as a variance (after any log-transformation),
then taking the square root of the meta-analyzed mean. The variance also has a well-defined standard error, which is needed for weighting the effects (see below).
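A minimal sketch of this variance trick, with hypothetical random errors and sample sizes, using the standard error of a sample variance under normality (SE = variance x sqrt(2/(n-1))):

```python
import numpy as np

sd = np.array([2.8, 3.5, 3.1])        # hypothetical random errors from three validity studies
n = np.array([20, 35, 28])            # their sample sizes

var = sd ** 2                         # log-transform first if errors are more uniform as percents
se_var = var * np.sqrt(2 / (n - 1))   # standard error of a sample variance (normality assumed)
w = 1 / se_var ** 2                   # inverse-variance weights (see next section)

mean_var = np.sum(w * var) / np.sum(w)
print(f"meta-analyzed random error = {np.sqrt(mean_var):.2f} units")
```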
The bias in the SD is practically negligible for the usual sample sizes in validity studies, so no real harm was done when Coquart et al. (2016a) meta-analyzed SDs of the
difference between criterion VO2max in a maximal test and VO2max predicted by
submaximal tests. They then converted the meta-analyzed mean bias and mean
random error into mean limits of agreement, and in a serious omission they
provided no uncertainties (confidence limits) for any of these measures.
Furthermore, they showed meta-analyzed random-error components of about ±4 ml.min-1.kg-1.

## Dealing with Standard Errors
A meta-analyzed effect is a weighted mean of effects, where the weighting factor is the inverse of the square of each effect's expected sampling variation, its standard error. Using a study-quality score as the weighting factor, as Hume et al. (2015) did in a meta-analysis of snow-sport injuries, is incorrect. Depending on the design of the studies and the analysis package, the meta-analyst may input data or inferential statistics (p values or confidence limits) from each study without having to derive or impute the standard error for each effect. Exactly what was done needs to be stated, to satisfy readers that this step was performed correctly and to guide future meta-analyses. Coquart et al. (2016a) provided no inferential statistics or information about the standard errors for the two validity statistics they meta-analyzed, bias and random error. Tomlinson et al. (2015) input post-intervention means and SDs into the meta-analysis software, when they should have input mean pre-post change scores and associated inferential statistics. The way Burden et al. (2015) combined pre and post scores is unclear, and Peterson et al. (2011) did not provide enough data from each study for me to check their analyses. Most of the other meta-analysts did (Chu et al., 2016; Josefsson et al., 2014; Salvador et al., 2016; Soomro et al., 2015; Wright et al., 2015), but only Chu et al. (2016) and Soomro et al. (2015) also provided adequate documentation.
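A minimal sketch of inverse-variance weighting with hypothetical study effects and standard errors (confidence limits shown at the 90% level):

```python
import numpy as np
from scipy import stats

effect = np.array([1.2, 0.8, 1.5, 0.3])   # hypothetical study effects
se = np.array([0.4, 0.3, 0.6, 0.5])       # their standard errors

w = 1 / se ** 2                            # weight = inverse of the squared standard error
mean = np.sum(w * effect) / np.sum(w)      # fixed-effect weighted mean
se_mean = np.sqrt(1 / np.sum(w))
z90 = stats.norm.ppf(0.95)
print(f"mean = {mean:.2f}, 90% confidence limits "
      f"[{mean - z90 * se_mean:.2f}, {mean + z90 * se_mean:.2f}]")
```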
## Accounting for Heterogeneity
Heterogeneity in a meta-analysis refers to real differences between effect magnitudes, which arise not from sampling
variation but from moderation of the effect by differences between studies in
subject characteristics, environmental factors, study design, measurement
techniques, and/or method of analysis. The typical practice of testing for
heterogeneity with the I2 statistic is inadequate: I2 expresses only the percent of the variation in study estimates that is due to real differences, not the magnitude of those differences in the units of the effect. Heterogeneity is properly quantified as the between-study SD provided by a random-effect meta-analysis, which should be presented along with its uncertainty and assessed for magnitude (after doubling, as explained below). Having shown that there is heterogeneity, the meta-analyst should then try to account for it by including study characteristics as moderators in a meta-regression or subgroup analyses.

Of the meta-analyses I reviewed for this article, only that of Soomro et al. (2015), on injury-prevention programs in team sports, included adequate assessment of heterogeneity and subgroup analyses. They did not present the random-effect SD in comprehensible units, nor did they present its uncertainty, but they did provide values of the I2 statistic. Most other meta-analysts performed random-effect analyses and subgroup analyses, but any conclusions they based on the I2 statistic alone are questionable, for the reasons given above.
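For illustration, here is a sketch of quantifying heterogeneity with the common DerSimonian-Laird moment estimator; this is not the mixed-model approach advocated above, but it is sufficient to show the relationship between the between-study SD and I2 (all values hypothetical):

```python
import numpy as np

def between_study_sd(effect, se):
    """DerSimonian-Laird estimate of the between-study SD (tau) and I2."""
    w = 1 / se ** 2
    mean_fixed = np.sum(w * effect) / np.sum(w)
    q = np.sum(w * (effect - mean_fixed) ** 2)   # Cochran's Q
    df = len(effect) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, 100 * (q - df) / q)            # percent of variation due to real differences
    return np.sqrt(tau2), i2

effect = np.array([1.2, 0.8, 1.5, 0.3, 2.0])     # hypothetical study estimates
se = np.array([0.4, 0.3, 0.6, 0.5, 0.4])
tau, i2 = between_study_sd(effect, se)
print(f"tau = {tau:.2f} (assess magnitude as 2 x tau = {2 * tau:.2f}); I2 = {i2:.0f}%")
```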
## Coping with Repeated Measurement
A given study often provides several estimates of an effect that can be included in a meta-analysis, such as effects on males and females, or effects for different doses or time points. Such effects represent repeated measurement on the same study, so the usual meta-analytic mixed model with a single between-study random effect is not appropriate. Meta-analysts in the studies I reviewed attempted to cope with this problem either by treating the estimates as if they came from separate studies (Burden et al., 2015; Chu et al., 2016; Peterson et al., 2011; Salvador et al., 2016; Tomlinson et al., 2015), or by performing subgroup analyses that did not include repeated measurement (Coquart et al., 2016a). The problem with the former approach is that the resulting confidence interval for the overall mean effect is too narrow, while the resulting confidence intervals for any within-study moderators included in a meta-regression are wider than they need be. The problem with the latter approach is that the separately meta-analyzed effects in the subgroups cannot be compared inferentially, because they are not independent. The study of Coquart et al. (2016a) illustrates how wrong conclusions can be reached with an inappropriate analysis. They found similar meta-analyzed mean estimates of VO2max when they performed separate analyses for estimates predicted at perceived exertions of 19 and 20, but as you would expect, VO2max was substantially higher at the higher intensity in those studies where VO2max was predicted at both intensities, and I have little doubt that the right kind of meta-analysis would show that difference clearly.

The correct approach to including and comparing multiple within-study estimates is a repeated-measures meta-regression, achieved by including one or more covariates to account for and estimate the within-study effects, and by including one or more random effects additional to the usual between-study random effect to account for clustering of estimates within studies. I have only ever included a single additional random effect in meta-analyses (Carr et al., 2011; Vandenbogaerde and Hopkins, 2011; Weston et al., 2014), but in future I may use two random effects to account for within-study between-subject clustering (e.g., sex) and within-study within-subject clustering (e.g., multiple doses or time points).
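A minimal sketch of the structure of such a model, using Python's statsmodels in place of Proc Mixed, with hypothetical data; note that a full meta-regression would also fix the residual variances at the squared standard errors of the estimates, which this simple sketch does not do:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: two estimates (one per sex) from each of four studies
df = pd.DataFrame({
    "study":  [1, 1, 2, 2, 3, 3, 4, 4],
    "sex":    ["M", "F"] * 4,
    "effect": [2.1, 1.5, 1.8, 1.2, 2.6, 1.9, 1.4, 1.0],
})

# The sex covariate estimates the within-study moderator effect; the random
# intercept for study accounts for clustering of estimates within studies.
model = smf.mixedlm("effect ~ sex", df, groups=df["study"]).fit()
print(model.summary())
```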
## Publication Bias and Outlier Studies
A pervasive tendency for only statistically significant effects to end up in print results in the overestimation of published effects, a phenomenon known as publication bias. Such bias was not an issue for the validity meta-analysis of Coquart et al. (2016a); five of the other meta-analysts did not mention the possibility of publication bias in their effects (Burden et al., 2015; Hume et al., 2015; Josefsson et al., 2014; Tomlinson et al., 2015; Wright et al., 2015), while four (Chu et al., 2016; Peterson et al., 2011; Salvador et al., 2016; Soomro et al., 2015) investigated asymmetry in the funnel-shaped plot of observed effects vs their standard errors, which is a sign of publication bias. There are two problems with this approach and corrections based on it: heterogeneity disrupts the funnel shape, thereby increasing the likelihood of false-negative and false-positive decisions about publication bias, and the approach does not take into account any heterogeneity explained in a meta-regression.

A plot of the values of the study random-effect solution (effectively the study residuals) vs the study standard error solves these problems: publication bias manifests as a tendency for the residuals to be distributed non-uniformly for studies with higher values of the standard error, and repeating the analysis after deleting all such studies reduces or removes the bias (see especially Carr et al., 2011; Vandenbogaerde and Hopkins, 2011). Standardization of the random-effect values converts them to z scores, which also allow for objective identification and elimination of outlier studies. Chu et al. (2016) and Peterson et al. (2011) investigated the change in the magnitude of the meta-analyzed effect following deletion of one or more studies, presumably checking for outliers; this kind of sensitivity analysis is pointless, because the change in magnitude will be smaller with a larger total number of meta-analyzed studies, and there is no associated rationale for eliminating studies. The researchers may have done this kind of analysis simply because it was available in the analysis package Comprehensive Meta-Analysis.
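A sketch of the standardized-residual approach with hypothetical values: the residuals of the study estimates about the random-effect mean are standardized to z scores and listed against the standard errors:

```python
import numpy as np

effect = np.array([1.2, 0.8, 1.5, 0.3, 2.0, 2.8])   # hypothetical study estimates
se = np.array([0.4, 0.3, 0.6, 0.5, 0.4, 0.9])
tau2 = 0.2                                # between-study variance from the random-effect model

w = 1 / (se ** 2 + tau2)                  # random-effect weights
mean = np.sum(w * effect) / np.sum(w)
z = (effect - mean) / np.sqrt(se ** 2 + tau2)   # standardized residuals (z scores)

for e, s, zi in sorted(zip(effect, se, z), key=lambda t: t[1]):
    print(f"effect {e:.1f}  SE {s:.2f}  z {zi:+.2f}")
# Publication bias would appear as mostly positive residuals at the largest SEs;
# a large |z| (e.g., >2.5, an arbitrary threshold) flags an outlier study.
```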
## Interpreting Magnitudes
A shortcoming with several of the meta-analyses is inadequate attention to the clinical or practical importance of the
meta-analyzed effects, let alone that of their moderators and heterogeneity.
Coquart et al. (2016a) made no assessment of the
implications of the magnitude of the meta-analyzed validity statistics for
assessment of individual patients. Some authors apparently assumed that
statistical significance automatically confers importance on the effect,
without considering the magnitude of the observed effect or its confidence
limits (Chu et al.,
2016; Wright et al., 2015).
Others used various scales to interpret standardized differences in means,
without converting them back into real units to consider whether the
standardized magnitude could represent an important clinical or practical
effect in all or any populations (Burden et
al., 2015; Salvador et al., 2016; Tomlinson et al., 2015).
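A sketch of converting a meta-analyzed standardized effect back into raw units with an appropriately averaged between-subject SD (via weighted variances) and assessing its magnitude; the thresholds are those of the scale in Hopkins et al. (2009), and all study values are hypothetical:

```python
import numpy as np

def magnitude(d):
    """Qualitative magnitude of a standardized difference (scale of Hopkins et al., 2009)."""
    for threshold, label in [(4.0, "extremely large"), (2.0, "very large"),
                             (1.2, "large"), (0.6, "moderate"), (0.2, "small")]:
        if abs(d) >= threshold:
            return label
    return "trivial"

sd = np.array([8.0, 10.0, 12.0])   # hypothetical baseline SDs from selected studies
n = np.array([25, 40, 30])
avg_sd = np.sqrt(np.sum((n - 1) * sd ** 2) / np.sum(n - 1))   # average via weighted variances

d = 0.45                           # hypothetical meta-analyzed standardized effect
print(f"{d} SD = {d * avg_sd:.1f} raw units ({magnitude(d)})")
```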
I support standardization for assessing differences or changes in means when there is no real-world scale, but the standardization should be done after the meta-analysis, using a between-subject SD appropriately averaged from selected studies (via weighted variances). As already noted, the SD representing heterogeneity should be doubled (or squared, for factor SDs) before assessing its magnitude with the same scale as that for assessing the mean effect. Effects of moderators expressed as correlation coefficients (Salvador et al., 2016), "beta" coefficients (Burden et al., 2015; Peterson et al., 2011) and p values (Chu et al., 2016) do not communicate magnitude to the reader. Moderators representing numeric subject characteristics (e.g., mean age) that have been included as simple linear predictors should be evaluated for a difference in the characteristic equal to two between-subject SDs, appropriately averaged from selected studies (again, via weighted variances). A suspected non-linear moderator can be coded as quantiles or other subgroup levels and evaluated accordingly. Almost all the meta-analysts showed some reliance on p values to make conclusions, a practice that in my opinion is particularly inappropriate for meta-analyses. Inferences about the magnitude of all statistics should be based one way or another on the uncertainty represented by the magnitude of the lower and upper confidence limits (Hopkins et al., 2009; Hopkins and Batterham, 2016).

## Conclusion
I rate failure to use a random-effect meta-analysis and failure to properly account for heterogeneity in a random-effect meta-analysis as the most serious flaws, because heterogeneity combined with the mean effect determines a probabilistic range in the clinical or practical importance of the effect in a specific setting. As a researcher or practitioner, you should be cautious about implementing the findings of a meta-analysis lacking a full account of heterogeneity; use it primarily as a convenient reference list to find studies from settings similar to your own, and use those studies to draw your own conclusions about the magnitude of the effect in your setting. You should also be skeptical about any meta-analyzed differences or changes in means based on standardization: there is a good chance the authors will have made major errors, and even when done correctly, standardization results in artefactual heterogeneity. Authors need to provide more documentation about these and the other error-prone aspects of meta-analysis I have identified here, if readers are to have more trust in the findings. When I sent the first submitted version of this article to the authors of the meta-analyses for comment, one of them asked me to revise the article into a full meta-analysis of all recent meta-analyses. Such an article would represent a more even-handed critique, given that these meta-analysts would likely find themselves in the company of the authors of most other recent meta-analyses. A longer article will be justified if the quality of meta-analyses in our subject areas does not improve in the next year or two.
## References
Hopkins WG (2004). An introduction to meta-analysis. Sportscience 8, 20-24

Hopkins WG (2015). Individual responses made easy. Journal of Applied Physiology 118, 1444-1446

Josefsson T, Lindwall M, Archer T (2014). Physical exercise intervention in depressive disorders: meta-analysis and systematic review. Scandinavian Journal of Medicine and Science in Sports 24, 259-272

Published Jan 2018