Explaining Heterogeneity in Direct Replications: Just Methodological Artefacts?

From a meta-scientific perspective, heterogeneity in effect sizes is a highly relevant topic for several reasons. Such between-study variation is one of the candidate causes of replication problems. It affects the statistical power of primary studies and impedes the detection of publication bias and questionable research practices (the other potential main cause of replication problems). Finally, unexplained variation in effect sizes may be considered an indication of the state of theory development in a field. Thus, it is little surprise that the replication crisis has sparked great interest in the heterogeneity of psychological effects. Multi-center replication studies have made available data that make it possible to investigate the heterogeneity of psychological effects in direct replications empirically and on a larger scale. First analyses of these data suggest that heterogeneity is smaller in direct than in conceptual replications, but still occurs with considerable frequency.

This project is based on two observations regarding previous assessments of heterogeneity in psychological effects:

  • These assessments may have missed important information, as they focused almost entirely on heterogeneity in standardized effect sizes.
  • They may be subject to several methodological artefacts widely discussed in the meta-analytical literature and, therefore, be biased.

A focus on standardized effect sizes overlooks the fact that heterogeneity in these statistics may be due not only to variation in mean differences (that is, actual treatment effects) but also to variation in error variances. Heterogeneity in error variances may be caused by the use of convenience samples in replication studies and may therefore be theoretically entirely uninformative. Moreover, exclusively analyzing standardized effect sizes may mask the fact that even the control group means (base levels) are heterogeneous. If so, expecting homogeneous effects requires the additional, rather strict assumption that base levels and mean differences are independent. Methodological artefacts that may have affected previous heterogeneity assessments include variation in measurement reliabilities and range restrictions in independent and dependent variables. Finally, earlier analyses treated Likert-scaled data as continuous. This model misspecification may have biased meta-analytic estimates of effect size and heterogeneity.
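
The first point can be illustrated with a minimal sketch. All numbers are hypothetical and chosen for illustration only; they are not taken from any replication project, and the function name is our own. Every site has the same raw mean difference, yet the standardized effect sizes differ because the error standard deviations differ.

```python
def standardized_effect(mean_diff, error_sd):
    """Standardized mean difference: raw mean difference divided by the error SD."""
    return mean_diff / error_sd

# Hypothetical sites (assumed values): identical raw treatment effect,
# but different error standard deviations, e.g. due to convenience samples.
raw_mean_diff = 0.5
error_sds = [0.8, 1.0, 1.25]

d_values = [standardized_effect(raw_mean_diff, sd) for sd in error_sds]

# The raw effect is constant across sites by construction,
# whereas the standardized effects vary from site to site.
spread_d = max(d_values) - min(d_values)

for sd, d in zip(error_sds, d_values):
    print(f"error SD = {sd:.2f}  ->  d = {d:.3f}")
print(f"range of d across sites = {spread_d:.3f}")
```

In this constructed case, any heterogeneity found in the standardized effect sizes would stem solely from the error variances and would say nothing about variation in the treatment effect itself.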

We aim to reanalyze all available data from multi-center replication projects in psychology. In these re-analyses, we will assess heterogeneity not only in standardized and unstandardized effect sizes, but also in all of their components (group means and error variances). We will conduct multilevel mixed-effects meta-analyses to assess the relationship between true control group means and true mean differences. Wherever possible, we will determine measurement and treatment reliabilities in direct replications and estimate their heterogeneity. In analyzing standardized effect sizes, we will apply corrections for unreliability and range restrictions. Likert-scaled data will be analyzed with more appropriate ordinal regression techniques and corresponding meta-analytic methods. Taken together, these efforts should provide a considerably more valid assessment of the heterogeneity in psychological effects. They should also reveal whether, and to what degree, various methodological and statistical problems actually biased previous heterogeneity estimates. Additionally, they may uncover theoretically important relationships among the heterogeneities of the components of effect sizes.
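
As a rough illustration of why corrections for unreliability matter for heterogeneity, the sketch below applies the classical attenuation correction for measurement error in the dependent variable (observed effect divided by the square root of the reliability). This is one common textbook correction and is assumed here for illustration; it is not necessarily the procedure the project will use. The function name, the true effect, and the reliabilities are hypothetical.

```python
import math

def correct_for_unreliability(d_observed, reliability):
    """Classical attenuation correction for unreliability in the dependent variable."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must lie in the interval (0, 1]")
    return d_observed / math.sqrt(reliability)

# Hypothetical scenario: one common true standardized effect, but the
# outcome is measured with different reliability at each site.
true_d = 0.5
reliabilities = [0.64, 0.81, 1.0]

# Measurement error attenuates the observed effect by sqrt(reliability).
observed = [true_d * math.sqrt(r) for r in reliabilities]

# Correcting each site's estimate recovers the common true effect.
corrected = [correct_for_unreliability(d, r) for d, r in zip(observed, reliabilities)]

for r, d_obs, d_cor in zip(reliabilities, observed, corrected):
    print(f"reliability = {r:.2f}: observed d = {d_obs:.3f}, corrected d = {d_cor:.3f}")
```

In this constructed case, the observed effects appear heterogeneous although the underlying effect is identical at every site; the apparent heterogeneity disappears once the estimates are corrected.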

In the second part of the project, we will use the results of the re-analyses as a starting point for, and a constraint on, simulation studies. The aim here is to investigate how the aforementioned methodological and statistical artefacts can affect heterogeneity estimates of effect sizes under various conditions. The results should provide guidelines for the analysis of heterogeneity.

