# Table 1 Methodological criteria and rationale

Criteria Rationale
Stage 1:
1) Rationale:
a) Was the analysis a-priori (planned in the protocol rather than post-hoc)? There is a need for a theoretical basis for the choice of measurement to be tested as a moderator or mediator. Ideally, the planned analysis is a-priori.
b) Was selection of factors for analysis theory/evidence driven? "Ideally, these hypotheses are initially theory driven, then empirically confirmed, and finally clinically evaluated to establish their real-world existence." (Nicholson et al., 2005)
2) Method:
a) Was there an equal distribution of moderators between groups at baseline? Ideally, a-priori stratification in design (Lipchick et al., 2005, Headache).
b) Were moderators measured prior to randomisation? "...a hypothesized moderator must be measured prior to randomization" (Nicholson et al., 2005, page 516)
3) Power:
Do authors report a power analysis for the moderator effect (a-priori or post-hoc, but using an a-priori ES, not the observed one)? "In planning a test for the effect of a moderator, the researcher must pre-specify the size of a moderator effect that is deemed to be substantively important. The power calculation determines whether the statistical test proposed to detect the moderator effect has sufficient power... Retrospectively, power analyses may be used to evaluate moderator analyses that have already been conducted, by providing information about how sensitive the statistical tests were to detect substantively meaningful effects. If tests for moderator effects are found to have low power, statistically non-significant effects of moderators do not provide strong evidence for ruling out moderator effects. Alternatively, if a test for a moderator is found to have very high statistical power to detect the smallest substantively meaningful effect, then failure to detect that effect is evidence that moderator effects are not likely to be large enough to be substantively meaningful.
In the retrospective application of power analysis, as in the prospective one, the researcher must pre-specify the size of a moderator effect that is deemed to be substantively important. That is, the moderator effect size must be determined a-priori. In particular, the observed effect of the moderator variable should never be used to compute statistical power" (extracted directly from Hedges & Pigott, 2004, page 427).
Sufficient power to detect small/moderate effects in moderator analysis has been defined as requiring a sample size at least four times that needed for the main effect (based on the fact that most main effects are in this order of magnitude).
Was the sample size adequate for the moderator analysis (at least four-fold the required sample size for the main treatment effect in the lowest sub-group for the moderator factor)? "The ability of formal interaction tests to (correctly) identify sub-group effects improved as the size of the interaction increased relative to the overall treatment effect. When the size of the interaction was twice the overall effect or greater, the interaction tests had at least the same power as the overall treatment effect. However, power was considerably reduced for smaller interactions, which are much more likely in practice. The inflation factor required to increase the sample size to enable detection of the interaction with the same power as the overall effect varied with the size of the interaction. For an interaction of the same magnitude as the overall effect, the inflation factor was 4" (HTA 2001; 5(33)).
If not, were there at least 20 people in the smallest sub-group of the moderator? An inherent problem is that power in RCTs is almost always calculated based on the main effect of treatment. An arbitrary cut-point of at least 10 in the lowest arm of completed treatment has been used by other systematic reviews (Eccleston et al., updated for Cochrane, 2009). We have included this arbitrary criterion to ensure retention of studies that were under-powered in isolation but might still add value to meta-analyses. However, with sub-groups below 20, we considered the study to be unlikely to be informative.
Have authors employed analysis to compensate for insufficient power (e.g. bootstrapping techniques)? This criterion was included because some researchers attempt such analysis, and its value is debatable.
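As a worked illustration of the power criteria above, the following minimal Python sketch compares the total sample size needed for a main treatment effect with that needed for a treatment-by-moderator interaction of the same magnitude. It assumes a two-arm trial with equal allocation, a binary moderator that splits the sample into two equal sub-groups, a continuous outcome with common variance, and a normal approximation. The function names and the effect size of 0.5 are illustrative assumptions and do not come from the cited sources.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_total_main_effect(d, alpha=0.05, power=0.80):
    """Total N to detect a standardised main effect d (two equal arms)."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    # Variance of the difference in arm means is 4*sigma^2/N.
    return ceil(4 * z_sum ** 2 / d ** 2)


def n_total_interaction(d, alpha=0.05, power=0.80):
    """Total N to detect an interaction of standardised size d (2x2, equal cells)."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    # Variance of the difference of differences is 16*sigma^2/N.
    return ceil(16 * z_sum ** 2 / d ** 2)


def power_interaction(n_total, d, alpha=0.05):
    """Approximate power for a pre-specified interaction size d.

    d must be chosen a-priori; the observed moderator effect should not be
    plugged in here. The negligible opposite tail is ignored.
    """
    se = 4 / sqrt(n_total)  # standard error in standardised units
    return Z.cdf(d / se - Z.inv_cdf(1 - alpha / 2))


n_main = n_total_main_effect(0.5)
n_int = n_total_interaction(0.5)
power_at_main_n = power_interaction(n_main, 0.5)
print(n_main, n_int, round(n_int / n_main, 2), round(power_at_main_n, 2))
```

Under these assumptions the interaction requires about four times the sample of the main effect, which is consistent with the inflation factor quoted from the HTA report, and a trial sized only for the main effect has low power for an interaction of equal size.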
4) Correction for multiple comparisons:
a) Was the regression significant at P < 0.05, or (if more than three comparisons) corrected or significance adjusted to P < 0.01? In the absence of a-priori stratification, studies often explore several sub-groups, and the risk of type I error is considerably increased. The adjustment of P values has been used in RCT analysis (Turner, 2007).
b) Did the authors explore residual variances of interactions if carrying out multiple two-way interactions?
Multiple two-way interactions: In many studies, researchers evaluate two or more moderators in a single analysis. For example, a regression model might include terms for XZ, WX, and WZ. Researchers sometimes make inferences about relative importance when, say, one of the three interaction terms is statistically significant and the others are not. However, such inferences require equivalent statistical power for each test. It might well be the case that the interaction terms are equivalent in terms of the sizes of their partial regression coefficients but that there are differences in statistical reliability due entirely to differences in the residual variances of the interaction terms. Thus when examining multiple two-way interactions, one ought to compare the residual variances of those interactions before making any inferences about their relative importances (McClelland & Judd, 1993, page 385).
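The significance rule in criterion 4a can be sketched in a few lines of Python. The threshold function below encodes the rule as stated in the table (P < 0.05, or P < 0.01 when more than three comparisons are made). The family-wise error calculation and the Bonferroni threshold are added only for comparison: they assume independent tests and are not part of the criteria. All function names are illustrative.

```python
def significance_threshold(n_comparisons):
    """Threshold from criterion 4a: 0.05, or 0.01 if more than three comparisons."""
    return 0.01 if n_comparisons > 3 else 0.05


def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


def bonferroni_threshold(n_comparisons, familywise_alpha=0.05):
    """Per-test threshold under a Bonferroni correction (shown for comparison only)."""
    return familywise_alpha / n_comparisons


k = 5  # hypothetical number of sub-group comparisons
print(significance_threshold(k))
print(round(familywise_error_rate(0.05, k), 3))  # uncorrected
print(round(familywise_error_rate(0.01, k), 3))  # with the adjusted threshold
print(bonferroni_threshold(k))
```

With five independent comparisons, testing each at 0.05 gives a family-wise type I error rate above 20%, whereas the adjusted threshold of 0.01 keeps it close to 5%.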
5) Measurement validity & measurement error: Was measurement of baseline and process factors reliable and valid (from published information) in target population? Measurement error considerably inhibits the power of studies to detect interactions.
a) Is there evidence that the measurement error of the instrument is likely to be sufficiently small to detect the differences between sub-groups that are likely to be important? "Estimates of the reliability of measures should be reported. Measurement unreliability in M would be expected to lead to underestimating b' and thus ab', and overestimating c' (R. Baron & Kenny, 1986; Hoyle & Kenny, 1999)." (Gelfand et al., 2009, p. 168)
b) Did the authors comment on measurement validity in reference to construct validity, face validity etc.? "Trafimow (2006) described a concern for the construct validity of measures that is roughly analogous to that raised by measurement unreliability but for which there is currently no means of correction." (Gelfand et al., 2009, p. 169)
6) Analysis:
a) Does the analysis contain an explicit test of the interaction between moderator and treatment (e.g. regression)? "Sub-group analyses should always be based on formal tests of interaction although even these should be interpreted with caution." (HTA, 2001)
b) Was there adjustment for other baseline factors?
c) Is there an explicit presentation of the differences in outcome between baseline sub-groups (e.g. standardised mean difference between groups, Cohen's d)?
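Criteria 6a and 6c can be illustrated together with a minimal Python sketch. It assumes a binary moderator, two trial arms, and a continuous outcome, so that the interaction is the difference between the sub-group treatment effects; this estimate equals the treatment-by-moderator coefficient of a regression with both main effects and their product term. The p-value uses a normal approximation, which is optimistic for samples as small as this one. The data and all names are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(treated, control):
    """Standardised mean difference using the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * variance(treated)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


def interaction_test(cells):
    """Difference of sub-group treatment effects, its SE, and a two-sided p-value.

    cells maps (moderator_level, arm) to a list of outcomes, with
    moderator_level in {0, 1} and arm in {"treat", "control"}.
    """
    effect_1 = mean(cells[(1, "treat")]) - mean(cells[(1, "control")])
    effect_0 = mean(cells[(0, "treat")]) - mean(cells[(0, "control")])
    estimate = effect_1 - effect_0
    df = sum(len(v) - 1 for v in cells.values())
    pooled_var = sum((len(v) - 1) * variance(v) for v in cells.values()) / df
    se = sqrt(pooled_var * sum(1 / len(v) for v in cells.values()))
    p = 2 * (1 - NormalDist().cdf(abs(estimate / se)))  # normal approximation
    return estimate, se, p


# Invented outcome data for illustration only.
cells = {
    (0, "control"): [10, 12, 11, 13],
    (0, "treat"): [11, 13, 12, 14],
    (1, "control"): [10, 12, 11, 13],
    (1, "treat"): [14, 16, 15, 17],
}
estimate, se, p = interaction_test(cells)
d_low = cohens_d(cells[(0, "treat")], cells[(0, "control")])
d_high = cohens_d(cells[(1, "treat")], cells[(1, "control")])
print(estimate, round(se, 3), round(p, 3), round(d_low, 2), round(d_high, 2))
```

Reporting the sub-group effect sizes alongside the formal interaction test keeps the two questions separate: whether the treatment effect differs between sub-groups, and how large the effect is within each.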
Stage 2:
1. Differences between sub-groups should be clinically plausible. "Selection of characteristics should be motivated by biological and clinical hypotheses, ideally supported by evidence from sources other than the included studies. Subgroup analyses using characteristics that are implausible or clinically irrelevant are not likely to be useful and should be avoided." Section 9.6.5.4
2. Reporting of sub-group analysis is only justified in cases where the magnitude of the difference is large enough to support different recommendations for different sub-groups. "If the magnitude of a difference between subgroups will not result in different recommendations for different subgroups, then it may be better to present only the overall analysis results." Section 9.6.6
3. Within-study comparisons are more reliable than between-study comparisons. "For patient and intervention characteristics, differences in subgroups that are observed within studies are more reliable than analyses of subsets of studies. If such within-study relationships are replicated across studies then this adds confidence to the findings." Section 9.6.6
4. At least ten observations should be available for each characteristic explored in sub-group analysis (i.e., ten studies in a meta-analysis). "It is very unlikely that an investigation of heterogeneity will produce useful findings unless there is a substantial number of studies. It is worth noting the typical advice for undertaking simple regression analyses: that at least ten observations (i.e. ten studies in a meta-analysis) should be available for each characteristic modelled. However, even this will be too few when the covariates are unevenly distributed." Section 9.6.5.1