 Correspondence
 Open Access
 Open Peer Review
 Published:
Thresholds for statistical and clinical significance in systematic reviews with metaanalytic methods
BMC Medical Research Methodology volume 14, Article number: 120 (2014)
Abstract
Background
Thresholds for statistical significance when assessing metaanalysis results are being insufficiently demonstrated by traditional 95% confidence intervals and Pvalues. Assessment of intervention effects in systematic reviews with metaanalysis deserves greater rigour.
Methods
Methodologies for assessing statistical and clinical significance of intervention effects in systematic reviews were considered. Balancing simplicity and comprehensiveness, an operational procedure was developed, based mainly on The Cochrane Collaboration methodology and the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) guidelines.
Results
We propose an eightstep procedure for better validation of metaanalytic results in systematic reviews (1) Obtain the 95% confidence intervals and the Pvalues from both fixedeffect and randomeffects metaanalyses and report the most conservative results as the main results. (2) Explore the reasons behind substantial statistical heterogeneity using subgroup and sensitivity analyses (see step 6). (3) To take account of problems with multiplicity adjust the thresholds for significance according to the number of primary outcomes. (4) Calculate required information sizes (≈ the a priori required number of participants for a metaanalysis to be conclusive) for all outcomes and analyse each outcome with trial sequential analysis. Report whether the trial sequential monitoring boundaries for benefit, harm, or futility are crossed. (5) Calculate Bayes factors for all primary outcomes. (6) Use subgroup analyses and sensitivity analyses to assess the potential impact of bias on the review results. (7) Assess the risk of publication bias. (8) Assess the clinical significance of the statistically significant review results.
Conclusions
If followed, the proposed eightstep procedure will increase the validity of assessments of intervention effects in systematic reviews of randomised clinical trials.
Introduction
Systematic reviews summarise the results from randomised clinical trials. Metaanalysis is the main statistical method used in systematic reviews to analyse pooled results of trials [1, 2]. Some claim that results of systematic reviews should be considered hypothesisgenerating and should primarily serve the purpose of designing future randomised clinical trials [3–5]. Others consider systematic reviews with metaanalysis the highest level of evidence assessing the effects of healthcare interventions [1, 2]. Studies have clearly shown that results of metaanalyses of trials with low risk of bias are more reliable than results of single large trials [6–11]. Inthout and colleagues quantified the error rates for evaluations based on single conventionally powered trials (80% or 90% power) compared to evaluations based on randomeffects metaanalyses of a series of smaller trials [6]. When a treatment was assumed to have no effect but heterogeneity was present, the error rates for a single trial were increased more than 10fold above the nominal rate [6]. Conversely, for metaanalyses on a series of trials, the error rates were correct [6]. When selective publication was present, the error rates were always increased, but they still tended to be lower for a series of trials than in a single trial [6]. It also appears intuitively evident that inclusion of all used data from all randomised clinical trials ever conducted shall be treated as a higher level of evidence compared to the data from only a single trial [2, 11–15]. We acknowledge that a systematic review with metaanalysis cannot be conducted with the same scientific cogency as a randomised clinical trial with predefined highquality methodology addressing an a priori and quantitatively hypothesised intervention effect. Systematic review authors often know some of the eligible randomised clinical trials before they prepare their protocol for the systematic review, and hence, the review methodology is partly data driven. Nevertheless, understanding the inherent methodological limitations of a systematic review should lead to minimisation of these methodological limitations and optimisation of the remaining review methodology, which is the objective of this paper.
We recently described an operational fivestep procedure for valid assessment of statistical and clinical significance in a single randomised clinical trial [12]. We will now, in a comparable manner, describe an eightstep procedure for a more valid assessment of results of systematic reviews of randomised clinical trials. Our procedure is based on and designed to be an extension to The Cochrane Collaboration Handbook and the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) principles [13, 16–19]. The eightstep procedure can be used as part of a planned systematic review methodology or can be used to assess the validity of results from already published systematic reviews [20].
The following eight sections of the manuscript will correspond to each step of the proposed procedure.
Methods
Step 1: metaanalysis, the 95% confidence interval, and the Pvalue
In a metaanalysis, a summary statistic is calculated for each included trial, describing the observed intervention effect [13]. Then, an aggregated intervention effect estimate is calculated as a weighted average of the intervention effects estimated from the individual trials [13]. Review authors should always report both the confidence interval and the corresponding exact Pvalue from all metaanalyses. The confidence interval will show the range of uncertainty (considering the chosen threshold for statistical significance) around the aggregated intervention effect estimate. The Pvalue will show the probability of obtaining the observed or even a larger difference in intervention effect (disregarding possible bias) assuming that the null hypothesis is true (the null hypothesis implies that there is no difference in effect between the compared interventions) [12].
In a fixedeffect metaanalysis, the underlying assumption is that all of the included trials estimate the same intervention effect, i.e., differences in observed effects across trials are assumed to be caused by random error (‘play of chance’) [13]. In a randomeffects metaanalysis, the underlying assumption is that the included trials do not estimate the same intervention effects – it is assumed that the estimates of individual trial intervention effects follow a normal or a log normal distribution [13]. The most commonly used randomeffects model is the DerSimonian and Laird model [21]. However, the HartungKnappSidikJonkman randomeffects model assuming a tdistribution of log (RR) (for dichotomous outcomes) seems to be a more valid metaanalysis method [22]. It is often likely that a given intervention will have different effects across the included trials depending on different forms of the interventions, different definitions of the outcomes, different types of included participants, etc. The randomeffects model assumption will, therefore, often be more realistic than the fixedeffect model assumption [13]. If there is absence of statistical heterogeneity (the between trial variance of the estimated intervention effects is close to zero [23]), then the fixedeffect and the randomeffects models will show identical results [13]. If there is substantial statistical heterogeneity, the fixedeffect metaanalysis will, in some circumstances, show erroneous results because the between trial variance is not appropriately accounted for. In such a case, the randomeffects metaanalysis result should be regarded as the main result. On the other hand, if one or two trials accounts for approximately 80% or more of the total weight in a fixedeffect metaanalysis, then the randomeffects metaanalysis might show erroneous results because the larger trials with the greatest precision are inappropriately downweighted [24]. In such a case, the fixedeffect metaanalysis result should be regarded as the main result. We recommend always reporting results from both fixedeffect and randomeffects metaanalyses. If the fixedeffect and the randomeffects metaanalyses show different results, then the most conservative result (the analysis with the highest Pvalue) should be chosen as the main result [13]. Choosing the most conservative result will take account of the mentioned pitfalls of the two analyses [13]. Substantial discrepancies between the results of the two methods should be reported and discussed thoroughly (see step 2).
Step 2: investigating statistical and clinical heterogeneity
Implications of clinical and statistical heterogeneity should always be considered when metaanalyses are conducted [13, 25, 26]. Substantial statistical heterogeneity may be identified by visual inspection of forest plots and by the calculated statistical heterogeneity (for example, I^{2} or D^{2}) [13, 25, 26]. It must be noted that the statistical heterogeneity (both I^{2} and D^{2}) will increase by including trials with large sample sizes because the variance of the trials’ intervention effects will decrease with the increased sample size and event size. As a result, the ratio of the between trial variance to the total variance will also increase. In other words, small and clinically irrelevant intervention effect differences across trials might lead to significant statistical heterogeneity when confidence intervals of the trial intervention effects are narrow [27]. Statistical heterogeneity should always be interpreted with caution but especially when analysing large sample sizes [21]. Underlying reasons behind significant statistical heterogeneity in metaanalyses should be investigated by assessing trial characteristics in subgroup analyses and sensitivity analyses (see step 6), and it should be reconsidered if all of the included trials should be included in the analyses. For example, if the statistical heterogeneity seems to be caused by different forms of the intervention or a different assessment procedure, then the need for excluding some of the trials from the main analysis should be considered. If any trial is excluded from any analysis, then the reasons for excluding the trial should be clearly reported in the manuscript.
Post hocanalyses
Ideally all analyses in a systematic review should be planned at the protocol stage, but post hoc analyses might be warranted if unexpected clinical or statistical heterogeneity is identified during the analysis of the review results. Nevertheless, post hoc analyses should always be interpreted with great caution and it should be made very clear, which analyses were predefined in the published protocol and which were not. Post hoc analysis should be regarded as exploratory and hypotheses generating.
Step 3: problems with multiplicity due to multiple outcomes
The overall risk of falsely rejecting the null hypothesis for at least one outcome (the familywise error rate) will increase with the number of outcome comparisons [28, 29]. Problems with multiplicity in systematic reviews have major implications for the interpretation of the confidence intervals and the Pvalues [29, 30] – and problems with multiplicity are often not accounted for in systematic reviews [28, 29, 31]. For example, thresholds for significance in metaanalyses are rarely adjusted if more than one primary outcome is used [13].
Most systematic reviews will include multiple outcome comparisons, and if review authors are free to choose and highlight single results among the many comparisons, there will be an increased risk of false declaration on the effectiveness of an assessed intervention. Data driven post hoc analyses will be avoided if the review methodology is clearly predefined and not changed during the analysis of the review results. A straightforward way to deal with some of the multiplicity problems is to publish a protocol before the literature search begins (for example, at PROSPERO (http://www.crd.york.ac.uk/PROSPERO/)) [13, 31–33]. In the protocol, the statistical methodology should be described in detail, including a clear definition of the primary, secondary, and exploratory outcomes [13, 31–33]. Proper adjustments according to problems with multiplicity (see below) should be estimated based on the outcome hierarchy specified in the protocol. The main conclusions of the review ought to be based on the results on the primary outcome/s. Hence, adjustments due to problems with multiplicity may also be limited to the primary outcome/s. This will in itself limit the risk of type I error and will make the threshold adjustments simpler and practically feasible. If an outcome is assessed a multiple number of times, then either the thresholds for significance should be adjusted accordingly or the time point of primary interest should be prespecified.
A systematic review should summarise all available evidence for a given medical intervention and choosing only one primary outcome will often be too restrictive. In advance, it will often be unknown, which outcomes the eligible trials have used and it is important to assess both beneficial and harmful intervention effects [5]. It is, therefore, often advisable to use more than one patientrelevant primary outcome in a systematic review. The Cochrane Collaboration recommends using up to three primary outcomes – for example, allcause mortality, serious adverse events, and quality of life [13]. The use of more than one primary outcome (coprimary outcomes) necessitates adjustments of the thresholds for significance because of problems with multiplicity [34].
Different statistical methods have been proposed to adjust confidence intervals and Pvalues when multiple outcome comparisons are used [12]. Most adjustment methods have focused on adjustments of the Pvalue threshold, but adjusted confidence intervals can often be calculated based on an adjusted Pvalue and an effect estimate [12, 35, 36]. There is an extensive statistical literature about problems with multiplicity in observational studies and randomised clinical trials [29], but problems with multiplicity in systematic reviews have received limited attention [29, 31]. Some methods have been developed to deal with problems with multiplicity in systematic reviews [37–39], but no simple and completely satisfactory solution to the problem of multiple comparisons in systematic reviews has been developed yet [29, 31].
The Bonferroni procedure divides the specified Pvalue threshold (for example, 0.05) with the number of outcome comparisons and this method can be used to control the familywise error rate [34]. If it is plausible that there is no correlation between multiple primary outcomes, then Bonferroni adjustment may be used. However, the Bonferroni procedure is mostly a too conservative adjustment method as most outcomes are interdependent (for example, all cause mortality and serious adverse events will often be positively correlated outcomes) [34]. To calculate more precise adjustments of the thresholds for significance in systematic reviews, an estimation of the correlation between the coprimary outcomes will be needed. Such a correlation will often be unknown and erroneous assumptions about correlations might lead to erroneous results. Because the ‘true’ multiplicity adjusted thresholds for significance lie somewhere between the unadjusted threshold (for example, 0.05) and the Bonferroni adjusted threshold, we suggest a pragmatic approach. We suggest dividing the prespecified Pvalue threshold with the value halfway between 1 (no adjustment) and the number of primary outcome comparisons (Bonferroni adjustment). This will result in a multiplicity adjusted threshold using 1 primary outcome = 0.05, 2 primary outcomes = 0.033, and 3 primary outcomes = 0.025.
The use of full Bayesian statistics can account for problems of multiplicity due to multiple testing [24, 40, 41]. However, this use would imply integration of complicated models and software for analysing the review results [24, 40, 41] and would entail the need for specifying multiple prior distributions, which has its own problems [12, 24, 40, 41].
Step 4: trial sequential analysis
Required information size
If the sample size has not been reached in a randomised clinical trial, then the threshold for statistical significance ought to be adjusted [12, 42]. A similar methodology should apply to a metaanalysis that does not reach a required information size (≈ the a priori required number of participants for a metaanalysis to be conclusive) [25, 26]. The 95% confidence interval may show the range of uncertainty of the observed intervention effect estimate and may improve the interpretability of metaanalysis result. However, reporting results from metaanalyses without linking the confidence intervals, the intervention effect estimates, and the Pvalues to an estimated required information size is erroneous for a number of reasons:

Most review authors do not assess if an accrued information size is sufficient or not to detect or reject a given intervention effect, and this is problematic as nearly all metaanalyses in systematic reviews are underpowered (the metaanalyses do not have enough information to reject or accept the null hypothesis when the null hypothesis is false) [25, 26, 43]. Without an estimation of a required information size it is difficult to interpret an apparent neutral (no significant difference) metaanalysis result – it becomes unclear whether a neutral metaanalysis result indicates that there is no difference in effect between the compared interventions, or if the result indicates that the information size is too small to demonstrate or discard the anticipated intervention effect [11]. Furthermore, metaanalyses with too small accrued information size (sparse data) have an increased risk of either overestimating or underestimating the effect size and variance [25, 26, 44–46]. Therefore, without a required information size it will also be difficult to interpret a metaanalysis result indicating a difference in effect, i.e., an observed difference in effect might be caused by the random error due to the low information size.

There should be a low risk of turning a significant (for example, a Pvalue below 0.05) metaanalysis result into an insignificant result when future trials are included – ‘unstable’ metaanalysis results should be avoided for obvious reasons. It has been shown that a statistically significant metaanalysis based on too low information sizes, often at a later time point, will change from statistically significant to statistically nonsignificant [25, 26, 47].

The Cochrane Collaboration recommends that all systematic reviews are updated at least every second year [13], and there might be a lower chance of a review being updated if a metaanalysis shows significant results. If systematic review authors are allowed to assess statistical significance each time the review is updated without adjusting the level of statistical significance, premature declaration of effects will ensue and the risk of falsely rejecting the null hypothesis will increase (see also step 3 for a description of other problems with multiplicity) [14, 25, 26].
It is, therefore, of importance to estimate a required information size before conducting a metaanalysis. To estimate a required information size, it is necessary:

To estimate an anticipated intervention effect, i.e., to define a hypothesis alternative to the null hypothesis (for example, a mean difference, an odds ratio, or a hazard ratio [13, 48]). This hypothesised difference in effect between the compared intervention groups should be based on the most realistic intervention effect as suggested by prior evidence (for example, results from former randomised clinical trials, or known effects from other similar interventions [49–51]). As supplementary analyses, the point estimate from the metaanalysis on the outcome and the limit of the 95% confidence interval closest to no effect can be used as anticipated intervention effects.

To estimate a variance of the anticipated difference in intervention effect (for example, a standard deviation for a continuous outcome or a proportion of control participants with an event for a dichotomous outcome). Ideally this variance should be based on results from empirical data, for example, former systematic reviews, former randomised clinical trials, or large observational studies.

To estimate a variance of the intervention effect estimates between trials (D^{2} = the percentage that the betweentrial variability constitutes of the sum of the betweentrial variability and a sampling error estimate considering the required information size) [25, 26, 52]. If a randomeffects metaanalysis is chosen as the one of primary interest, it may be predefined that the observed between trial variance (empirical variance) will be used to calculate the required information size [47, 52]. However, if the observed heterogeneity is zero, it may not be unwise to use a heterogeneity of 25% [47]. If a fixedeffect metaanalysis is chosen as the one of primary interest, then the variance between trials should be zero (see step 1).

To decide on an acceptable risk of falsely rejecting the null hypothesis (alpha or type I error). As we have described in step 3, to adjust the risk of type I error according to the number of outcome comparisons we suggest dividing a prespecified Pvalue threshold with the value halfway between 1 and the number of primary outcome comparisons.

To decide on an acceptable risk of falsely confirming the null hypothesis (beta or type II error). We recommend 10% or 20%.
Required information sizes can easily be calculated using the program trial sequential analysis which together with a user manual can be downloaded for free at our website (http://www.ctu.dk/tsa/) [47]. The diversityadjusted required information size should be reported for all primary and secondary outcomes in the protocol.
Review authors might be tempted to calculate the diversityadjusted required information size based on unrealistically large anticipated intervention effects – large anticipated intervention effects lead to small required information sizes and the thresholds for significance will be less strict after the information size has been reached [25, 26]. This problematic incentive for using too large anticipated intervention effects to reduce the required information size might be counterbalanced by the use of a simple Bayes factor (see step 5).
Methods to adjust thresholds for significance if the required information size has not been reached
Trial sequential analysis [25, 26] or another valid sequential method [53] may be used to show if information size adjusted thresholds for significance are crossed or not. Trial sequential analysis uses the LanDeMets trial sequential monitoring boundaries based on a O’BrienFleming alfaspending function because the sample sizes of the included trials vary. The monitoring boundaries for benefit, harm, or futility show the adjusted thresholds for significance (adjusted thresholds of the confidence intervals and the Pvalues) if a required information size has not been reached [25, 26, 47, 54]. We recommend analysing all primary and secondary outcome comparisons with trial sequential analysis [25, 26, 47, 54] or another valid sequential method [25, 26, 53].
Fixedeffect and randomeffects trial sequential analysis
The trial sequential analyses should be performed using both a fixedeffect and a randomeffects model [14, 25, 26]. Analogous to the metaanalysis (see step 1), the trial sequential analysis with the highest Pvalue should be chosen as the primary. If a fixedeffect trial sequential analysis is of primary interest, then the variance between trials should be zero because the underlying assumption behind the fixedeffect model is that all included trials are estimating the same intervention effect. This also means that the calculation of the required information size should not be adjusted according to variance between trials if a fixedeffect metaanalysis is chosen as the one of primary interest.
If the diversityadjusted required information size is reached or surpassed
If the diversityadjusted required information size is reached, then the traditional thresholds for significance (95% confidence intervals not containing 1.00 for binary outcomes or 0.00 for continuous outcomes, and a corresponding Pvalue under 0.05) may be used unless these thresholds are adjusted according to problems with multiplicity (see step 3).
Step 5: Bayes factor
A low Pvalue indicates that an observed result is unlikely given the null hypothesis is true [12] – the Pvalue relates to the probability that there is no difference in effect between the compared interventions. Even a low Pvalue from a metaanalysis can be misleading if there is also a low probability that data are compatible with the anticipated intervention effect (see step 4). In other words, the probability that the actual measured difference in effect of the compared interventions resulted from an a priori anticipated ‘true’ difference needs to be considered. For this purpose it is helpful to calculate Bayes factors for the primary outcome/s (http://www.ctu.dk/toolsandlinks/bayesfactorcalculation.aspx) [55, 56]. Bayes factor is the ratio between the probability of the metaanalysis result given the null hypothesis (H_{0}) is true divided by the probability of the metaanalysis result given the alternative hypothesis (H_{A}) is true [12]. In the following, we have chosen to define the alternative hypothesis (H_{A}) as the anticipated intervention effect used in the calculation of the required information size in the trial sequential analysis (see step 4) [12], but Bayes factor can be defined differently [12]. Bayes factor as defined here may be calculated via the following formula (http://www.ctu.dk/toolsandlinks/bayesfactorcalculation.aspx):
μ_{A} = the intervention effect hypothesised in the estimation of the required information size (for example, a mean difference, a log odds ratio, a log relative risk, or a log hazard ratio).
$\widehat{\delta}$ = the intervention effect shown by the metaanalysis result (for example, a mean difference, a log odds ratio, a log relative risk, or a log hazard ratio). SE $\widehat{\delta}$ = standard error of $\widehat{\delta}$. Details about Bayes factor and relevant basic statistical methodology may be found elsewhere [12, 50].
A narrow confidence interval associated with a low Pvalue does not, as mentioned, necessarily correspond to a low Bayes factor – and a Pvalue less than 0.05 may therefore, in some circumstances, misleadingly indicate evidence for an intervention effect [50, 55]. A Pvalue less than 0.05 combined with a Bayes factor greater than 1.0 obtained from the observed data render more credit to the null hypothesis being true than the alternative hypothesis being true. A high Bayes factor will indicate that a metaanalysis result should be interpreted with caution, or, at least, indicate that the metaanalysis result is produced by an intervention effect that is lower than the anticipated intervention effect. A low Bayes factor together with a low Pvalue will correspond to a high probability of an intervention effect similar to or greater than the anticipated intervention effect used in the calculation of the required information size (Figure 1).
The Bayes factor threshold for significance
A Bayes factor less than 0.1 (a tenfold higher likelihood of compatibility with the alternative hypothesis than with the null hypothesis) may be chosen as threshold for significance [50]. However, spurious intervention effects caused by sparse data (see step 4) will also affect Bayes factor. Hence, if the required information size has not been reached, then even a Bayes factor less than 0.1 should be interpreted with caution [12, 57].
As mentioned in step 4, to reduce the required information size there is a risk of review authors using unrealistically large anticipated intervention effects in the calculation of the required information size. This problematic incentive will be counterbalanced by the use of Bayes factor because Bayes factor will increase if the metaanalysis shows an intervention effect smaller than the anticipated intervention effect (Figure 1). In other words, if unrealistically large intervention effects are anticipated, then observed unbiased data will often show smaller and more realistic intervention effects which will result in a relative high Bayes factor. The use of Bayes factor might be an incentive for a more realistic and smaller estimation of anticipated intervention effects, which will generally increase the validity of metaanalysis results and decrease the risk of metaanalysis, either overestimating or underestimating intervention effects. However, Bayes factor will still be misleading when an unrealistically large anticipated intervention effect is confirmed by ‘play of chance’, by an unrealistically large observed intervention effect [12, 50].
Step 6: the potential impact of systematic errors (‘bias’) on the metaanalysis results
Overall risk of bias
We have in step 3 to 5 described how to assess and take account of the risk of random error. However, it is of utmost importance also to assess the risk of systematic error (‘bias’) through subgroup analyses and sensitivity analyses [13]. Empirical evidence has repeatedly shown that trials with high risk of bias tend to overestimate benefits and underestimate harms [58–64]. The bias risk domains generation of allocation sequence, allocation concealment, blinding of participants and treatment providers, blinding of outcome assessors, incomplete outcome data (see paragraph below), selective outcome reporting, and industry funding have been shown to be of particular importance [13, 58–64]. A randomised clinical trial should, therefore, only be classified as overall ‘low risk of bias’ if all of the above mentioned bias components are assessed as ‘low risk of bias’ [13, 58–64]. The main conclusion of the systematic review ought to based on results of trials with low risk of bias – so, such an analysis should always be presented. A subgroup analysis should always be performed comparing the effects of trials with ‘low risk of bias’ to trials with ‘high risk of bias’.
The range of uncertainty due to the missing outcome data
If trial investigators have not used valid methods (for example, multiple imputation) to deal with missing data in the included randomised clinical trials [13], then there is a risk of biased review results [20, 65]. For example, if a certain group of participants are systematically lost to followup in only one of the compared intervention groups, then the review results might show a difference in effect between the compared groups due to attrition bias. If individual patient data are available, then multiple imputation might be used by the review authors, but even multiple imputation might lead to biased results if data are ‘missing not at random’ [65, 66].
For all metaanalyses, we recommend using at least two sensitivity analyses to assess the potential impact of the missing outcome data (risk of attrition bias) on the metaanalysis results [67]. The first sensitivity analysis is a ’bestworstcase’ scenario where it is assumed that all participants lost to followup in the experimental group have had a beneficial outcome (for example, had no serious adverse event); and all those with missing outcomes in the control group have had a harmful outcome (for example, have had a serious adverse event). The second sensitivity analysis is a ’worstbestcase’ scenario where it is assumed that all participants lost to followup in the experimental group have had a harmful outcome; and that all those lost to followup in the control group have had a beneficial outcome. If continuous outcomes are used, then a ‘beneficial outcome’ might be the group mean plus 2 standard deviations (or 1 standard deviation) of the group mean, and a ‘harmful outcome’ might be the group mean minus 2 standard deviations (or 1 standard deviation) of the group mean (see Additional file 1).
The results from both of these two extreme scenarios will most likely be unrealistic. However, these sensitivity analyses show the range of uncertainty due to missing data. The primary metaanalysis result may then be related to the results from the sensitivity analyses. If the primary metaanalysis result and the sensitivity analyses show similar confidence intervals and Pvalues, then the validity of the review results will increase considerably. If the primary metaanalysis results differ substantially from the results of the sensitivity analyses, then this will show that there is a risk of biased results due to attrition bias.
Step 7: publication bias
Systematic reviews aim to identify and include all randomised clinical trials addressing the question of the review [13]. However, trials with certain results might not be published for different reasons and this might consequently bias the review results [13]. Funnel plots can assess the risk of such ‘publication bias’ together with other bias [13].
A funnel plot is a simple scatter plot with the intervention effect estimates from the included trials (odds ratios and risk ratios should be plotted on a logarithmic scale) on the horizontal scale and the standard error of the intervention effect estimate on the vertical scale [13]. For example, if smaller trials without statistically significant effects remain unpublished, this will lead to an asymmetrical appearance of the funnel plot with a gap in the corner of the plot [13]. It is evident that such publication bias may tend to overestimate intervention effects [68].
Funnel plot should only be used to assess the risk of bias if at least 10 trials are included in the metaanalysis [13]. Funnel plots cannot be used to assess the risk of publication bias if the included trials are of similar size and some effect estimates are naturally correlated with their standard errors and therefore can produce spurious asymmetry in funnel plots [69]. Other types of bias or ‘true’ heterogeneity of the trial intervention effects might also produce asymmetry [13]. Contour lines corresponding to ‘milestones’ of statistical significance (for example, P = 0.05) can be added to the funnel plot and may help to differentiate between funnel plot asymmetry caused by publication bias and other factors [70]. For example, if trials appear to be missing in areas of statistical nonsignificance, then this may indicate that the asymmetry is due to publication bias. On the other hand, if the supposed missing trials are in areas of higher statistical significance, then the cause of the asymmetry may be due to factors other than publication bias. If there are no trials showing statistically significant effects, then publication bias may not be a plausible explanation for funnel plot asymmetry [13, 71].
A number of tests can assess funnel plot asymmetry (for example, Egger, Deeks, Harbourd, Begg) [13]. These tests should be interpreted in the light of visual inspection of the funnel plot and the results from these tests should be interpreted with caution. In general, the proposed tests have relatively low power [13].
The ‘trim and fill’ method can be used to assess the robustness of the review results. The trim and fill method aims to identify and correct funnel plot asymmetry arising from publication bias [13, 72]. The method removes the smaller trials causing funnel plot asymmetry, use the ‘trimmed’ funnel plot to estimate the true ‘centre’ of the funnel, and then includes both the removed trials and their ‘missing’ counterparts around the centre. The trim and fill method then provides an estimate of an intervention effect estimate adjusted for the assumed publication bias and the ‘missing’ trials are plotted. The trim and fill method assumes that the funnel plot asymmetry is caused by publication bias, but other factors (for example, heterogeneity) will often also cause or contribute to an asymmetry [13]. Therefore, the results from the trim and fill methods should be interpreted with caution and should primarily be hypothesis generating.
We have in this paragraph summarised The Cochrane Collaboration methodology to assess the risk of publication bias. For a more detailed description of assessments of publication bias, please consult The Cochrane Handbook for Systematic Reviews of Interventions [13].
Step 8: statistical significance and clinical significance
It will be impossible to interpret any review result without a thorough assessment of statistical significance and step 1 to 7 has described how to assess statistical significance. However, more than statistical significance is required before an intervention is declared as being effective, i.e., the size of the intervention effect must be clinically relevant in addition to the statistical significance. For clinically relevant outcomes, such as mortality, it is difficult to define a lower threshold for clinical significance. Any prevention, whatever small, of patientimportant outcomes may seem relevant [11, 12]. The clinical implications of statistically significant results on surrogate outcomes (‘indirectness’, in the GRADE system [16]), for example, serum levels of cholesterol or virological response, can often be questioned even with Pvalues far below 0.05 [11, 12]. Regardless of the type of outcome, small beneficial intervention effects will often not have any clinical relevance if adverse effects are taken into consideration [11, 12]. Even rare serious adverse effects may rule out the rational use of an otherwise beneficial intervention [73]. To assess the clinical significance of intervention effects it is important to perform a thorough assessment of the balance between beneficial and harmful effects [11, 12]. It is also important to relate the trial participants to a clinical population. Clinical implications of review results cannot be extrapolated to patient groups other than the patients included in the review [11]. Moreover, if a new intervention shows statistically significant effects but the size of the intervention effect is smaller compared to another intervention, then the new intervention effect might be considered as not clinically significant.
To avoid erroneous interpretations, assessment of the clinical significance of an intervention effect should only be assessed if statistical significance has been obtained, i.e., that the prior seven steps of our procedure have shown indications of a statistically significant result [74]. On the other hand, if statistical significance has been reached, then clinical significance must be assessed [11, 12, 16, 73, 74].
Preparing summary of findings tables according to the GRADE guidelines is an excellent way to demonstrate the clinical implications of review results [16–19] (Table 1). Reporting confidence intervals, ‘minimal clinical relevant differences’ for continuous outcomes, numbersneededtotreat for binary outcomes, and median survival times for survival data may also improve the clinical interpretability of review results [2, 11, 12, 16–19].
Recommendations
To assess the statistical and clinical significance of results from systematic reviews, we propose the following eightstep procedure:

1.
Calculate and report the confidence intervals and Pvalues from all fixedeffect and randomeffects metaanalyses. The most conservative result should be the main result.

2.
Explore the reasons behind substantial statistical heterogeneity by performing subgroup analyses and sensitivity analyses (see step 6).

3.
Adjust the thresholds for significance (Pvalues and the confidence intervals from the metaanalyses and the risks of type I error in the trial sequential analysis) according to the number of primary outcomes.

4.
Calculate and report realistic diversityadjusted required information sizes and analyse all primary and secondary outcomes in the review with trial sequential analysis. Report if the trial sequential monitoring boundaries for benefit, harm, or futility are crossed [25, 26]. The trial sequential analyses will adjust the confidence intervals and the thresholds for significance by relating the accrued data to the required information sizes [25, 26].

5.
Calculate and report Bayes factor for the primary outcome/s based on a prespecified anticipated intervention effect (same anticipated intervention effect as the one used to estimate the required information size) (http://www.ctu.dk/toolsandlinks/bayesfactorcalculation.aspx). A Bayes factor less than 0.1 (a tenfold higher likelihood of compatibility with the alternative hypothesis than with the null hypothesis) may be used as threshold for significance.

6.
Use subgroup analysis and sensitivity analyses to assess the potential impact of systematic errors (bias).

7.
Assess the risk of publication bias (funnel plot).

8.
Assess clinical significance of the review results if the prior seven steps have shown statistically significant results.
Table 2 summarises our suggestions for a more valid assessment of intervention effects in systematic reviews with metaanalytic methods, and we present an example of how the eightstep assessment can be used to assess statistical significance and clinical significance (see Additional file 1). For simplicity, we have only assessed the result of the primary outcome.
The eightstep procedure, The Cochrane Collaboration methodology, and the GRADE system
The eightstep procedure is designed to specify and assess the thresholds for significance in systematic reviews – the overall systematic review methodology should always be based on The Cochrane Handbook for Systematic Reviews of Interventions [2, 13]. The GRADE system provides a valid assessment of the quality of evidence in systematic reviews [2, 16–19]. Most of the methodological elements of the eightstep procedure, for example, trial sequential analyses, have already been used in several systematic reviews (for example, [75, 76]) but are neither described in The Cochrane Handbook for Systematic Reviews of Interventions nor the GRADE system. The eightstep procedure summarises the necessary methodology to assess the thresholds for significance and adds to the Cochrane methodology and the GRADE system: adjustments of the thresholds for significance according to problems with multiplicity and small information sizes, estimations of required information sizes, bestworst and worstbest case scenarios to assess the potential impact of missing data, and a calculation of Bayes factor. Table 1 summarises how GRADE may be used to assess the quality of the evidence [16], and we present an overview in Table 3 of how trial sequential analysis (step 4) may be used as a supplement for a more thorough assessment of ‘imprecision’ [77].
Discussion
We argue that a systematic review should be the prerogative and the incentive for conducting new trials and introducing new interventions into clinical practice. The systematic review ought be at the top of the hierarchy of evidence [6, 7, 11]. Nevertheless, there is a high risk of biased review results if reviews are not conducted with high methodological rigour. In order to avoid premature and erroneous conclusions, we have described the importance of valid thresholds for statistical and clinical significance in a systematic review, and we have also described a systematic stepwise methodological approach which will increase the methodological validity and quality of the interpretation of review results.
Strengths and limitations
The proposed eightstep procedure has strengths and limitations.
The eightstep procedure has the strength that it: (1) summarises The Cochrane Collaboration methodology related to the specification and assessment of the thresholds for significance in systematic reviews; (2) systematically adjusts the thresholds for significance according to the number of primary outcome comparisons and the fraction of the required information size that has been reached; (3) provides a likelihood ratio of the probability that a metaanalysis result is compatible with the null hypothesis divided by the probability that the result is compatible with an anticipated intervention effect; (4) limits the incentives for review authors both to overestimate and underestimate the anticipated intervention effect (see step 4 and step 5); (5) provides a more thorough assessment of the review results with a specific and elaborated evaluation of imprecision which may be used for a more accurate GRADE rating [16, 77]; and (6) forces investigators and consumers of systematically reviewed clinical research to judge clinical significance.
Our methodology has limitations.
First, our eightstep procedure is based on already wellestablished methodologies, but we lack large comparative studies comparing the use of the procedure to ‘usual practice’. We will address this issue in forthcoming articles.
Second, the pragmatic approach we recommend to be used for multiplicity adjustment (dividing 0.05 with the value halfway between 1 and the number of primary outcome comparisons) is not based on evidence. Our approach is based on the fact that the ‘true’ multiplicity adjusted threshold for significance lies between the unadjusted threshold and the Bonferroni adjusted threshold (see step 3). Nevertheless, most systematic reviewers do not adjust the thresholds for significance at all, which seems worse than both the conservative approach (Bonferroni adjustment) and our suggested pragmatic approach. As we have described in step 3, to calculate more precise adjustments of the thresholds for significance in systematic reviews an estimation of the correlation between the coprimary outcomes will be needed (see step 3). Such a correlation will often be unknown and erroneous assumptions about correlations might lead to erroneous results.
Third, the required information size, the trial sequential analysis, and the size of Bayes factor will all be highly dependent on the choice of the anticipated intervention effect which often will be difficult to quantify in advance. To reduce this limitation, it will often be warranted to perform sensitivity analyses using different estimations of the anticipated intervention effects. For example, as supplementary trial sequential analyses, the point estimate from the metaanalysis on the outcome and the limit of the 95% confidence interval closest to no effect can be used as the anticipated intervention effects; or an additional Bayes factor may be calculated using a smaller (‘sceptical’) anticipated intervention effect (for example, a relative risk halfway between the intervention effect anticipated in the calculation of the required information size and 1.0). The uncertainty related to the quantification of the anticipated intervention effects is a major limitation of our procedure as well as Bayesian analyses. To make the estimation of the anticipated intervention effects as optimal and as objective as possible, we recommend, as one option, to base the estimation of the anticipated intervention effect on former randomised clinical trials. If this is done, then the results from the trials used to estimate the anticipated intervention effects will probably be used again in the full review analysis, i.e., the estimation of the required information size take the form of an adaptive estimation [51, 78]. Hence, the risk of type I error will increase [51, 78]. Because of this risk of circular reasoning, ideally, it is, necessary to adjust the required information size by applying a penalty for the weight of data from former systematic reviews (or trials) [51, 78]. If the anticipated intervention effect estimate is based on, for example, an estimation of a ‘minimally clinically relevant’ intervention effect, then no adjustments will be needed. We acknowledge the theoretical need for such further adjustments, but then, the review analyses will become very complicated because the analysis methods would differ, depending on how the anticipated intervention effects are defined. Furthermore, our recommendations are already considerably tightening the thresholds for significance in systematic reviews, and if thresholds are too tight, there is a risk of ‘throwing the baby out with the bath water’.
The trial sequential analysis, the required information size, and Bayes factor may be influenced by posthoc adjustments and erroneous quantifications of the alternative hypothesis, especially if they have not been declared transparently in a protocol published before the systematic review begins. If the anticipated intervention effects are clearly defined in a published review protocol, and if predefined sensitivity analyses assess the uncertainty of the estimation of the anticipated intervention effects, then many of the problems with using anticipated intervention effect will be limited. The uncertainty of the estimation of the anticipated intervention effect is a challenge and a major limitation – but remains a necessary evil. A required information size and adjusted thresholds for significance cannot be calculated without estimations of an anticipated intervention effect.
We are sure that our eightstep procedure will be met with critical comments. Our recommendations will lead to fewer interventions which seem to offer benefits, the introduction of effective interventions in clinical practice may be delayed, and patient populations may stay without evidencebased interventions. We agree with such risks, but all things considered we do argue that a conservative scenario is more ethically defensible compared to the presentday practice of implementing interventions based on weak evidence [9, 25, 44, 45, 52, 79]. We argue that the advantages of the procedure outweigh the disadvantages. The health care researchers must deliver solid proof of more benefit than harms before interventions are introduced into clinical practice.
Conclusions
If the proposed eightstep procedure is followed, this may increase the validity of assessments of intervention effects in systematic reviews.
Ethical approval and consent
None needed.
Standards of reporting
There are no relevant standards of reporting for this type of paper.
Data availability
No additional data available.
References
 1.
Sun X, Guyatt G: Metaanalysis of randomized trials for health care interventions: one for all?. J Evid Based Med. 2009, 2 (1): 5356. 10.1111/j.17565391.2009.01006.x.
 2.
Murad MH, Montori VM, Ioannidis JP, Jaeschke R, Devereaux PJ, Prasad K, Neumann I, CarrascoLabra A, Agoritsas T, Hatala R, Meade MO, Wyer P, Cook DJ, Guyatt G: How to read a systematic review and metaanalysis and apply the results to patient care: users’ guides to the medical literature. JAMA. 2014, 312 (2): 171179. 10.1001/jama.2014.5559.
 3.
Stegenga J: Is metaanalysis the platinum standard of evidence?. Stud Hist Philos Biol Biomed Sci. 2011, 42 (4): 497507. 10.1016/j.shpsc.2011.07.003.
 4.
Hennekens CH, DeMets D: The need for largescale randomized evidence without undue emphasis on small trials, metaanalyses, or subgroup analyses. JAMA. 2009, 302 (21): 23612362. 10.1001/jama.2009.1756.
 5.
Borzak S, Ridker PM: Discordance between metaanalyses and largescale randomized, controlled trials. Examples from the management of acute myocardial infarction. Ann Intern Med. 1995, 123 (11): 873877. 10.7326/000348191231119951201000010.
 6.
Inthout J, Ioannidis JP, Borm GF: Obtaining evidence by a single wellpowered trial or several modestly powered trials. Stat Methods Med Res. 2012, Epub ahead of print
 7.
Borm GF, Lemmers O, Fransen J, Donders R: The evidence provided by a single trial is less reliable than its statistical analysis suggests. J Clin Epidemiol. 2009, 62 (7): 711715. 10.1016/j.jclinepi.2008.09.013. e711
 8.
Deeks JJ, Dinnes J, D’Amico R, Sowden AJ, Sakarovitch C, Song F, Petticrew M, Altman DG: Evaluating nonrandomised intervention studies. Health Technol Assess. 2003, 7 (27): iiix. 1–173
 9.
Ioannidis J: Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005, 294 (2): 218228. 10.1001/jama.294.2.218.
 10.
Ioannidis JP, Haidich AB, Pappa M, Pantazis N, Kokori SI, Tektonidou MG, ContopoulosIoannidis DG, Lau J: Comparison of evidence of treatment effects in randomized and nonrandomized studies. JAMA. 2001, 286 (7): 821830. 10.1001/jama.286.7.821.
 11.
Jakobsen JC, Gluud C: The necessity of randomized clinical trials. Br J Med Res. 2013, 3 (4): 14531468. 10.9734/BJMMR/2013/3208.
 12.
Jakobsen JC, Gluud C, Winkel P, Lange T, Wetterslev J: The thresholds for statistical and clinical significance  a fivestep procedure for evaluation of intervention effects in randomised clinical trials. BMC Med Res Methodol. 2014, 14 (34): doi:10.1186/147122881434
 13.
Higgins JPT, Green S: The Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0. The Cochrane Collaboration. 2011, Available from http://www.cochranehandbook.org
 14.
Keus F, Wetterslev J, Gluud C, van Laarhoven CJ: Evidence at a glance: error matrix approach for overviewing available evidence. BMC Med Res Methodol. 2010, 10: 9010.1186/147122881090.
 15.
Clarke M, Brice A, Chalmers I: Accumulating research: a systematic account of how cumulative metaanalyses would have provided knowledge, improved health, reduced harm and saved resources. PLoS One. 2014, 9 (7): e10267010.1371/journal.pone.0102670.
 16.
Agoritsas T, Guyatt GH: Evidencebased medicine 20 years on: a view from the inside. Can J Neurol Sci. 2013, 40 (4): 448449.
 17.
Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, Norris S, FalckYtter Y, Glasziou P, DeBeer H, Jaeschke R, Rind D, Meerpohl J, Dahm P, Schunemann HJ: GRADE guidelines: 1. IntroductionGRADE evidence profiles and summary of findings tables. J Clin Epidemiol. 2011, 64 (4): 383394. 10.1016/j.jclinepi.2010.04.026.
 18.
Guyatt G, Oxman AD, Sultan S, Brozek J, Glasziou P, AlonsoCoello P, Atkins D, Kunz R, Montori V, Jaeschke R, Rind D, Dahm P, Akl EA, Meerpohl J, Vist G, Berliner E, Norris S, FalckYtter Y, Schunemann HJ: GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes. J Clin Epidemiol. 2013, 66 (2): 151157. 10.1016/j.jclinepi.2012.01.006.
 19.
Guyatt GH, Thorlund K, Oxman AD, Walter SD, Patrick D, Furukawa TA, Johnston BC, Karanicolas P, Akl EA, Vist G, Kunz R, Brozek J, Kupper LL, Martin SL, Meerpohl JJ, AlonsoCoello P, Christiansen R, Schunemann HJ: GRADE guidelines: 13. Preparing summary of findings tables and evidence profilescontinuous outcomes. J Clin Epidemiol. 2013, 66 (2): 173183. 10.1016/j.jclinepi.2012.08.001.
 20.
Bjelakovic G, Gluud Lise L, Nikolova D, Whitfield K, Wetterslev J, Simonetti Rosa G, Bjelakovic M, Gluud C: Vitamin D supplementation for prevention of mortality in adults. Cochrane Database Syst Rev. 2014, 1: CD007470
 21.
DerSimonian R, Laird N: Metaanalysis in clinical trials. Control Clin Trials. 1986, 7 (3): 177188. 10.1016/01972456(86)900462.
 22.
IntHout J, Ioannidis JP, Borm GF: The HartungKnappSidikJonkman method for random effects metaanalysis is straightforward and considerably outperforms the standard DerSimonianLaird method. BMC Med Res Methodol. 2014, 14: 2510.1186/147122881425.
 23.
Higgins J, Thompson S, Deeks J, Altman D: Statistical heterogeneity in systematic reviews of clinical trials: a critical appraisal of guidelines and practice. J Health Serv Res Policy. 2002, 7 (1): 5161. 10.1258/1355819021927674.
 24.
Higgins JP, Spiegelhalter DJ: Being sceptical about metaanalyses: a Bayesian perspective on magnesium trials in myocardial infarction. Int J Epidemiol. 2002, 31 (1): 96104. 10.1093/ije/31.1.96.
 25.
Brok J, Thorlund K, Gluud C, Wetterslev J: Trial sequential analysis reveals insufficient information size and potentially false positive results in many metaanalysis. J Clin Epidemiol. 2008, 61: 763769. 10.1016/j.jclinepi.2007.10.007.
 26.
Wetterslev J, Thorlund K, Brok J, Gluud C: Trial sequential analysis may establish when firm evidence is reached in cumulative metaanalysis. J Clin Epidemiol. 2008, 61 (1): 6475. 10.1016/j.jclinepi.2007.03.013.
 27.
Rücker G, Schwarzer G, Carpenter JR, Schumacher M: Undue reliance on I(2) in assessing heterogeneity may mislead. BMC Med Res Methodol. 2008, 8: 7910.1186/14712288879.
 28.
Berlin JA, Crowe BJ, Whalen E, Xia HA, Koro CE, Kuebler J: Metaanalysis of clinical trial safety data in a drug development program: answers to frequently asked questions. Clin Trials. 2013, 10 (1): 2031. 10.1177/1740774512465495.
 29.
Bender R, Bunce C, Clarke M, Gates S, Lange S, Pace NL, Thorlund K: Attention should be given to multiplicity issues in systematic reviews. J Clin Epidemiol. 2008, 61 (9): 857865. 10.1016/j.jclinepi.2008.03.004.
 30.
The European Agency for the Evaluation of Medical Products: Points to consider on multiplicity issues in clinical trials. Eval Med Human Use. 2002, CPMP/EWP/908/99
 31.
Imberger G, Vejlby AD, Hansen SB, Møller AM, Wetterslev J: Statistical multiplicity in systematic reviews of anaesthesia interventions: a quantification and comparison between Cochrane and nonCochrane reviews. PLoS One. 2011, 6: e2842210.1371/journal.pone.0028422.
 32.
Schulz KF, Altman DG, Moher D: CONSORT 2010 statement: updated guidelines for reporting parallel group randomized trials. Ann Int Med. 2010, 152 (11): 726732. 10.7326/000348191521120100601000232.
 33.
Equator Network: Enhancing the QUAlity and transparency of health research. 2014, Available at: http://wwwequatornetworkorg/
 34.
Abdi H: The Bonferonni and Šidák corrections for multiple comparisons. Encyclopedia of Measurement and Statistics Thousand Oaks (CA): Sage. Edited by: Salkind NJ. 2007, 103107.
 35.
Williamson PR, Altman DG, Blazeby JM, Clarke M, Gargon E: The COMET (Core Outcome Measures in Effectiveness Trials) Initiative. Trials. 2011, 12 (Suppl 1): A7010.1186/1745621512S1A70.
 36.
Altman DG, Bland JM: How to obtain the P value from a confidence interval. BMJ. 2011, 343: d230410.1136/bmj.d2304.
 37.
Mehrotra DV, Heyse JF: Use of the false discovery rate for evaluating clinical safety data. Stat Methods Med Res. 2004, 13 (3): 227238.
 38.
Berry SM, Berry DA: Accounting for multiplicities in assessing drug safety: a threelevel hierarchical mixture model. Biometrics. 2004, 60 (2): 418426. 10.1111/j.0006341X.2004.00186.x.
 39.
Xia HA, Ma H, Carlin BP: Bayesian hierarchical modeling for detecting safety signals in clinical trials. J Biopharm Stat. 2011, 21 (5): 10061029. 10.1080/10543406.2010.520181.
 40.
Higgins JP, Thompson SG, Spiegelhalter DJ: A reevaluation of randomeffects metaanalysis. J R Stat Soc Ser A Stat Soc. 2009, 172 (1): 137159. 10.1111/j.1467985X.2008.00552.x.
 41.
Wei Y, Higgins JP: Bayesian multivariate metaanalysis with multiple outcomes. Stat Med. 2013, 32 (17): 291134. 10.1002/sim.5745.
 42.
Harmonised Tripartite Guideline ICH: Statistical principles for clinical trials. International Conference on Harmonisation E9 Expert Working Group. Stat Med. 1999, 18 (15): 19051942.
 43.
Turner RM, Bird SM, Higgins JP: The impact of study size on metaanalyses: examination of underpowered studies in Cochrane reviews. PLoS One. 2013, 8 (3): e5920210.1371/journal.pone.0059202.
 44.
Pereira TV, Horwitz RI, Ioannidis JP: Empirical evaluation of very large treatment effects of medical interventions. JAMA. 2012, 308: 16761684. 10.1001/jama.2012.13444.
 45.
Thorlund K, Devereaux PJ, Wetterslev J, Guyatt G, Ioannidis JP, Thabane L, Gluud LL, AlsNielsen B, Gluud C: Can trial sequential monitoring boundaries reduce spurious inferences from metaanalyses?. Int J Epidemiol. 2009, 38 (1): 276286. 10.1093/ije/dyn179.
 46.
Thorlund K, Imberger G, Walsh M, Chu R, Gluud C, Wetterslev J, Guyatt G, Devereaux PJ, Thabane L: The number of patients and events required to limit the risk of overestimation of intervention effects in metaanalysis – a simulation study. PLoS One. 2011, 6: e2549110.1371/journal.pone.0025491.
 47.
Thorlund K, Engstrøm J, Wetterslev J, Brok J, Imberger G, Gluud C: User manual for trial sequential analysis (TSA). Copenhagen Trial Unit, Centre for Clinical Intervention Research. 2011, Copenhagen Denmark, 1115. Available from http://www.ctu.dk/tsa
 48.
Chow SC: Sample size calculations for clinical trials. WIREs Comp Stat. 2011, 3: 414427. 10.1002/wics.155.
 49.
Pogue JM, Yusuf S: Cumulating evidence from randomized trials: utilizing sequential monitoring boundaries for cumulative metaanalysis. Control Clin Trials. 1997, 18 (6): 580593. 10.1016/S01972456(97)000512.
 50.
Spiegelhalter DJ, Abrams KR, Myles JP: Bayesian approaches to clinical trials and healthcare evaluation (Statistics in Practice). 2004, West Sussex: John Wiley & Sons Ltd
 51.
Roloff V, Higgins JP, Sutton AJ: Planning future studies based on the conditional power of a metaanalysis. Stat Med. 2013, 32 (1): 1124. 10.1002/sim.5524.
 52.
Wetterslev J, Thorlund K, Brok J, Gluud C: Estimating required information size by quantifying diversity in randomeffects model metaanalyses. BMC Med Res Methodol. 2009, 9: 8610.1186/14712288986.
 53.
Higgins JP, Whitehead A, Simmonds M: Sequential methods for randomeffects metaanalysis. Stat Med. 2011, 30 (9): 903921. 10.1002/sim.4088.
 54.
Pocock SJ: When to stop a clinical trial. BMJ. 1992, 305 (6847): 235240. 10.1136/bmj.305.6847.235.
 55.
Goodman SN: Introduction to Bayesian methods I: measuring the strength of evidence. Clin Trials. 2005, 2: 282378. 10.1191/1740774505cn098oa.
 56.
Goodman SN: Toward evidencebased medical statistics. 2: The Bayes factor. Ann Intern Med. 1999, 130 (12): 10051013. 10.7326/000348191301219990615000019.
 57.
Korn EL, Freidlin B: The likelihood as statistical evidence in multiple comparisons in clinical trials: no free lunch. Biom J. 2006, 48 (3): 346355. 10.1002/bimj.200510216.
 58.
Savovic J, Jones HE, Altman DG, Harris RJ, Juni P, Pildal J, AlsNielsen B, Balk EM, Gluud C, Gluud LL, Ioannidis JP, Schulz KF, Beynon R, Welton NJ, Wood L, Moher D, Deeks JJ, Sterne JA: Influence of reported study design characteristics on intervention effect estimates from randomized, controlled trials. Ann Intern Med. 2012, 157 (6): 429438. 10.7326/00034819157620120918000537.
 59.
Schulz KF, Chalmers I, Hayes RJ, Altman DG: Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995, 273 (5): 408412. 10.1001/jama.1995.03520290060030.
 60.
Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, Stroup DF: Improving the quality of reports of metaanalyses of randomised controlled trials: the QUOROM statement. Quality of Reporting of Metaanalyses. Lancet. 1999, 354 (9193): 18961900. 10.1016/S01406736(99)041495.
 61.
Kjaergaard L, Villumsen J, Gluud C: Reported methodologic quality and discrepancies between large and small randomized trials in metaanalysis. Ann Int Med. 2001, 135 (11): 982989. 10.7326/000348191351120011204000010.
 62.
Gluud LL, Thorlund K, Gluud C, Woods L, Harris R, Sterne JA: Correction: reported methodologic quality and discrepancies between large and small randomized trials in metaanalyses. Ann Int Med. 2008, 149 (3): 219
 63.
Wood L, Egger M, Gluud LL, Schulz KF, Juni P, Altman DG, Gluud C, Martin RM, Wood AJ, Sterne JA: Empirical evidence of bias in treatment effect estimates in controlled trials with different interventions and outcomes: metaepidemiological study. BMJ. 2008, 336 (7644): 601605. 10.1136/bmj.39465.451748.AD.
 64.
Lundh A, Sismondo S, Lexchin J, Busuioc OA, Bero L: Industry sponsorship and research outcome. Cochrane Database Syst Rev. 2012, 12: MR000033
 65.
Sterne JAC, White IA, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009, 338 (b2393):
 66.
Little RJ, D’Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, Frangakis C, Hogan JW, Molenberghs G, Murphy SA, Neaton JD, Rotnitzky A, Scharfstein D, Shih WJ, Siegel JP, Stern H: The prevention and treatment of missing data in clinical trials. N Engl J Med. 2012, 367 (14): 13551360. 10.1056/NEJMsr1203730.
 67.
Carpenter J, Kenward M: Multiple Imputation and its Application. 2013, West Sussex: John Wiley & Sons Ltd
 68.
Egger M, Davey Smith G, Schneider M, Minder C: Bias in metaanalysis detected by a simple, graphical test. BMJ. 1997, 315 (7109): 629634. 10.1136/bmj.315.7109.629.
 69.
Higgins J: onvincing evidence from controlled and uncontrolled studies on the lipidlowering effect of a statin [editorial]. Cochrane Database Syst Rev. 2012, 12 (ED000049): http://www.thecochranelibrary.com/details/editorial/3992221/Convincingevidencefromcontrolledanduncontrolledstudiesonthelipidloweri.html,
 70.
Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L: Contourenhanced metaanalysis funnel plots help distinguish publication bias from other causes of asymmetry. J Clin Epidemiol. 2008, 61 (10): 991996. 10.1016/j.jclinepi.2007.11.010.
 71.
Ioannidis JP, Trikalinos TA: The appropriateness of asymmetry tests for publication bias in metaanalyses: a large survey. CMAJ. 2007, 176 (8): 10911096. 10.1503/cmaj.060410.
 72.
Duval S, Tweedie R: Trim and fill: A simple funnelplotbased method of testing and adjusting for publication bias in metaanalysis. Biometrics. 2000, 56 (2): 455463. 10.1111/j.0006341X.2000.00455.x.
 73.
Jüni P, Nartey L, Reichenbach S, Sterchi R, Dieppe PA, Egger M: Risk of cardiovascular events and rofecoxib: cumulative metaanalysis. Lancet. 2004, 364 (9450): 20212029. 10.1016/S01406736(04)175144.
 74.
Wood J, Freemantle N, King M, Nazareth I: Trap of trends to statistical significance: likelihood of near significant P value becoming more significant with extra data. BMJ. 2014, 348: g221510.1136/bmj.g2215.
 75.
Bolland MJ, Grey A, Gamble GD, Reid IR: The effect of vitamin D supplementation on skeletal, vascular, or cancer outcomes: a trial sequential metaanalysis. Lancet Diab Endocrinol. 2014, 2 (4): 307320. 10.1016/S22138587(13)702122.
 76.
Bjelakovic G, Nikolova D, Gluud LL, Simonetti RG, Gluud C: Antioxidant supplements for prevention of mortality in healthy participants and patients with various diseases. Cochrane Database Syst Rev. 2012, 3: CD007176
 77.
Guyatt GH, Oxman AD, Kunz R, Brozek J, AlonsoCoello P, Rind D, Devereaux PJ, Montori VM, Freyschuss B, Vist G, Jaeschke R, Williams JW, Murad MH, Sinclair D, FalckYtter Y, Meerpohl J, Whittington C, Thorlund K, Andrews J, Schunemann HJ: GRADE guidelines 6. Rating the quality of evidenceimprecision. J Clin Epidemiol. 2011, 64 (12): 12831293. 10.1016/j.jclinepi.2011.01.012.
 78.
Bauer P, Kohne K: Evaluation of experiments with adaptive interim analyses. Biometrics. 1994, 50 (4): 10291041. 10.2307/2533441.
 79.
Ioannidis JPA: Why most published research findings are false. PLoS Med. 2005, 2 (8): e12410.1371/journal.pmed.0020124.
Prepublication history
The prepublication history for this paper can be accessed here:http://0www.biomedcentral.com.brum.beds.ac.uk/14712288/14/120/prepub
Acknowledgements
We thank Dimitrinka Nikolova for her patient copyediting and suggestions. This study was partly funded by The Copenhagen Trial Unit.
Author information
Additional information
Competing interests
JW and CG have been a part of the task force, which has developed the trial sequential analysis. Otherwise the authors declare that they have no competing interests.
Authors’ contributions
JCJ wrote the first draft. TL contributed with statistical expertise. All authors (JCJ, JW, PW, TL, CG) contributed academically to the manuscript and have accepted the manuscript for publication.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under an open access license. Please check the 'Copyright Information' section for details of this license and what reuse is permitted. If your intended use exceeds what is permitted by the license or if you are unable to locate the licence and reuse information, please contact the Rights and Permissions team.
About this article
Cite this article
Jakobsen, J.C., Wetterslev, J., Winkel, P. et al. Thresholds for statistical and clinical significance in systematic reviews with metaanalytic methods. BMC Med Res Methodol 14, 120 (2014). https://0doiorg.brum.beds.ac.uk/10.1186/1471228814120
Received:
Accepted:
Published:
DOI: https://0doiorg.brum.beds.ac.uk/10.1186/1471228814120
Keywords
 Intervention Effect Estimate
 Funnel Plot Asymmetry
 Review Result
 Trial Sequential Analysis
 Require Information Size