Group sequential designs in pragmatic trials: feasibility and assessment of utility using data from a number of recent surgical RCTs

Parsons, Nick R.; Stallard, Nigel; Parsons, Helen; Haque, Aminul; Underwood, Martin; Mason, James; Khan, Iftekhar; Costa, Matthew L.; Griffin, Damian R.; Griffin, James; Beard, David J.; Cook, Jonathan A.; Davies, Loretta; Hudson, Jemma; Metcalfe, Andrew

doi:10.1186/s12874-022-01734-2

Research
Open access
Published: 01 October 2022

BMC Medical Research Methodology volume 22, Article number: 256 (2022)

Abstract

Background

Assessing the long term effects of many surgical interventions tested in pragmatic RCTs may require extended periods of participant follow-up to assess effectiveness and use patient-reported outcomes that require large sample sizes. Consequently the RCTs are often perceived as being expensive and time-consuming, particularly if the results show the test intervention is not effective. Adaptive, and particularly group sequential, designs have great potential to improve the efficiency and cost of testing new and existing surgical interventions. As a means to assess the potential utility of group sequential designs, we re-analyse data from a number of recent high-profile RCTs and assess whether using such a design would have caused the trial to stop early.

Methods

Many pragmatic RCTs monitor participants at a number of occasions (e.g. at 6, 12 and 24 months after surgery) during follow-up as a means to assess recovery and also to keep participants engaged with the trial process. Conventionally one of the outcomes is selected as the primary (final) outcome, for clinical reasons, with others designated as either early or late outcomes. In such settings, novel group sequential designs that use data from not only the final outcome but also from early outcomes at interim analyses can be used to inform stopping decisions. We describe data from seven recent surgical RCTs (WAT, DRAFFT, WOLLF, FASHION, CSAW, FIXDT, TOPKAT), and outline possible group sequential designs that could plausibly have been proposed at the design stage. We then simulate how these group sequential designs could have proceeded, by using the observed data and dates to replicate how information could have accumulated and decisions been made for each RCT.

Results

The results of the simulated group sequential designs showed that for two of the RCTs it was highly likely that they would have stopped for futility at interim analyses, potentially saving considerable time (15 and 23 months) and costs and avoiding patients being exposed to interventions that were either ineffective or no better than standard care. We discuss the characteristics of RCTs that are important in order to use the methodology we describe, particularly the value of early outcomes and the window of opportunity when early stopping decisions can be made and how it is related to the length of recruitment period and follow-up.

Conclusions

The results for five of the RCTs tested showed that group sequential designs using early outcome data would have been feasible and likely to provide designs that were at least as efficient, and possibly more efficient, than the original fixed sample size designs. In general, the amount of information provided by the early outcomes was surprisingly large, due to the strength of correlations with the primary outcome. This suggests that the methods described here are likely to provide benefits more generally across the range of surgical trials and more widely in other application areas where trial designs, outcomes and follow-up patterns are structured and behave similarly.

Background

Pragmatic clinical trials, which test interventions in everyday (routine practice) settings, typically have a number of important distinguishing characteristics that in large part determine their design and implementation [1, 2]. Primary amongst these are that they require large sample sizes (due to heterogeneity in the study population and interventions) and long follow-up periods in order to assess effectiveness. One of the most important application areas for pragmatic trials is the assessment of surgical interventions [3, 4]; i.e. trials that involve a surgical intervention, or immediate postoperative intervention (e.g. wound management), in one or more arms of a study. Such interventions have historically been introduced based solely on what a surgeon believes might benefit patients; the perceived lack of rigour and inefficiency in surgical trials has motivated the development of many new processes and methodologies [5,6,7], and a consequent steady increase in the number of large randomised controlled trials (RCTs) over the last ten years. Many of the late-stage clinical trials testing surgical interventions are in Trauma and Orthopaedics (T&O). These trials are large, often because they use patient reported outcomes (PROMs) [2, 8], may take many (e.g. more than 5) years to complete due to the long follow-up required, and are consequently expensive. Adaptive, and particularly group sequential, designs may have enormous potential to improve both the efficiency and cost of testing new and existing surgical interventions, and present an exciting opportunity for future research.

The 2022 START:REACTS clinical trial (Subacromial spacer for Tears Affecting Rotator cuff Tendons: a Randomised, Efficient, Adaptive Clinical Trial in Surgery)[9], stopped early and used a novel group sequential design, originally proposed by Parsons et al. [10] as a means to undertake clinical trials in a much more flexible and efficient manner, whilst retaining trial integrity. The approach proposed in the paper by Parsons et al. exploited the fact that it is very common in surgical trials to routinely monitor participants (often remotely) at a number of fixed occasions prior to collecting the definitive (final) study outcome (e.g. early outcomes might be collected at 3 and 6 months, prior to the main 12 month time-point). In such settings, if an interim analysis uses information from only those participants with final outcome data, then the opportunities for early stopping are likely to be limited simply by time constraints, as often trial recruitment will have completed prior to sufficient final outcome data being available for stopping decisions to be made. However, if the early outcomes are correlated with the final outcome, then a group sequential analysis [11] which uses the totality of information available from both early and final outcomes to estimate the treatment effect at the final study endpoint is likely to make adaptive designs feasible and lead to increases in statistical power [12,13,14].

Historically group sequential designs have not been used much, if at all, in surgical trials; a 2015 study [15] reported that only 1% of group sequential randomised controlled trials in peer-reviewed journals used a surgical intervention (60% used drugs, with the majority of RCTs in oncology). This, in part, reflects the fact that surgical trials have been behind other application areas in terms of the amount of rigorous research undertaken and the sophistication of the research methods employed. However, things have changed considerably in recent years with many more active research groups in the UK (where we are based) and around the world reporting the results of large (multicentre) RCTs in high-impact medical journals.

There are some universal barriers to the uptake of adaptive design methods that exist across all medical specialties, in particular the lack of knowledge, training and statistical expertise in research teams and more general anxiety about the impact of early stopping [16,17,18], which we will not address here. Provided we can overcome these more general barriers to uptake and specific concerns and issues around the appropriate methods to use and how they should be implemented, adaptive designs will likely become important and widely used methods in surgical RCTs. The group sequential design approach of Parsons et al. [10], was described in the context of a study (START:REACTS) comparing two treatment arms with two early outcome measures. The UK National Institute of Health Research (NIHR) funded study team (Efficacy and Mechanism Evaluation (EME) Programme, project reference 16.61.18) that undertook the START:REACTS study also investigated how the group sequential design methods used in the START:REACTS study might have been implemented and whether they would have resulted in changes in trial length and decision making in a number of recently undertaken high-profile conventional (fixed design) surgical trials in T&O. The main aims of this work were primarily to explore the generalisability of the methodological approach utilised in START:REACTS and assess whether this approach would have resulted in early stopping, using the original time sequence of patient recruitment data in these fixed design trials. This work is reported here using anonymized patient reported outcome data from seven T&O RCTs made available by Warwick Clinical Trials Unit (WCTU, Warwick Medical School; https://warwick.ac.uk/fac/sci/med/research/ctu/) and NDORMS, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (University of Oxford; https://www.ndorms.ox.ac.uk/).

A frequentist approach to the group sequential design is used here, defined by the error spent at each look with pre-defined information levels [11, 19, 20]. Bayesian methods are also widely available for adaptive group sequential designs [21], and have previously been suggested for trials in T&O and emergency medicine, albeit in very different applications to those presented here [22, 23]. Predominantly trials in T&O use patient reported outcome measures (PROMs), as the primary outcomes, which are typically assumed to be approximately normally distributed for the purposes of analysis. This is also the approach adopted here, as all the selected RCTs used PROMs; we do not discuss other (e.g. binary) outcome measures. However, we believe that the methods discussed could very easily be adapted for other outcomes and more generally, although the focus here is on surgery (and T&O), for trials in other application areas where settings (e.g. outcomes, patterns of follow-up and recruitment) are similar.

Methods

Data

Trial data from seven RCTs (see Table 1) were selected as typical of many recent surgical trials, in terms of the sample size, recruitment and participant follow-up periods and division of resources between primary and early outcome measures and for the pragmatic reason that the study principal investigators were able to respond quickly and positively to data sharing requests, and published protocols were available for all the studies. Unadjusted estimates of treatment effects and other key features are now described for each RCT; more detailed descriptions are available in an additional file [see Additional file 1]. We have chosen to use the data from trials unchanged, with the same sample sizes, for reasons of simplicity and ease of interpretation, rather than inflate the sample sizes as would be conventional to retain study power. In practice, we would typically increase the sample size, dependent on the number of planned interim analyses and stopping probabilities (see Boundaries and information monitoring), by a small amount to allow for the possible adaptations. Typically, this would make only a small or moderate change to the study sample size; for instance, in the 2022 START:REACTS clinical trial [9], the sample was increased from 170 to 188 (before allowance for missing data), to retain power at 90%.

Table 1 Brief details of selected RCTs

WAT

The Warwick Arthroplasty Trial (WAT) was a two arm, parallel group, RCT conducted in the UK [24, 25], recruiting \(N=126\) patients, between May 2007 and February 2010, suitable for a resurfacing arthroplasty of the hip. Patients were randomly assigned on a 1:1 basis to receive either a total hip arthroplasty (THA) or a resurfacing arthroplasty (RSA). The primary outcome was hip function, as measured by the patient-reported Oxford Hip Score (OHS; scale 0 to 48, with 48 representing no pain and perfect function) at 12 months (12m) after operation, with early outcomes assessed at 6 weeks (6w), 3 months (3m) and 6 months (6m). The main result of the study was that there was no statistically significant difference in OHS between groups at 12 months; the mean score in the RSA group was 40.4 (\(N1_{12m}=57\)) and in the THA group 38.2 (\(N0_{12m}=63\)), a difference of 2.2 (95%CI; \(-0.5\) to 4.9).

DRAFFT

The Distal Radius Acute Fracture Fixation Trial (DRAFFT) compared Kirschner wire fixation (Wire) with volar locking plate fixation (Plate) for \(N=461\) patients with a dorsally displaced fracture of the distal radius, recruited between July 2010 and July 2012 and randomised on a 1:1 basis [26, 27]. The trial used the Patient Rated Wrist Evaluation (PRWE; scale 0 to 100, with 100 being the worst score) score at 12 months (12m) after surgery to assess participants, with early assessments at 3 and 6 months. The main result of the study was that there was no statistically significant difference in PRWE score between groups at 12 months; the mean score in the Wire group was 15.3 (\(N0_{12m}=211\)) and in the Plate group 13.9 (\(N1_{12m}=204\)), a difference of 1.4 (95%CI; \(-1.8\) to 4.5).

WOLLF

The Wound management of Open Lower Limb Fractures (WOLLF) trial was a multi-centre randomized trial performed in the UK Major Trauma Network, recruiting \(N=460\) patients with a severe open fracture of the lower limb from July 2012 to December 2015 [28, 29]. Participants were randomized on a 1:1 basis to either negative pressure wound therapy (NPWT) or standard (Standard) wound management. The primary outcome of the study was the Disability Rating Index (DRI) score (range, 0 = no disability to 100 = completely disabled) at 12 months (12m), with early outcomes measured at 3, 6 and 9 months. The main result of the study was that there was no statistically significant difference in DRI score between groups at 12 months; the mean score in the NPWT group was 45.5 (\(N1_{12m}=179\)) and in the standard dressing group 42.4 (\(N0_{12m}=195\)), a difference of \(-3.1\) (95%CI; \(-8.5\) to 2.2).

FASHION

The Full UK RCT of Arthroscopic surgery for Hip Impingement versus best cONservative care trial (FASHION) was a pragmatic, multicentre, RCT recruiting \(N=348\) adult patients, between July 2012 and July 2016, with femoroacetabular impingement syndrome, who were randomly allocated on a 1:1 basis to receive either hip arthroscopic surgery (Surgery) or personalised hip therapy (PHT) and followed-up at 6 and 12 months [30,31,32]. The primary outcome was the patient-reported International Hip Outcome Tool (iHOT-33; scale 0 to 100, with 100 representing no pain and perfect function) at 12 months after randomisation, with early outcome assessed at 6 months. The primary result of the study was that there was a statistically significant difference in iHOT-33 score between groups at 12 months; the mean score in the Surgery group was 58.8 (\(N1_{12m}=158\)) and in the PHT group 49.7 (\(N0_{12m}=163\)), a difference of 9.1 (95%CI; 3.3 to 14.9).

CSAW

The Can Shoulder Arthroscopy Work (CSAW) trial was a three arm trial, but we limit discussion here to the main treatment comparison. The CSAW RCT randomized \(N=210\) participants (on a 1:1 basis), from September 2012 to June 2015, to either Arthroscopic SubAcromial Decompression (ASAD) or Active Monitoring with Specialist Reassessment (AMSR; no surgical treatment) and used the Oxford Shoulder Score (OSS; scale 0 to 48, with 0 being the worst score) at 6 months after randomisation to assess outcomes [33, 34]. OSS was also assessed at 12 months after randomisation, but no early assessment of OSS was made before the 6 months primary endpoint. The primary result of the study was that there was no statistically (or clinically) significant difference in OSS at 6 months between groups; the mean score in the ASAD group was 32.7 (\(N1_{6m}=90\)) and in the AMSR group 29.4 (\(N0_{6m}=90\)), a difference of 3.3 (95%CI; \(-0.2\) to 6.8).

FIXDT

The FIXation of Distal Tibia fractures (FIXDT) trial recruited \(N=321\) patients between April 2013 and April 2016 and compared intramedullary nail fixation (Nail) with locking plate fixation (Plate) for adult patients with a displaced fracture of the distal tibia [35, 36] using the Disability Rating Index (DRI; range 0 to 100, with 100 being completely disabled) at 6 months (6m), with early outcome measured at 3 months (3m) and long-term outcome assessed at 12 months. The primary result of the study was that there was no statistically significant difference in DRI score between groups at 6 months; the mean score in the Nail group was 29.8 (\(N1_{6m}=142\)) and in the Plate group 33.8 (\(N0_{6m}=140\)), a difference of \(-4.0\) (95%CI; \(-9.6\) to 1.6).

TOPKAT

The Total Or Partial Knee Arthroplasty Trial (TOPKAT) randomized \(N=528\) participants (on a 1:1 basis) from January 2010 to September 2013 and compared total knee replacement (TKR) to partial knee replacement (PKR) for patients with medial compartment osteoarthritis of the knee using the Oxford Knee Score (OKS; scale 0 to 48, with 0 being the worst score) at 5 years (5y) after randomisation with early outcomes assessed at 2 months (2m) and on a yearly basis at 1, 2, 3 and 4 years [37, 38]. The primary result of the study was that there was no statistically significant difference in OKS between groups at 5 years; the mean score in the TKR group was 37.0 (\(N0_{5y}=231\)) and in the PKR group 38.0 (\(N1_{5y}=233\)), a difference of 1.0 (95%CI; \(-0.4\) to 2.5).

Adaptive group sequential designs

Overview

This study assesses whether the RCTs described here, which were originally implemented using conventional fixed sample size designs, would have stopped early if an adaptive (group sequential) trial design had been used. For the purposes of this work, all the selected RCTs had two treatment arms (with one nominally designated as the control or standard treatment), randomized participants to treatment groups in a 1:1 ratio and reported a single primary outcome, with one or more assessments of the trial outcome measure (e.g. outcomes at 3 and 6 months or 1, 2, 3, 4 and 5 years). In order to assess whether the trial would have stopped early, the temporal sequence of data accumulation in the original trial was replicated exactly, using the dates (which were available from the original trial databases) when each outcome measure was made. Using the original trial data, and selected options for the number of planned interim analyses and stopping boundaries, we simulate how each study might have progressed using the methodological approach described by Parsons et al. [10] for an adaptive two-arm clinical trial using early endpoints to inform decision making; this methodology is described in detail in an additional file [see Additional file 2]. The approach employed here, using available data from recent trials to retrospectively assess the utility of alternative (adaptive) designs, is similar in spirit to a number of other studies; see for instance [39] (Chapter 7).

To simulate a single instance of an adaptive trial the following procedure was implemented: (i) we decided on the number of interim analyses we wished to make, stopping probabilities and the information levels necessary to trigger the interim analyses; (ii) these settings were used to determine upper and lower stopping boundaries for the test statistics using pre-specified alpha-spending functions; (iii) data from the original trial were used to simulate information accrual in the new adaptive design, using the observed ordering of data accumulation from the original trial; (iv) when an information threshold was hit, a test statistic was calculated using all the available information, from the final (primary) and all early endpoints; (v) the test statistic was compared to the boundaries, with decisions on stopping following from this process; (vi) if the decision was to continue, more information was accrued and any additional interim analyses implemented until the final planned interim analysis.
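Steps (iv) to (vi) of this procedure amount to a simple decision rule, which can be sketched as follows. This is a minimal illustration only: the function name `run_sequential`, the convention that large positive values of the test statistic favour the test treatment, and the numerical values are assumptions made for the example and are not taken from any of the RCTs or from Additional file 2.

```python
from typing import List, Tuple


def run_sequential(z_at_looks: List[float],
                   bounds: List[Tuple[float, float]]) -> Tuple[str, int]:
    """Return (decision, look number) for a sequence of interim test statistics.

    z_at_looks[k] is the test statistic computed when the k-th information
    threshold is reached; bounds[k] = (lower, upper) on the same scale.
    Assumed convention: crossing the upper boundary means efficacy,
    crossing the lower boundary means futility.
    """
    for k, (z, (lower, upper)) in enumerate(zip(z_at_looks, bounds), start=1):
        if z >= upper:
            return "stop for efficacy", k
        if z <= lower:
            return "stop for futility", k
    # No boundary crossed at any planned interim analysis.
    return "continue to final analysis", len(bounds)


# Hypothetical boundaries and statistics, for illustration only.
example_bounds = [(-0.5, 3.5), (0.2, 3.0)]
decision = run_sequential([0.4, 0.1], example_bounds)
print(decision)
```

In this toy example the statistic stays between the boundaries at the first look and falls below the lower boundary at the second, so the sketch reports futility stopping at look 2.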

Treatment effect estimates

The primary interest of all the RCTs discussed here was to estimate the effect of the test treatment on the study outcome at the definitive (final) endpoint, time t (the primary study endpoint), which we hereafter call \(\beta _{t}\). In the simplest possible case a primary outcome is measured at time t only and these data alone inform the estimate of the treatment effect \(\beta _{t}\). However, if early outcomes (at times before t) are available, then they can provide information on the final outcome due to the correlation between the early and the final outcomes for each participant. A strong correlation (\(\rho\)) between, for instance, 3 and 6 month outcomes suggests that a good (or poor) outcome at 3 months will be indicative of a good (or poor) outcome at 6 months. Therefore, fitting longitudinal models to the time course of data allows one to exploit this early information to improve estimates of the treatment effects, through improved precision in estimating \(\beta _{t}\); this strategy for decision making in the setting of an adaptive design has been discussed previously [10, 12,13,14]. To be clear, in this model, treatment effects for the early outcomes per se do not provide information on treatment effects for the final trial outcome \(\beta _{t}\). The notation used here for the effect size estimate (\(\beta _{t}\)) reflects the fact that estimation follows from fitting a longitudinal linear model to the totality of outcome data. Methods for estimating \(\beta _{t}\) and \(\text {var}(\beta _{t})\) and example code in R, using all the data available at any time-point during follow-up (FU), are provided in an additional file [see Additional file 2]. The test statistic \(Z =\beta _{t}/\mathrm {sd}(\beta _{t})\) is used to make stopping decisions at the interim analyses using estimates of the covariance parameters (i.e. the correlations between outcomes \(\rho\) and standard deviations of the outcomes \(\sigma\)).

The interim analyses are triggered at pre-set (expected) information thresholds, with observed information during recruitment and follow-up given by \(I=1/\text {var}(\beta _{t})\). In addition, as a means to assess the importance of the early outcome data in modifying estimates of \(Z\), an analysis was undertaken that forced all the correlations to be zero; i.e. an analysis that uses final outcome data only. We designate the resulting estimates, which show the evidence for treatment effects using final outcome data only, as \(\beta 0_{t}\) and \(\text {var}(\beta 0_{t})\), with \(Z0=\beta 0_{t}/\mathrm {sd}(\beta 0_{t})\) and \(I0=1/\text {var}(\beta 0_{t})\).
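To give a feel for how a correlated early outcome adds information, the sketch below uses a standard large-sample approximation for a bivariate normal outcome with monotone missingness, rather than the longitudinal model of Additional file 2. It assumes a single early outcome, a common standard deviation \(\sigma\) in both arms, and per arm \(n_{t}\) participants with the final outcome out of \(n_{s} \ge n_{t}\) with the early outcome, so that \(\text {var}(\beta _{t}) \approx 2\sigma ^{2}\left[ (1-\rho ^{2})/n_{t} + \rho ^{2}/n_{s}\right]\). The function name and all numbers are hypothetical.

```python
def information(n_final: int, n_early: int, sigma: float, rho: float) -> float:
    """Approximate information I = 1/var for a two-arm difference in means.

    n_final: participants per arm with the final outcome observed
    n_early: participants per arm with the early outcome observed
             (assumed to include everyone counted in n_final)
    sigma:   common standard deviation of the final outcome (assumption)
    rho:     correlation between early and final outcomes
    """
    if n_early < n_final:
        raise ValueError("n_early must be at least n_final")
    # Large-sample variance of the estimated final-outcome mean in one arm.
    var_arm = sigma ** 2 * ((1 - rho ** 2) / n_final + rho ** 2 / n_early)
    # Two independent arms of equal size: variances add.
    return 1.0 / (2.0 * var_arm)


# Hypothetical interim snapshot: 50 final and 100 early outcomes per arm.
info_final_only = information(50, 100, sigma=10.0, rho=0.0)   # plays the role of I0
info_with_early = information(50, 100, sigma=10.0, rho=0.5)   # plays the role of I
print(info_final_only, info_with_early)
```

Setting the correlation to zero reproduces the final-outcome-only information, which mirrors the comparison of \(I\) with \(I0\) described above; any positive correlation increases the information available at the same calendar time.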

Estimates \(\beta _{t}\) can be obtained at each of the analyses. At the final analysis at the end of the trial, if complete follow-up data are available for all participants, then \(\beta 0_{t}\) and \(\text {var}(\beta 0_{t})\) will be equal to \(\beta _{t}\) and \(\text {var}(\beta _{t})\). However, for all the RCTs in this study there are some missing data such that there are a number of participants who did not provide final outcomes but had one or more early outcomes. If these early outcomes are correlated with the final outcomes, then they will provide some information on the final outcomes and cause estimates of treatment effects \(\beta _{t}\) to be somewhat different from \(\beta 0_{t}\), and also cause the former to have smaller variances than the latter. If we were reporting a conventional prospectively planned and implemented group sequential trial, rather than the simulated retrospective trials reported here, then we would generally need to adjust effect estimates for potential bias due to the interim analyses; for instance using Todd’s approach [40]. However, here we focus purely on the unadjusted effect estimates and stopping decisions, mainly for simplicity of exposition, as it precludes the need to make adjustments for every different setting of the boundaries for each trial.

Interim analyses

The number of feasible interim analyses for each RCT was determined, in large part, by the expected patterns of recruitment and data accumulation for that RCT. Interim analyses need to occur during the window of opportunity, bounded at the start by the earliest time sufficient data are available for a sensible analysis to occur and at the end by the time when recruitment is completed. After the latter time-point, there is no advantage to stopping a study, as conventionally all participants recruited into the trial should complete follow-up. The number of possible interim analyses for each RCT was determined, before simulating data accumulation for the adaptive design, by a consideration of the likely width of the window of opportunity, which is itself determined by the likely pattern of recruitment and follow-up. We have endeavoured, where possible, to use only the information that would have been available to those designing the trials at the initial stages, when decisions about the likely number of analyses would have had to be made. The lead statisticians from all of the selected trials were consulted on these issues, and the knowledge gained from them and from the published protocols for all the trials was used to inform the designs for each RCT. Details of the original fixed design sample size calculation for each RCT can be found in an additional file [see Additional file 2]. Clearly, if the selected RCTs had been prospectively planned as adaptive designs, then some adjustment to the sample size would have been made to maintain power at the required level. We make no attempt to increase the sample size, to maintain power, in this study but rather focus solely on the stopping decisions at the interim analyses.
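The dependence of the window of opportunity on recruitment length and follow-up can be illustrated with a rough planning calculation. The sketch below assumes uniform recruitment and treats "sufficient data" as a minimum number of participants with an outcome at a given follow-up lag; both are simplifying assumptions made here for illustration, and the function name and numbers are hypothetical.

```python
from typing import Optional, Tuple


def window_of_opportunity(recruit_months: float, lag_months: float,
                          n_total: int, n_min: int) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the window in months from the start of recruitment.

    start: first time n_min participants could have an outcome measured
           lag_months after recruitment, assuming uniform recruitment
    end:   end of recruitment (stopping later saves no recruitment)
    Returns None if the window is empty.
    """
    rate = n_total / recruit_months          # participants recruited per month
    start = lag_months + n_min / rate        # n_min-th participant reaches the lag
    end = recruit_months
    return (start, end) if start < end else None


# Hypothetical trial: 360 participants over 36 months, early outcome at 3 months.
long_window = window_of_opportunity(36, 3, 360, 50)
# Hypothetical short recruitment with a 12 month outcome: no window at all.
no_window = window_of_opportunity(6, 12, 60, 30)
print(long_window, no_window)
```

The second call shows why relying on the final outcome alone can leave no room for interim analyses when recruitment is short relative to follow-up.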

Boundaries and information monitoring

Given the practical constraints imposed by the need for interim analyses to take place during the window of opportunity, we restrict this study to a maximum of three interim analyses, in addition to the final analysis, within any trial. The primary focus of this study is to assess whether and under what circumstances a group sequential design may have resulted in the selected trials stopping early. Many complex interventions (e.g. surgery or physiotherapy) tested in pragmatic publicly funded trials, unlike in the pharmaceutical industry, are licensed for use, without a requirement for information on efficacy that would be required to get them used in practice [41]. Adaptive design methods, which are regularly applied in industry, have for the most part not been used in publicly funded trials [42], and this fact in large part provided the motivation for the selection of the trials described here. They are all publicly funded trials of complex interventions, typically incorporating health economic analysis, in difficult settings, with logistical and practical issues that many believe make adaptive trials difficult or almost impossible. We do not share this view, but rather believe that study designs using early looks at emerging data to assess stopping would have been perfectly possible and good options for all the selected trials. An early futility assessment has the potential to increase efficiency, save patients and decrease costs in publicly funded trials, and many trialists and statisticians suggest that, where possible, investigators should aim to include a futility analysis in their designs for such trials [41]. For these reasons, and the possibility of obtaining more enlightening results, we choose to focus mainly on futility stopping in our work. If we had chosen trials of a very different type (i.e. testing simple interventions), then we would likely have placed a much greater focus on efficacy stopping.

We adopt a range of previously suggested futility boundaries [10], which we label as (a-d). These are defined by stopping probabilities in the setting of up to three interim analyses, and represent a sequence of four increasingly aggressive options, from a low probability of stopping for futility, labelled as (a), to a high probability, labelled as (d), with (b) and (c) intermediate to these. Table 2 shows the probabilities of stopping and rejecting the null hypothesis (H0) in favour of the alternative (\(\alpha ^{*}_{u}\); efficacy), and the probabilities of stopping without rejecting H0 (\(\alpha ^{*}_{l}\); futility), for the four settings (a-d) for one, two and three interim analyses, under the null hypothesis that there is no difference between the two treatment groups. The stopping probabilities from Table 2 are used to construct appropriate boundaries for standardized test statistics at each of the planned interim analyses for each trial. This required us to make some assumptions, based on what we believe the trial team may have thought prior to the commencement of recruitment, about (i) the number of possible interim analyses, (ii) the expected standard deviations (\(\sigma ^{*}_{t}\)) and correlations (\(\rho ^{*}_{s,t}\)) between the early and final endpoints and (iii) the number of data-points that may have been available at each of the interim analyses; we use the \(*\) notation to distinguish expected values from observed values hereafter. Values for \(\sigma ^{*}\) were taken from the original (fixed design) sample size calculations, reported in the published trial protocols. The correlations \(\rho ^{*}_{s,t}\), which were generally unknown, were arbitrarily set for all pairs of outcomes to be \(\rho ^{*}_{s,s^{\prime }}=0.5\), to reflect an expectation of moderate to strong associations. In reality, if the trials had been planned prospectively using a group sequential design, more realistic estimates of \(\rho ^{*}_{s,t}\) would have been used (e.g. from historical or pilot data) to determine stopping boundaries. The expected values of the covariance parameters are used to calculate the expected information necessary to trigger each interim analysis (\(I^{*}\)), which, together with the settings of Table 2, allows us to define stopping boundaries for the observed test statistics for the settings (a-d) for up to three interim analyses; further details can be found in an additional file [see Additional file 2].
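One way to see how cumulative stopping probabilities under H0 and expected information levels translate into boundaries for the test statistic is the Monte Carlo sketch below. It is an illustration under stated assumptions, not the calculation used for the designs reported here (see Additional file 2 for that): the stopping probabilities and information levels are hypothetical and are not the values of Table 2, and the function `mc_boundaries` is introduced only for this example. Under H0 the score statistic at look k is simulated as a sum of independent normal increments with variance equal to the increase in information, and boundaries are read off as empirical quantiles among the paths that have not yet stopped.

```python
import random
from statistics import NormalDist


def mc_boundaries(info, cum_lower, cum_upper, n_paths=100_000, seed=1):
    """Approximate (lower, upper) Z boundaries at each look under H0.

    info:      increasing expected information levels, one per look
    cum_lower: cumulative probabilities of stopping for futility under H0
    cum_upper: cumulative probabilities of stopping for efficacy under H0
    """
    rng = random.Random(seed)
    scores = [0.0] * n_paths           # score statistic, N(0, I_k) under H0
    alive = list(range(n_paths))       # paths that have not yet stopped
    bounds = []
    prev_info = prev_lo = prev_up = 0.0
    for info_k, a_lo, a_up in zip(info, cum_lower, cum_upper):
        step_sd = (info_k - prev_info) ** 0.5
        for i in alive:
            scores[i] += rng.gauss(0.0, step_sd)
        z = sorted((scores[i] / info_k ** 0.5, i) for i in alive)
        # Number of paths that must stop at this look on each side.
        n_lo = round((a_lo - prev_lo) * n_paths)
        n_up = round((a_up - prev_up) * n_paths)
        lower = z[n_lo - 1][0] if n_lo > 0 else float("-inf")
        upper = z[len(z) - n_up][0] if n_up > 0 else float("inf")
        bounds.append((lower, upper))
        alive = [i for _, i in z[n_lo:len(z) - n_up]]
        prev_info, prev_lo, prev_up = info_k, a_lo, a_up
    return bounds


# Hypothetical settings: two looks, not taken from Table 2.
bounds = mc_boundaries(info=[1.0, 2.0], cum_lower=[0.1, 0.3], cum_upper=[0.001, 0.005])
# At the first look the boundaries should be close to plain normal quantiles.
first_lower_exact = NormalDist().inv_cdf(0.1)
first_upper_exact = NormalDist().inv_cdf(1 - 0.001)
print(bounds, first_lower_exact, first_upper_exact)
```

At the first look the simulated boundaries can be checked against the exact normal quantiles; at later looks the boundaries depend on which paths have already stopped, which is why a joint calculation (by simulation here, or by numerical integration in standard software) is needed.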

Table 2 Four test settings (a-d) for futility and efficacy stopping with cumulative probabilities under the null hypothesis, \(\alpha ^{*}_{l}\) and \(\alpha ^{*}_{u}\) for one, two and three interim analyses

Implementation

For each of the selected RCTs, data accumulation was simulated using the observed recruitment, so that data became available in the same order as they would have in real time in the original trial. Information monitoring begins once sufficient data are available to estimate accumulated information, and continues on a regular basis (every two weeks) to reflect what would likely have happened in the trial if an adaptive design had been implemented. Once the information level required to trigger an interim analysis is reached, the test statistic is calculated and compared to the stopping boundaries. Decisions about whether the simulated group-sequential trial would have stopped, either for efficacy or futility, are made by comparing the estimated test statistic at each interim analysis to the stopping boundaries for the four settings (a) to (d). As a comparison, for trials that stopped at an interim analysis, data on all study participants recruited up to that point were used to estimate model parameters in an overrunning analysis [43, 44]; this analysis comprised all the data (complete follow-up) that would eventually have accumulated on those participants already recruited. This process of data collection and decision making continues for subsequent interim analyses or until data accumulation is complete. This simulated process of data monitoring and analysis is equivalent to how the process proceeded in the recently reported START:REACTS study [9], where it worked well and efficiently for the study statistician (who oversaw the routine information monitoring) and the trial team.
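The fortnightly information monitoring described above can be sketched as a simple loop over calendar dates. In the sketch below the function `first_trigger`, the toy outcome dates and the crude information formula (total number of final outcomes divided by \(4\sigma ^{2}\), as for a two-arm comparison of means with equal allocation and no early outcomes) are all assumptions made for illustration; in the simulated trials the information would instead come from the fitted longitudinal model.

```python
from datetime import date, timedelta
from typing import Callable, Optional


def first_trigger(start: date, end: date, info_at: Callable[[date], float],
                  threshold: float, step_days: int = 14) -> Optional[date]:
    """Return the first monitoring date on which information >= threshold.

    Monitoring starts at `start` and is repeated every `step_days` days
    until `end`; returns None if the threshold is never reached.
    """
    d = start
    while d <= end:
        if info_at(d) >= threshold:
            return d
        d += timedelta(days=step_days)
    return None


# Hypothetical data: one final outcome recorded every 7 days from 1 Jan 2020.
outcome_dates = [date(2020, 1, 1) + timedelta(days=7 * i) for i in range(60)]
sigma = 10.0  # assumed standard deviation of the outcome


def toy_information(d: date) -> float:
    """Crude information from outcomes observed on or before date d."""
    n_observed = sum(1 for x in outcome_dates if x <= d)
    return n_observed / (4 * sigma ** 2)


trigger = first_trigger(date(2020, 1, 1), date(2021, 1, 1), toy_information, 0.05)
print(trigger)
```

Because information is only checked every two weeks, the interim analysis is triggered at the first monitoring date on or after the threshold is crossed, so the observed information at the analysis will usually slightly exceed the planned level.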

Results

The recruitment accrual curves, windows of opportunity for stopping (shaded) and planned numbers and occasions for the interim analyses are shown schematically for each of the simulated or re-imagined group-sequential trials in Fig. 1. A detailed description of the results of each of the simulated group-sequential trials is provided in an additional file [see Additional file 1]; for each of the seven selected RCTs this file shows the calculation and justification for the upper and lower stopping boundaries, the numbers of participants providing early and final outcome data, treatment group means and estimates of treatment effects, test statistics, correlations and variances at each interim analysis and at overrunning. The progress of the simulated group-sequential trials is summarised in Fig. 2, which shows stopping boundaries, for all settings (a-d), and test statistics for each RCT, indicating where boundaries were crossed. The most important results from model fitting are also presented here in Tables 3, 4 and 5; these show, respectively, estimates of the treatment effects and test statistics, correlations and standard deviations, and numbers of participants and progress (in months) at each interim analysis for each trial. The results for each trial are summarised in turn below.

Table 3 Estimates of the treatment effects (\(\beta _{t}\) and \(\beta 0_{t}\)) on the primary outcome at time t, test statistics (Z and Z0) and information accrual (I), at each interim analysis and the study end, for each RCT; where \(Z=\beta _{t}/\mathrm {sd}(\beta _{t})\), \(Z0=\beta 0_{t}/\mathrm {sd}(\beta 0_{t})\) and \(I=1/\text {var}(\beta _{t})\) and \(I0=1/\text {var}(\beta 0_{t})\). The primary outcome time-point t and the expected information \(I^{*}\), to trigger each interim analysis, are shown for each RCT


Table 4 Estimates of correlations between early and primary outcomes (\(\rho _{s,t}\)) and standard deviations (\(\sigma\)), at each interim analysis and the study end, for each RCT; the expected correlations were \(\rho ^{*}_{s,s^{\prime }}=0.5\) for all pairs of outcomes for all RCTs and the primary outcome time-point t and expected standard deviation (\(\sigma ^{*}_{t}\)) are shown for each RCT


Table 5 Numbers of participants (N) and progress in trial recruitment (total numbers of participants and months of recruitment), at each interim analysis and the study end, for each RCT; the primary outcome time-point t and follow-up (FU) time-points are shown for each RCT


WAT

A single interim analysis was planned for the simulated group-sequential WAT trial. For all four boundary settings tested, the WAT study would not have stopped at the interim analysis, when data were available from \(N_{12m}=10\) participants with 12m outcomes, \(N_{6m}=29\) with 6m outcomes, \(N_{3m}=43\) with 3m outcomes and \(N_{6w}=49\) with 6w outcomes. At this interim analysis \(N=75\) participants had been recruited into the study and follow-up would have been completed in 17 months; this compares to \(N=126\) and 48 months for the original study. The expected standard deviation of the primary outcome (\(\sigma ^{*}_{12m}=9\)), used in the original sample size calculation and to build the group-sequential design, was much larger than the observed value at the interim analysis (\(\sigma _{12m}=5.6\)). Because information is the inverse of the variance of the treatment effect estimate, the required information level was reached sooner, and the interim analysis took place much earlier than planned (i.e. with fewer participants with 12m outcomes than expected; \(N_{12m}=10\) rather than the expected \(N^{*}_{12m}=40\)). However, given the small, but not clinically significant, result observed in the original study, it seems unlikely that any sensible stopping rule would have caused the WAT study to stop early.
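As a rough illustration of why a smaller than expected standard deviation brings the interim analysis forward, the following Python sketch rescales the planned number of participants by the ratio of observed to expected variance. It assumes, as a simplification not made in the paper, that information is proportional to \(n/\sigma ^{2}\) for the primary outcome alone; it ignores the contribution of the early outcomes and their correlations with the primary outcome, so it is not expected to reproduce the observed \(N_{12m}=10\). The function name `scaled_sample_size` is hypothetical.

```python
def scaled_sample_size(n_planned: float, sd_planned: float, sd_observed: float) -> float:
    """Participants needed to reach the same information when the SD changes.

    Assumes information is proportional to n / sd**2 (primary outcome only),
    which is a simplification of the model used in the paper.
    """
    return n_planned * (sd_observed / sd_planned) ** 2


# WAT values from the text: expected N* = 40 with sd* = 9, observed sd = 5.6.
n_needed = scaled_sample_size(40, 9.0, 5.6)
print(round(n_needed, 1))  # 15.5
```

Under this simplified scaling the planned 40 participants with 12m outcomes would reduce to about 15 or 16; the remaining difference from the observed 10 would have to come from the information contributed by the early outcomes, which this sketch does not model.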