Introducing a new estimator and test for the weighted all-cause hazard ratio
BMC Medical Research Methodology volume 19, Article number: 118 (2019)
Abstract
Background
The rationale for using composite time-to-event endpoints is to increase the number of expected events, and thereby the power, by combining several event types of clinical interest. The all-cause hazard ratio is the standard effect measure for composite endpoints, where the all-cause hazard function is given as the sum of the event-specific hazards. However, the effects of the individual components might differ in magnitude or even in direction, which leads to interpretation difficulties. Moreover, the individual event types are often of different clinical relevance, which further complicates interpretation. Our working group recently proposed a new weighted effect measure for composite endpoints called the ‘weighted all-cause hazard ratio’. By imposing relevance weights for the components, the interpretation of the composite effect becomes more ‘natural’. Although the weighted all-cause hazard ratio seems an elegant solution to overcome interpretation problems, the originally published approach has several shortcomings: First, the proposed point estimator requires prespecification of a parametric survival model. Second, no closed formula for a corresponding test statistic was provided; instead, a permutation test was proposed. Third, no clear guidance for the choice of the relevance weights was provided. In this work, we address these problems.
Methods
Within this work, a new nonparametric estimator and a related closed formula test statistic are presented. The performance of the new estimator and test is compared to the original ones by a Monte-Carlo simulation study.
Results
The original parametric estimator is sensitive to misspecifications of the survival model. The new nonparametric estimator turns out to be very robust even if the required assumptions are not met. The new test shows considerably better power properties than the permutation test and is computationally much less expensive, but it might not preserve the type one error in all situations. A scheme for choosing the relevance weights in the planning stage is provided.
Conclusion
We recommend using the nonparametric estimator along with the new test to assess the weighted all-cause hazard ratio. Concrete guidance for the choice of the relevance weights is now available. Thus, applying the weighted all-cause hazard ratio in clinical applications is both feasible and recommended.
Background
In many clinical trials, the aim is to compare two treatment groups with respect to a rarely occurring event like myocardial infarction or death. In this situation, a high number of patients has to be included and observed over a long period of time to demonstrate a relevant treatment effect and to reach an acceptable power. Combining several events of interest within a so-called composite endpoint can lead to a smaller required sample size and save time, as a higher number of events is meant to increase the power. The common treatment effect measure for composite endpoints is the all-cause hazard ratio. This effect measure is based on the total number of events irrespective of their type. Commonly, either the logrank test or the Cox proportional hazards model [1–4] is used for analysing the all-cause hazard ratio. However, the interpretation of the all-cause hazard ratio as a composite treatment effect can be difficult. This is due to two reasons: First, the composite might not necessarily reflect the effects of the individual components, which can differ in magnitude or even in direction [5–7]. Second, the distinct event types could be of different clinical relevance. For example, the fatal event ‘death’ is more relevant than a non-fatal event like ‘cardiovascular hospital admission’. Moreover, the less relevant event often contributes a higher number of events and therefore has a higher influence on the composite effect than the more relevant event.
Current guidelines on clinical trial methodology hence recommend combining only events of the same clinical relevance [3, 8]. However, this is rather unrealistic in clinical practice, as important components like ‘death’ cannot be excluded from the primary analysis even though a fatal event is clearly more relevant than any non-fatal event. Therefore, to address the problems that arise within the analysis of a composite endpoint, other methods to ease the interpretation of results are needed. An intuitive approach could be to define a weighted composite effect measure with weights that reflect the different levels of clinical relevance of the components. Weighted effect measures have been proposed and compared by several authors [9–14]. Some of the main disadvantages of these approaches include the high dependence on the censoring mechanism and on competing risks [13, 14]. Recently, Rauch et al. [15] proposed a new weighted effect measure called the ‘weighted all-cause hazard ratio’. This new effect measure is defined as the ratio between the weighted averages of the cause-specific hazards of the two groups. Thereby, the predefined weights are assigned to the individual cause-specific hazards. With equal weights for the components, the weighted all-cause hazard ratio corresponds to the common all-cause hazard ratio and thus defines a natural extension of the standard approach.
Although this new weighted effect measure seems an elegant solution to overcome interpretation problems, the originally published approach has several shortcomings: 1. The proposed original estimator for the weighted all-cause hazard ratio requires prespecification of a parametric survival model to estimate the individual cause-specific hazards. The form of the survival model, however, is usually not known in the planning stage of a trial. 2. No closed formula for a corresponding test statistic was introduced; a permutation test was used instead, which comes along with a high computational effort. 3. No clear guidance for the choice of the relevance weighting factors was provided. In this work, we want to address these issues to make the weighted all-cause hazard ratio more appealing for practical application. In particular, we will provide answers to the following questions:

How robust is the original estimator for the weighted all-cause hazard ratio against misspecifications of the underlying parametric survival model?

How robust is the new alternative nonparametric estimator for the weighted all-cause hazard ratio?

How can we derive a closed formula test statistic for testing the weighted all-cause hazard ratio?

How do the different estimators and tests behave in a direct performance comparison?

What are the required steps when choosing adequate weighting factors in the planning stage?
This paper is organized as follows: In the Methods Section, we start by introducing the standard unweighted approach for analysing a composite time-to-first-event endpoint. In the same section, the weighted all-cause hazard ratio is introduced as well as the original parametric estimator and the permutation test as recently proposed by Rauch et al. [15]. A new nonparametric estimator for the weighted all-cause hazard ratio and a related closed formula test are introduced subsequently. Next, we provide step-by-step guidance on the choice of the relevance weighting factors. In the Results Section, the different estimators and tests for the weighted all-cause hazard ratio are compared by means of a Monte-Carlo simulation study to evaluate their performance for various data scenarios, in particular those that meet and those that violate the underlying model assumptions. We discuss our methods and results and finish the article with concluding remarks.
Methods
The standard all-cause hazard ratio
Throughout this work, the interest lies in a two-arm clinical trial where an intervention I is to be compared to a control C with respect to a composite time-to-event endpoint. A total of n individuals are randomized in a 1:1 allocation to the two groups. The composite endpoint consists of k components EP_{j}, j=1,...,k. It is assumed that a lower number of events corresponds to a more favourable result. The observational period is given by the interval [0,τ]. The study aim is to demonstrate superiority of the new intervention and therefore a one-sided test problem is formulated.
Definitions and test problem
The all-cause hazard function for the composite endpoint is parametrized as
\[\lambda _{CE}(t)=\lambda _{CE,0}(t)\exp (\beta _{CE}X_{i}),\quad i=1,...,n,\]
where X_{i} is the treatment indicator which equals 1 when individual i belongs to the intervention group and 0 when it belongs to the control group. Equivalently, the cause-specific hazards for the components are given as
\[\lambda _{EP_{j}}(t)=\lambda _{EP_{j},0}(t)\exp (\beta _{EP_{j}}X_{i}),\quad j=1,...,k.\]
Note that the hazard for the composite endpoint is the sum of the cause-specific hazards for the components [15]
\[\lambda _{CE}(t)=\sum \limits _{j=1}^{k}\lambda _{EP_{j}}(t).\qquad (1)\]
The all-cause hazard ratio for the composite is given as
\[\theta _{CE}=\frac {\lambda ^{I}_{CE}(t)}{\lambda ^{C}_{CE}(t)}=\exp (\beta _{CE}),\]
where the indices I and C denote the group allocation and proportional hazards are assumed so that θ_{CE} is constant in time. Note that the proportional hazards assumption can only hold true for both the composite and the components if equal cause-specific baseline hazards are assumed across all components.
As motivated above, a one-sided test problem for the all-cause hazard ratio is considered. The hypotheses thus read as
\[H_{0}:\theta _{CE}\geq 1\quad \text {versus}\quad H_{1}:\theta _{CE}<1.\qquad (2)\]
Point estimator and test statistic
A semi-parametric estimator \(\widehat {\theta }_{CE}\) for the all-cause hazard ratio can be obtained by means of the partial maximum-likelihood estimator from the well-known Cox model [1].
The most common statistical test to assess the null hypothesis stated in (2) is the logrank test. Let t_{l}, l=1,...,d, denote the distinct ordered event times for the pooled sample of both groups, where d is the total number of observed events irrespective of their type within the observational period [0,τ]. Moreover, let \(d_{EP_{j},l}=d_{EP_{j},l}^{I}+d_{EP_{j},l}^{C},\ l=1,...,d,\ j=1,...,k\) denote the observed number of individuals that experience an event of type j at time t_{l} in the pooled sample, given as the sum of the group-wise numbers of events. Similarly, let \(d_{l}=d_{l}^{I}+d_{l}^{C}=\sum \limits _{j=1}^{k}d_{EP_{j},l}^{I}+\sum \limits _{j=1}^{k}d_{EP_{j},l}^{C}\) denote the observed number of individuals that experience an event of any type at time t_{l} in the pooled sample, given as the sum of the group-wise numbers of events. The number of individuals at risk just before time t_{l} is denoted as \(n_{l}=n_{l}^{I}+n_{l}^{C}\). The Nelson-Aalen estimators for the cumulative all-cause hazard functions over the entire observational period are given as
\[\widehat {\Lambda }^{I}_{CE}(\tau)=\sum \limits _{t_{l}\leq \tau }\frac {d_{l}^{I}}{n_{l}^{I}}\]
and
\[\widehat {\Lambda }^{C}_{CE}(\tau)=\sum \limits _{t_{l}\leq \tau }\frac {d_{l}^{C}}{n_{l}^{C}}.\]
Under the null hypothesis stated in (2), the cumulative all-cause hazards of both groups are equivalent. This means that the sums of the cause-specific hazards are assumed to be equivalent. This does not automatically imply that the cause-specific hazards are also equivalent. However, this more specific assumption is required to deduce the test statistic of the weight-based logrank test. Under the null hypothesis (2) and the additional assumption that the cause-specific hazards are equivalent, the random variable \(D^{I}_{l},\ l=1,...,d^{I}\), for randomly sampling \(d_{l}^{I}\) events from \(n_{l}^{I}\) patients, where \(n_{l}^{I}\) is a subset of the pooled sample with \(n_{l}^{I}+n_{l}^{C}\) individuals including a total of d_{l} events at a fixed time point t_{l}, is hypergeometrically distributed as
\[D^{I}_{l}\sim \text {Hyp}(n_{l},n_{l}^{I},d_{l}).\]
Then the expectation of the sum of the \(D^{I}_{l}\) over all distinct t_{l}≤τ is
\[E\left (\sum \limits _{t_{l}\leq \tau }D^{I}_{l}\right)=\sum \limits _{t_{l}\leq \tau }n_{l}^{I}\frac {d_{l}}{n_{l}}\]
and the variance is given as
\[Var\left (\sum \limits _{t_{l}\leq \tau }D^{I}_{l}\right)=\sum \limits _{t_{l}\leq \tau }\frac {n_{l}^{I}n_{l}^{C}d_{l}(n_{l}-d_{l})}{n_{l}^{2}(n_{l}-1)}.\]
The corresponding logrank test statistic thus reads as [16]
\[LR=\frac {\sum \limits _{t_{l}\leq \tau }\left (d_{l}^{I}-n_{l}^{I}\frac {d_{l}}{n_{l}}\right)}{\sqrt {\sum \limits _{t_{l}\leq \tau }\frac {n_{l}^{I}n_{l}^{C}d_{l}(n_{l}-d_{l})}{n_{l}^{2}(n_{l}-1)}}}.\qquad (3)\]
The test statistic LR is approximately standard normally distributed under the null hypothesis given in (2). Negative values of the test statistic favour the intervention and therefore the null hypothesis is rejected if LR≤−z_{1−α}, where z_{1−α} is the corresponding (1−α)-quantile of the standard normal distribution and α is the one-sided significance level.
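The logrank statistic described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the usual hypergeometric expectation and variance at each distinct event time; the function name `logrank_statistic` and the input layout (parallel lists of times, event indicators and group labels) are hypothetical choices and not taken from the original work.

```python
import math

def logrank_statistic(times, events, groups):
    """One-sided logrank statistic; negative values favour the intervention.

    times  : observed times (event or censoring)
    events : 1 if an event of any type was observed, 0 if censored
    groups : 1 for intervention, 0 for control (hypothetical coding)
    """
    observed_minus_expected = 0.0
    variance = 0.0
    # distinct ordered event times of the pooled sample
    for t_l in sorted({t for t, e in zip(times, events) if e == 1}):
        n_i = sum(1 for t, g in zip(times, groups) if t >= t_l and g == 1)
        n_c = sum(1 for t, g in zip(times, groups) if t >= t_l and g == 0)
        n = n_i + n_c
        d = sum(1 for t, e in zip(times, events) if t == t_l and e == 1)
        d_i = sum(1 for t, e, g in zip(times, events, groups)
                  if t == t_l and e == 1 and g == 1)
        observed_minus_expected += d_i - n_i * d / n
        if n > 1:  # the hypergeometric variance is undefined for n = 1
            variance += n_i * n_c * d * (n - d) / (n ** 2 * (n - 1))
    if variance == 0.0:
        raise ValueError("variance is zero; the statistic is not defined")
    return observed_minus_expected / math.sqrt(variance)
```

For a one-sided level of 0.025 the returned value would be compared with −z_{0.975}, which is roughly −1.96.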
The weighted all-cause hazard ratio
Definitions and test problem
The idea of the weighted all-cause hazard ratio is to replace the standard all-cause hazard given in (1) by a weighted sum of the cause-specific hazards using predefined relevance weights for the individual components that refer to their clinical relevance. The weighted all-cause hazard is then given as
\[\lambda ^{w}_{CE}(t)=\sum \limits _{j=1}^{k}w_{EP_{j}}\lambda _{EP_{j}}(t),\qquad (4)\]
where the non-negative weights \(w_{EP_{j}}\geq 0\), j=1,...,k, reflect the clinical relevance of the components EP_{j}, j=1,...,k. If the weights are all set to 1 \((w_{EP_{1}}=w_{EP_{2}}=...=w_{EP_{k}}=1)\), then the weighted all-cause hazard corresponds to the standard all-cause hazard.
The ‘weighted all-cause hazard ratio’ as proposed by Rauch et al. [15] is then given as
\[\theta ^{w}_{CE}(t)=\frac {\sum \limits _{j=1}^{k}w_{EP_{j}}\lambda ^{I}_{EP_{j}}(t)}{\sum \limits _{j=1}^{k}w_{EP_{j}}\lambda ^{C}_{EP_{j}}(t)},\]
where the indices I and C denote the group allocation. Note that the weighted all-cause hazard ratio is a time-dependent effect measure except for the case of equal baseline hazards across the components [15], which refers to
\[\lambda _{EP_{1},0}(t)=\lambda _{EP_{2},0}(t)=...=\lambda _{EP_{k},0}(t).\]
The weighted all-cause hazard ratio can also be integrated over the complete observational period [0,τ]
In the remainder of the work, we will concentrate on the weighted all-cause hazard ratio at a predefined time point for the sake of simplicity. Again, a one-sided test problem for the weighted all-cause hazard ratio is considered
\[H_{0}:\theta ^{w}_{CE}(t)\geq 1\quad \text {versus}\quad H_{1}:\theta ^{w}_{CE}(t)<1.\qquad (7)\]
The hypotheses to be assessed in the confirmatory analysis are thus equivalent to the common unweighted approach.
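To see why equal baseline hazards remove the time dependence, it may help to write the ratio out. The following short derivation is a sketch that assumes the cause-specific hazards take the proportional form \(\lambda _{EP_{j}}(t)=\lambda _{EP_{j},0}(t)\exp (\beta _{EP_{j}}X_{i})\); the common baseline \(\lambda _{0}(t)\) is notation introduced here for illustration.

\[\theta ^{w}_{CE}(t)=\frac {\sum \limits _{j=1}^{k}w_{EP_{j}}\lambda _{EP_{j},0}(t)\exp (\beta _{EP_{j}})}{\sum \limits _{j=1}^{k}w_{EP_{j}}\lambda _{EP_{j},0}(t)}\]

If \(\lambda _{EP_{j},0}(t)=\lambda _{0}(t)\) for all j, the common factor cancels:

\[\theta ^{w}_{CE}(t)=\frac {\lambda _{0}(t)\sum \limits _{j=1}^{k}w_{EP_{j}}\exp (\beta _{EP_{j}})}{\lambda _{0}(t)\sum \limits _{j=1}^{k}w_{EP_{j}}}=\frac {\sum \limits _{j=1}^{k}w_{EP_{j}}\exp (\beta _{EP_{j}})}{\sum \limits _{j=1}^{k}w_{EP_{j}}}.\]

Under this assumption the weighted all-cause hazard ratio is a weight-averaged mean of the cause-specific hazard ratios and no longer depends on t.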
Original point estimator and test statistic
In order to estimate the weighted all-cause hazard ratio, Rauch et al. [15] proposed to identify and estimate the cause-specific hazards via a parametric survival model. Rauch et al. [15] thereby focused on the Weibull model. This approach is thus based on the assumption that the cause-specific hazards for each component are proportional. With the estimated cause-specific hazards \(\hat \lambda ^{I}_{EP_{j}}\) and \(\hat \lambda ^{C}_{EP_{j}}\) derived from the Weibull model, a parametric estimator for the weighted all-cause hazard ratio is given by
\[\hat \theta ^{w}_{CE}(t)=\frac {\sum \limits _{j=1}^{k}w_{EP_{j}}\hat \lambda ^{I}_{EP_{j}}(t)}{\sum \limits _{j=1}^{k}w_{EP_{j}}\hat \lambda ^{C}_{EP_{j}}(t)}.\qquad (8)\]
The prespecification of a survival model to identify the cause-specific hazards must be seen as a considerable restriction, as the shape of the survival distribution is usually not known in advance. Thus, it is of interest to evaluate how sensitively the parametric estimator reacts when the survival model is misspecified. Moreover, there is a general interest in deriving a less restrictive nonparametric estimator.
A related variance estimator for (8) cannot easily be deduced and thus an asymptotic distribution of the parametric estimator given in (8) is not available. Therefore, Rauch et al. [15] considered a permutation test to test the null hypothesis specified above. For the permutation test, the sampling distribution is built by resampling the observed data. Thereby, the originally assigned treatment group labels are randomly reassigned to the observations without replacement in several runs. Although this is an elegant option without the need to make further restrictive assumptions, the disadvantage is that such a permutation test is not available as a standard application in statistical software but requires implementation. Moreover, depending on the trial sample size and the available computing capacity, this is a very time-consuming approach.
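The permutation idea can be illustrated with a short Python sketch. It is not the implementation used in [15]: the callable `estimator` is a hypothetical placeholder for whatever function returns the estimated weighted all-cause hazard ratio from the data and a vector of group labels, and the `(count + 1) / (n_perm + 1)` correction is one common convention that is assumed here.

```python
import random

def permutation_p_value(data, groups, estimator, n_perm=1000, seed=1):
    """One-sided permutation p-value; small estimates favour the intervention.

    data      : any per-patient data structure understood by `estimator`
    groups    : list of group labels (1 = intervention, 0 = control)
    estimator : hypothetical callable, estimator(data, groups) -> float
    """
    rng = random.Random(seed)
    observed = estimator(data, groups)
    shuffled = list(groups)
    count = 0
    for _ in range(n_perm):
        # reassign the original labels to the patients without replacement
        rng.shuffle(shuffled)
        if estimator(data, shuffled) <= observed:
            count += 1
    # add-one correction keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)
```

Because the estimator is re-evaluated in every run, the runtime grows linearly with `n_perm`, which is the computational burden referred to above.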
New point estimator and closed formula test statistic
To derive the new point estimator, we will assume in the following that the baseline hazards for all individual components and for the composite are equivalent within each group, meaning that
i=1,...,n, and thus (4) reads as
This is a very restrictive assumption that is usually not met in practice. The assumption is only required to formally derive the new nonparametric estimator. We do not generally focus on data situations where this assumption is fulfilled. The estimator is only relevant for practical use if deviations from this assumption produce no relevant bias. This will be investigated in detail in the sections Simulation scenarios and Results.
Under this assumption, the baseline hazards in the representation of the weighted all-cause hazard ratio cancel out. By this, the weighted all-cause hazard ratio is no longer time-dependent. It is therefore possible to replace the cause-specific hazards by the cumulative cause-specific hazards:
where \(\Lambda _{EP_{j}}(t),\ j=1,...,k\), refer to the corresponding cause-specific cumulative hazards over the period [0,t], with X_{i} equal to 1 if individual i belongs to the intervention group and 0 otherwise. This representation can be used to derive a nonparametric estimator for the weighted all-cause hazard ratio using the corresponding nonparametric Nelson-Aalen estimators given as
using the notation given in section Point estimator and test statistic. By this, a nonparametric estimator for the weighted all-cause hazard ratio is given by
In contrast to the parametric estimator \(\hat \theta ^{w}_{CE}(t)\) given in (8), the nonparametric estimator \(\widetilde {\theta }^{w}_{CE}(t)\) given in (10) does not require the prespecification of a survival model. However, the correctness of the nonparametric estimator is still based on the assumption of equal cause-specific baseline hazards. In case the baseline hazards differ, \(\widetilde {\theta }^{w}_{CE}(t)\) can be calculated but represents a biased estimator for \(\theta ^{w}_{CE}(t)\). Therefore, it is of interest to evaluate how sensitively the nonparametric estimator reacts when the equal baseline hazards assumption is violated.
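A minimal Python sketch of the nonparametric estimator follows. It assumes that the estimator is the ratio of the weighted sums of the cause-specific Nelson-Aalen estimates in the two groups, as the description above suggests; the function names, the coding of causes (0 = censored, j = first event of type EP_{j}) and the dictionary of weights are hypothetical choices made for illustration.

```python
def cause_specific_nelson_aalen(times, causes, t_eval):
    """Nelson-Aalen cumulative cause-specific hazards up to t_eval for one group.

    causes: 0 = censored, j >= 1 = first event of type EP_j (assumed coding).
    Returns a dict {j: estimated cumulative hazard of EP_j}.
    """
    cumulative = {}
    event_times = sorted({t for t, c in zip(times, causes)
                          if c != 0 and t <= t_eval})
    for t_l in event_times:
        at_risk = sum(1 for t in times if t >= t_l)
        for t, c in zip(times, causes):
            if t == t_l and c != 0:
                # each event of type c at t_l contributes 1 / n_l
                cumulative[c] = cumulative.get(c, 0.0) + 1.0 / at_risk
    return cumulative

def weighted_hr_nonparametric(times_i, causes_i, times_c, causes_c,
                              weights, t_eval):
    """Ratio of weighted cumulative cause-specific hazards (intervention / control)."""
    na_i = cause_specific_nelson_aalen(times_i, causes_i, t_eval)
    na_c = cause_specific_nelson_aalen(times_c, causes_c, t_eval)
    numerator = sum(w * na_i.get(j, 0.0) for j, w in weights.items())
    denominator = sum(w * na_c.get(j, 0.0) for j, w in weights.items())
    return numerator / denominator  # fails if the control group has no weighted events
```

With all weights equal to 1 the function returns the ratio of the all-cause Nelson-Aalen estimates, which mirrors the unweighted special case.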
An alternative testing procedure to the discussed permutation test can be formulated by a weight-based logrank test statistic derived from a modification of the common logrank test statistic given in (3). We use the expression ‘weight-based logrank test’ instead of ‘weighted logrank test’, as in the literature the weighted logrank test refers to weights which are assigned to the different observation time points, whereas we aim to weight the different event types of a composite endpoint.
Under the null hypothesis (7) and under the assumption that the weighted all-cause hazards are equal between groups, the random variable \(D_{EP_{j},l}^{I},\ j=1,...,k\), for randomly sampling \(d_{EP_{j},l}^{I}\) events of type EP_{j} from \(n_{l}^{I}\) patients, where \(n_{l}^{I}\) is a subset of the pooled sample with \(n_{l}^{I}+n_{l}^{C}\) individuals including a total of d_{l} events at a fixed time point t_{l}, is hypergeometrically distributed as
The expectation of the additive weighted \(D^{I}_{EP_{j},l}\) over all distinct t_{l}≤τ is given as
and the variance is given as
assuming that no events of different types occur at the same time point.
Thus, the weight-based logrank test for the proposed weighted effect measure can be defined analogously to (3) as
Under the null hypothesis of equal weighted composite (cumulative) hazards, the test statistic (11) is approximately standard normally distributed. Hence, the null hypothesis is rejected if LR^{w}≤−z_{1−α}, where z_{1−α} is the corresponding (1−α)-quantile of the standard normal distribution and α is the one-sided significance level.
Note that the common weighted logrank test can be shown to be equivalent to the Cox score test [16] because the weights are working on the coefficient β and thus the partial likelihood and its logarithm can be easily deduced. The intention of the common weighted logrank test is to weight the time points. However, in our weight-based logrank test, the weights have another meaning and are working on the whole hazard, not only on the coefficient. Thus, the log-likelihood translates to a form where the weights are additive and therefore the score test does not translate to the test statistic proposed in this work. This is also the reason why we call our test a ‘weight-based’ and not a ‘weighted’ logrank test. Our test is valid but must be interpreted as a Wald-type test statistic.
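The weight-based logrank test can be sketched as follows. The sketch rests on two assumptions about the form of the statistic: the numerator sums the weighted differences between observed and expected intervention events per event type, and, in line with the assumption that events of different types do not occur at the same time point, the variance sums the squared-weight hypergeometric variances per event type. The function name and input layout are hypothetical.

```python
import math

def weight_based_logrank(times, causes, groups, weights):
    """Weight-based logrank statistic; negative values favour the intervention.

    causes  : 0 = censored, j >= 1 = first event of type EP_j (assumed coding)
    groups  : 1 = intervention, 0 = control
    weights : dict {j: relevance weight of EP_j}
    """
    numerator = 0.0
    variance = 0.0
    for t_l in sorted({t for t, c in zip(times, causes) if c != 0}):
        n_i = sum(1 for t, g in zip(times, groups) if t >= t_l and g == 1)
        n_c = sum(1 for t, g in zip(times, groups) if t >= t_l and g == 0)
        n = n_i + n_c
        for j, w in weights.items():
            d_j = sum(1 for t, c in zip(times, causes) if t == t_l and c == j)
            d_j_i = sum(1 for t, c, g in zip(times, causes, groups)
                        if t == t_l and c == j and g == 1)
            numerator += w * (d_j_i - n_i * d_j / n)
            if n > 1:
                # assumed: type-specific variances add up with squared weights
                variance += (w ** 2 * n_i * n_c * d_j * (n - d_j)
                             / (n ** 2 * (n - 1)))
    if variance == 0.0:
        raise ValueError("variance is zero; the statistic is not defined")
    return numerator / math.sqrt(variance)
```

With all weights equal to 1 and no ties between event types, this reduces to the common logrank statistic, which is a useful sanity check for any implementation.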
Step-by-step guidance for the choice of weights
When using the weighted all-cause hazard ratio as the efficacy effect measure for a composite endpoint, it is important to fix the weights in the planning stage of the study. This can be a quite challenging task, as the choice of the weights strongly influences the final outcome and the interpretation of the results. Thus, it is important to choose the weights in a well-reflected way and not arbitrarily. To help researchers with this task, we provide detailed steps on how to choose appropriate weights for a specific clinical trial situation. When discussing the choice of weights, it must be kept in mind that the standard all-cause hazard, i.e. the unweighted scenario, corresponds to equal weights for all components, implying that event types with a higher event frequency are naturally up-weighted. Therefore, equal weights for all components can be considered at least as arbitrary as predefined weights according to relevance considerations. To define reasonable weights, we first recall the weighted all-cause hazard function as introduced in (4)
\[\lambda ^{w}_{CE}(t)=\sum \limits _{j=1}^{k}w_{EP_{j}}\lambda _{EP_{j}}(t).\]
The weighted all-cause hazard can also be interpreted as the standard all-cause hazard based on modified cause-specific hazards \(\tilde {\lambda }_{EP_{j}}(t)\) where
\[\tilde {\lambda }_{EP_{j}}(t)=w_{EP_{j}}\lambda _{EP_{j}}(t),\quad j=1,...,k.\]
Thus, by introducing the component weights we implicitly modify the event time distribution, that is, the corresponding survival function. When choosing a weight unequal to 1, the survival distribution changes its shape. For a weight larger than 1, the number of events artificially increases and, as a consequence, the survival function decreases sooner. In contrast, for a weight smaller than 1 the survival distribution becomes flatter as the number of events is artificially decreased. Whereas the all-cause hazard ratio can be heavily masked by a large cause-specific hazard of a less relevant component, a more relevant component with a lower number of events can only have a meaningful influence on the composite effect measure when it is up-weighted (or if the less relevant component is down-weighted accordingly). On the contrary, if a large cause-specific hazard is down-weighted, this can result in a power loss. Therefore, weighting can improve interpretation but the effect on power can be positive or negative, depending on the data situation at hand.
In order to preserve the comparability to the unweighted all-cause hazard ratio, we recommend fixing the weight of the most important component, which is often given by ‘death’, to 1. All other weights should then be chosen smaller than or equal to 1. When considering a weight larger than 1 for the most relevant component, this results in endless possibilities and it becomes more difficult to set the weights for the other, less relevant events in an adequate relation. The general recommendation of fixing all weights \(w_{EP_{j}}\leq 1,\ j=1,...,k\) is moreover reasonable because choosing a set of weights that contains values both smaller and greater than 1 can cause a situation where the weighted all-cause hazard is equivalent to the standard all-cause hazard. This is problematic because in this case we cannot differentiate whether the effect is due to the weighting scheme or due to the underlying cause-specific hazards. For illustration of the latter problem, consider two event types EP_{1} and EP_{2} with exponentially distributed event times, where EP_{1} corresponds to the more relevant endpoint
This leads to the standard all-cause hazard
If the weights are chosen as \(w_{EP_{1}}=1.3\) and \(w_{EP_{2}}=0.8\), the weighted cause-specific hazards are given as
and therefore, the weighted all-cause hazard is equivalently given by
Choosing the weights \(w_{EP_{1}}=1\) and \(w_{EP_{2}}=0.6\) gives the weighted cause-specific hazards
and therefore
where the influence of the weights is now visible. Instead of interpreting the weighted hazards, for the applied researcher it might be easier to consider the corresponding weighted composite survival function \(S_{CE}^{w}(t)\) given as
It can be seen that the weights are still working multiplicatively on the cumulative cause-specific hazards and the event time distributions for the different event types are also connected multiplicatively. By the introduction of the weights, we still assume that an individual can only experience one event but (for weights smaller than 1) fewer individuals experience the event. This means that the expected number of events decreases with a weight smaller than 1. Therefore, the weighted survival function for the composite still corresponds to a time-to-first-event setting but with a proportion of events which is lower compared to the unweighted approach.
Comparing the graphs of the weighted and unweighted event time distributions can be a helpful tool for choosing the weights, as shown in Fig. 1 for the exemplary setting discussed above. It can be seen that both weighting schemes yield a larger difference between the event time distributions when comparing the intervention versus the control; however, the second weighting scheme shows the larger difference.
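A small worked calculation can make the masking problem concrete. The hazard values below are purely hypothetical and are not the values of the example in the text; they are chosen so that the arithmetic is easy to follow. The form of the weighted survival function, \(S^{w}_{CE}(t)=\exp \left (-\sum _{j}w_{EP_{j}}\Lambda _{EP_{j}}(t)\right)\), is likewise an assumption consistent with the multiplicative structure described above. Take constant hazards \(\lambda _{EP_{1}}=0.2\) and \(\lambda _{EP_{2}}=0.3\) in one group, so that the standard all-cause hazard is 0.5.

\[w_{EP_{1}}=1.3,\ w_{EP_{2}}=0.8:\quad \lambda ^{w}_{CE}=1.3\cdot 0.2+0.8\cdot 0.3=0.26+0.24=0.5\]

\[w_{EP_{1}}=1,\ w_{EP_{2}}=0.6:\quad \lambda ^{w}_{CE}=1\cdot 0.2+0.6\cdot 0.3=0.2+0.18=0.38\]

\[S^{w}_{CE}(1)=\exp (-0.5)\approx 0.607\quad \text {versus}\quad S^{w}_{CE}(1)=\exp (-0.38)\approx 0.684\]

In the first scheme the weights cancel each other and the weighted hazard coincides with the unweighted one, whereas the second scheme visibly changes the hazard and the survival probability at t=1.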
In conclusion, we recommend proceeding as follows in order to choose the weights:

1.
Identify the clinically most relevant event type (e.g. ‘death’) and assign a weight of 1.

2.
Choose the order of clinical relevance for the remaining event types. For each event type EP_{j} you should answer the question "How many events of type EP_{j} can be considered as equally harmful as observing one event (or any other number of reference events) in the clinically most relevant endpoint?". For example, if in the example given above 5 events of type EP_{2} are considered as equally harmful as one event of EP_{1}, then the weighting scheme proposed in Scenario B might be preferred. If instead the researcher argues that 5 events of type EP_{2} are considered as equally harmful as 3 events of EP_{1}, then the weighting scheme proposed in Scenario A should be preferred. The weights are thus meant to bring all events to the same severity scale. By assigning a weight of 1 to the most relevant event type, this event type acts as the reference event. Therefore, the weighted survival function and its summarizing measures (median survival, hazard ratio) can be interpreted as a standard survival function for the reference event. For example, if ‘death’ is the reference event, on a population and on an individual patient level, the weighted survival function then expresses the probability to be neither dead nor in a condition considered as equally harmful. The median weighted survival can be interpreted as the time when half of the population is either dead or in an equally harmful condition.

3.
If there are assumptions about the form of the underlying event time distributions, then the functional form of the cause-specific hazards is known. The weighted cause-specific hazards are obtained by simple multiplication with the weighting factors. We recommend choosing several weighting scenarios, plotting the resulting weighted and unweighted event time distributions and investigating graphically how different weights would affect the expected survival time and median survival per group. Moreover, the weighted and unweighted hazard ratios can be analytically deduced and compared. By this, the impact of the weighting scheme becomes more explicit.
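The third step can be explored numerically before plotting anything. The following Python sketch assumes exponentially distributed event times (constant cause-specific hazards), for which the weighted hazard ratio and the weighted median survival have closed forms; the function name and all hazard values in the usage example are hypothetical planning assumptions.

```python
import math

def weighted_summary(hazards_i, hazards_c, weights):
    """Weighted hazard ratio and weighted median survival per group.

    hazards_i, hazards_c : constant cause-specific hazards (intervention, control)
    weights              : relevance weights in the same order as the hazards
    """
    lam_i = sum(w * h for w, h in zip(weights, hazards_i))
    lam_c = sum(w * h for w, h in zip(weights, hazards_c))
    return {
        "hazard_ratio": lam_i / lam_c,
        # median of an exponential distribution with rate lam is log(2) / lam
        "median_intervention": math.log(2) / lam_i,
        "median_control": math.log(2) / lam_c,
    }

# Hypothetical planning assumption: EP_1 (e.g. death) is halved by the
# intervention, EP_2 is frequent and unaffected.
hazards_intervention = (0.10, 0.40)
hazards_control = (0.20, 0.40)
for scheme in [(1.0, 1.0), (1.0, 0.6), (1.0, 0.2)]:
    print(scheme, weighted_summary(hazards_intervention, hazards_control, scheme))
```

Tabulating a few schemes like this shows how strongly down-weighting the frequent, less relevant component moves the composite effect towards the effect on the reference event.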
Simulation scenarios
To provide a systematic comparison of the original parametric estimator \(\widehat \theta ^{w}_{CE}(t)\) to the new nonparametric estimator \(\widetilde \theta ^{w}_{CE}(t)\) for the weighted all-cause hazard ratio, and in order to analyse the performance of the weight-based logrank test compared to the originally proposed permutation test, we performed a simulation study with the software R Version 3.3.3 [17].
Within our simulation study, we investigate various data scenarios for a composite endpoint composed of two components EP_{1} and EP_{2}. We restrict our simulations to weights given by \(w_{EP_{1}}=1,\ w_{EP_{2}}=0.1\) or \(w_{EP_{1}}=0.1,\ w_{EP_{2}}=1\), where the two event types are thus considered to show a considerable difference in clinical relevance. The results for another, less extreme weighting scheme (i.e. weights 1 and 0.7) are provided as Additional file 1. A total of 10 scenarios based on different underlying hazard functions were considered in order to mimic situations where the underlying assumptions of both approaches are fulfilled and those where they are (partly) violated. For the original parametric estimator, the cause-specific hazards were estimated by fitting Weibull models. A total of 1000 data sets each with n=200 patients (100 patients per group) were simulated for each scenario. The number of simulated data sets was limited to 1000 because of the long runtime of the permutation test, which was based on 1000 runs. We used the pseudo-random generator Mersenne Twister [18]. For simulating the underlying event times, the approach described by Bender et al. [19] was used. The minimal follow-up was fixed to either τ=1 or τ=2 year(s). For each scenario the methods were compared on the same data sets. In case of non-convergence of a model, the data set was excluded. Table 1 lists the underlying hazard functions for the different simulation scenarios and summarizes briefly which assumptions are met. In Fig. 2 the corresponding weighted and unweighted event time distributions for the composite for the intervention and the control group are graphically displayed for all 10 scenarios. In addition, the related weighted and unweighted hazard ratios for the composite are visualized along with the unweighted cause-specific effects.
For Scenarios 1 to 7 the event times are Weibull, or exponentially, distributed with the cause-specific hazards of the form [20]
Thereby, κ>0 is the scale parameter and ν>0 is the shape parameter. The investigated scenarios show to some extent the flexibility of the Weibull model. Situations with earlier occurring events for one event type (higher cause-specific hazard) and later occurring events for the other event type (lower cause-specific hazard) are captured as well as situations where the difference in hazards is smaller. In Scenarios 1 to 6 at least one cause-specific hazard is time-dependent, whereas in Scenario 7 the hazards are constant.
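The data-generating step can be illustrated with a short Python sketch based on inverse-transform sampling in the spirit of Bender et al. [19]. Several points are assumptions made for this illustration and are not confirmed by the text: the Weibull cause-specific hazard is taken as κ·ν·t^{ν−1}·exp(β·x), the competing event types are generated from independent latent times of which the earliest is observed, and all patients are administratively censored at τ. The simulation in the article was run in R; the function names and parameter values here are hypothetical.

```python
import math
import random

def weibull_event_time(rng, kappa, nu, beta, x):
    """Inverse-transform draw for the hazard kappa * nu * t**(nu - 1) * exp(beta * x)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return (-math.log(u) / (kappa * math.exp(beta * x))) ** (1.0 / nu)

def simulate_trial(n_per_group, kappas, nus, betas, tau, seed=1):
    """Return a list of (time, cause, group); cause 0 means censored at tau."""
    rng = random.Random(seed)
    data = []
    for x in (0, 1):  # 0 = control, 1 = intervention
        for _ in range(n_per_group):
            # one latent time per event type; the earliest one is observed
            latent = [weibull_event_time(rng, k, v, b, x)
                      for k, v, b in zip(kappas, nus, betas)]
            first = min(latent)
            if first > tau:
                data.append((tau, 0, x))
            else:
                data.append((first, latent.index(first) + 1, x))
    return data

# Hypothetical scenario: two event types, 100 patients per group, tau = 1 year.
trial = simulate_trial(100, kappas=[0.5, 1.0], nus=[1.5, 1.0],
                       betas=[-0.3, 0.0], tau=1.0, seed=7)
```

Setting ν=1 for an event type gives a constant hazard, i.e. exponentially distributed event times.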
The hazard for the composite increases over time for the Scenarios 1 and 3 and decreases for Scenario 2. For the Scenarios 4 and 5 the hazard first decreases and then increases after a while. For the Scenarios 1 and 3 the proportional hazards assumption is fulfilled for each of the event types simultaneously. Also note that in Scenario 3 and partly in the Scenarios 4 and 5 the effects for the event types point into opposite directions. Scenario 6 depicts a situation where no treatment effect for the individual components and the composite exists. In Scenario 7 there are opposite effects for the individual components which cancel out in the combined composite for one weighting scheme.
As we aim to quantify how robust the original parametric estimator for the weighted all-cause hazard ratio based on the Weibull model is when the event times for the components in fact do not follow a Weibull distribution, Scenarios 8 to 10 are based on a Gompertz distribution. Like the Weibull model, the Gompertz model fulfils the proportional hazards assumption, where the hazard is parametrized as
which is also referred to as the Gompertz-Makeham hazard [20, 21]. Again, κ>0 is a scale parameter and ν>0 a shape parameter. In addition, a more general term ε≥−κ defining the intercept is formulated. For all scenarios with Gompertz distributed event times the hazard for the composite increases over time. For the situation where the shape parameters are equal across all event types, the proportional hazards assumption does apply to the composite. This is the case for Scenario 8 but not for Scenarios 9 and 10. The proportional hazards assumption also holds true for each event type separately for Scenarios 8 and 9. In Scenario 10 the proportional hazards assumption is violated for all event types and for the composite.
Results
Table 2 displays the results of the simulation study for Scenarios 1 to 10. Columns 2 and 3 present check boxes for the underlying model assumptions of the two estimators. Especially for those scenarios where some assumptions are violated, we are interested in the (standardized) bias, the relative efficiency, and the coverage of the corresponding confidence interval for the different estimators (see Table 3). The bias is quantified by comparing the mean of the (natural) logarithmized estimators (Table 2, Columns 10 and 11) to the natural logarithm of the true effect (Table 2, Column 7), which is fixed by the simulation setting. For the standardized bias, the bias is divided by the corresponding standard error of the estimated effects. The relative efficiency is the mean square error of the original estimator divided by the mean square error of the nonparametric estimator, so that a relative efficiency smaller than one is in favour of the original estimator. The mean square error is the sum of the squared bias and the squared standard error of the logarithmized estimators. The coverage is the proportion of simulation runs in which the 95% confidence interval for the estimated effect includes the true effect. To determine the confidence intervals, the standard error of the estimated effects is required, which we obtained from the permutation distribution. Again, the logarithmic scale is used so that the estimators' distribution is not skewed and thus their standard deviation and the performance measures in Table 3 can be interpreted. Column 4 of Table 2 shows the time point τ at which the estimators are evaluated, and Columns 5 and 6 show the underlying component weights. Note that by switching the weights between the two components, we implicitly investigate the influence of all hazard combinations when the relevance of the components is reversed.
Column 7 displays the logarithmized true weighted all-cause hazard ratio at time τ, which can be obtained from the underlying cause-specific hazard functions. Columns 8 and 9 show, separately for each event type, the mean number of events averaged over all data sets per scenario and its standard deviation. Columns 10 and 11 show the mean of the logarithmized estimated weighted all-cause hazard ratio and its standard deviation based on the original parametric estimator \(\widehat {\theta }^{w}_{CE}(\tau)\) and on the new nonparametric estimator \(\tilde {\theta }^{w}_{CE}(\tau)\). Columns 12 and 13 show the empirical power values for the originally proposed permutation test based on \(\widehat {\theta }^{w}_{CE}(\tau)\) and for the new weight-based log-rank test based on \(\tilde {\theta }^{w}_{CE}(\tau)\). Note that the reported power values correspond to one-sided tests at a one-sided significance level of 0.025. Table 3 depicts the number of simulations that converged (Columns 2 and 3), the bias (Columns 4 and 5), the standardized bias (Columns 6 and 7), the square root of the mean square error (Columns 8 and 9), the relative efficiency (Column 10), and the coverage (Columns 11 and 12) for the logarithmized original and new estimators.
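The performance measures described above can be sketched in a few lines. This is an illustration only: the function names and the input values are hypothetical, and the standard error is taken here as the empirical standard deviation of the logarithmized estimates across simulation runs, which is an assumption and differs from the permutation-based standard error used in the study.

```python
import math
from statistics import mean, stdev

def performance_measures(log_estimates, log_true, z=1.96):
    """Bias, standardized bias, MSE and coverage on the logarithmic scale.

    Assumption: the standard error is the empirical SD across simulation runs.
    """
    se = stdev(log_estimates)
    bias = mean(log_estimates) - log_true
    # A run "covers" if log_true lies within estimate +/- z * se.
    covered = [abs(est - log_true) <= z * se for est in log_estimates]
    return {
        "bias": bias,
        "standardized_bias": bias / se,
        "mse": bias ** 2 + se ** 2,   # squared bias plus squared standard error
        "coverage": sum(covered) / len(covered),
    }

def relative_efficiency(mse_original, mse_nonparametric):
    """Values below one favour the original (parametric) estimator."""
    return mse_original / mse_nonparametric

# Hypothetical logarithmized estimates from five simulation runs.
log_true = math.log(0.8)
original = performance_measures([-0.35, -0.10, -0.28, -0.15, -0.40], log_true)
nonparametric = performance_measures([-0.25, -0.20, -0.24, -0.19, -0.27], log_true)
print(original, nonparametric)
print(relative_efficiency(original["mse"], nonparametric["mse"]))
```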
Scenarios 1 and 3 reflect situations where the proportional hazards assumption is fulfilled for each component but the Weibull distributed cause-specific hazards are unequal and thus the composite effect is time-dependent. Since the assumptions of the original estimator are fulfilled in these scenarios, it is intuitive that the (standardized) bias of the parametric estimator is small. Although the assumptions of the nonparametric estimator are violated, its bias is still rather small. This good performance is also reflected in the coverage, which is mostly near the anticipated 95%. It is furthermore intuitive that the original estimator most often shows a smaller mean square error than the nonparametric estimator. Note that in Scenario 3 the unweighted effects point in different directions, and the direction of the weighted effect depends on the weighting scheme. In Scenarios 2, 4, and 5 the proportional hazards assumption is fulfilled neither for the components nor for the composite, but the cause-specific hazards still follow a Weibull distribution. For Scenario 2 it can be seen that the estimated weighted effects are the same for both estimators but do not approach the true effect as well as in Scenarios 1 and 3. This is because both approaches need at least the assumption of proportional hazards in the components. A similar outcome would be expected for Scenarios 4 and 5. However, in both scenarios the parametric estimator performs much worse than the nonparametric estimator. This is due to the higher variability of its estimates. For Scenario 6, where there is no effect for the unweighted composite, both approaches perform quite well. For the original estimator this was expected since its assumptions are fulfilled. In Scenario 7 with the weights 1 for event type 1 and 0.1 for event type 2 the true combined treatment effect is 0 (on the logarithmic scale). This is also captured quite well by both estimators.
Note that the composite effect is 0 only for this specific weighting scheme and not for the other weighting schemes. However, the performance of the estimation approaches is also satisfactory for the other weighting schemes. In Scenario 8, Gompertz-Makeham distributed cause-specific hazards are assumed. Thereby, the proportional hazards assumption is fulfilled for the components and the composite. Thus, it is intuitive that the new nonparametric estimator closely coincides with the true effect. However, the parametric estimator based on the Weibull model is relevantly biased independent of the weighting scheme and shows a higher variability. Scenario 9 still depicts Gompertz-Makeham distributed cause-specific hazards, but the proportional hazards assumption is only fulfilled for the components and not for the composite. Although the cause-specific baseline hazards are thus unequal, the nonparametric estimator performs better in this scenario, whereas the parametric estimator shows substantial bias and variability, which might also be due to convergence problems. Scenario 10 represents Gompertz distributed cause-specific hazards where the proportional hazards assumption is fulfilled neither for the components nor for the composite. Compared to the two previous scenarios, the performance of the parametric estimator has improved and is not globally worse than that of the nonparametric estimator. The performance depends on the weighting scheme. Not all τ-weight combinations are displayed here. However, the performance for the combinations not shown is comparable to that of the corresponding scenarios displayed.
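The cancelling effect described for Scenario 7 can be reproduced with a small calculation. The sketch below assumes that the weighted all-cause hazard is the weight-sum of the cause-specific hazards, so that the weighted all-cause hazard ratio is the ratio of these sums between the two groups. The constant hazard values are hypothetical and are not the ones used in Scenario 7.

```python
import math

def weighted_all_cause_hazard_ratio(treat_hazards, control_hazards, weights):
    """Ratio of weight-summed cause-specific hazards (assumed definition)."""
    num = sum(w * h for w, h in zip(weights, treat_hazards))
    den = sum(w * h for w, h in zip(weights, control_hazards))
    return num / den

# Hypothetical constant hazards: event type 1 is more frequent under treatment,
# event type 2 is less frequent under treatment (opposite effects).
treat = (0.25, 0.50)
control = (0.20, 1.00)

log_hr_w1 = math.log(weighted_all_cause_hazard_ratio(treat, control, (1.0, 0.1)))
log_hr_w2 = math.log(weighted_all_cause_hazard_ratio(treat, control, (0.1, 1.0)))
log_hr_unweighted = math.log(weighted_all_cause_hazard_ratio(treat, control, (1.0, 1.0)))

print(log_hr_w1)          # approximately 0: the opposite effects cancel
print(log_hr_w2)          # clearly below 0 when the weights are switched
print(log_hr_unweighted)  # below 0 for the unweighted composite
```

In this toy setting the effects cancel only for the weights (1, 0.1); switching the weights or dropping them gives a non-zero logarithmized effect, which mirrors the dependence on the weighting scheme noted above.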
In conclusion, the original parametric estimator turns out to be sensitive to model misspecifications when estimating the underlying cause-specific hazards, as expressed by most values of the (standardized) bias and the coverage of the confidence intervals for Scenarios 4, 5, 8, 9, and 10. In these scenarios, the performance of the nonparametric estimator tends to be better: not only is the (standardized) bias smaller and the coverage probability better, but the relative efficiency also favours the nonparametric approach. Moreover, in Scenarios 4 and 5 the (standardized) bias of the parametric estimator is larger and its variation is considerably higher, which cannot be explained by the smaller number of converged simulations alone. The higher number of non-converging models for the original approach is a further disadvantage. In scenarios where the assumption of the parametric estimator is fulfilled (Scenarios 1 and 3), its performance tends to be better than that of the nonparametric approach. Although the assumption of equal cause-specific baseline hazards is violated in these scenarios, the performance of the nonparametric estimator is not considerably worse than that of the parametric estimator.
Except for Scenario 1b, the power of the weight-based log-rank test is uniformly equal to or larger than the power of the permutation test. This power advantage occurs in particular in situations where the two point estimators coincide (Scenarios 2a and 2b or 10a) or even when the nonparametric estimator suggests a less extreme effect (Scenarios 8 or 9). For Scenario 6, where there is no effect for the components or for the composite, the permutation test performs better in terms of preserving the type one error. In Scenario 7, where the composite effect is 0 for one weighting scheme, the type one error is preserved by the permutation test as well as by the weight-based log-rank test.
If the weights are chosen to be 1 and 0.7, the performance comparisons lead to essentially the same results (compare Additional file 1). Summarizing the results of our simulation, the new nonparametric estimator and the corresponding weight-based log-rank test outperform the original estimator and the permutation test.
Discussion
In this work, we investigated a new estimator and test for the weighted all-cause hazard ratio, which was recently proposed by Rauch et al. [15] as an alternative effect measure to the standard all-cause hazard ratio to assess a composite time-to-event endpoint. The weighted all-cause hazard ratio as a weighted effect measure for composite endpoints is appealing because it is a natural extension of the all-cause hazard ratio. It allows the influence of event types with greater clinical relevance to be regulated and thereby eases the interpretation of the results. Generally it must be noted that the weighted all-cause hazard ratio was introduced to ease the interpretation of the effect in terms of clinical relevance. The aim of the weighted effect measure is not to decrease the sample size or increase the power. The power of the weighted all-cause hazard ratio can be larger but may also be smaller than the power of the unweighted standard approach.
The original parametric estimator proposed by Rauch et al. [15] requires the specification of a parametric survival model to estimate the cause-specific hazards. Moreover, in the original work by Rauch et al. [15] a permutation test was proposed to test the new effect measure, which comes with a high computational effort. In this work, we overcome these shortcomings by proposing a new nonparametric estimator for the weighted all-cause hazard ratio and a closed-formula test statistic which is given by a weight-based version of the well-known log-rank test.
The simulation study performed within this work shows that the original parametric estimator is sensitive to misspecifications of the underlying cause-specific event time distribution. If there are uncertainties about the underlying parametric model for the identification of the cause-specific hazards, we therefore recommend using the new nonparametric estimator. In fact, the new nonparametric estimator proposed in this work turns out to be more robust even if the required assumption of equal cause-specific baseline hazards is not met. The relative efficiency as well as the coverage also show that the performance of the nonparametric estimator is in most cases at least as good as that of the original parametric estimator. Additionally, in our scenarios convergence problems arose more often when using the parametric estimator. These convergence problems arose in scenarios where the effect of one event type was either very high at the beginning of the observational period or there was nearly no effect at the end of the observational period where the survival function reaches 0. Moreover, the simulation study shows that the new weight-based log-rank test results in considerably better power properties than the originally proposed permutation test in almost all investigated scenarios. In some scenarios the type one error might not be preserved, and it has to be further investigated in which situations exactly this is the case and how it can be addressed. In addition, the weight-based log-rank test is computationally much less expensive. However, one remaining restriction is that confidence intervals cannot be directly provided because the testing procedure is not equivalent to the Cox score test. The only possibility to provide confidence intervals for the weighted hazard ratio would be by means of bootstrapping techniques.
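To make the bootstrap idea concrete, the following is a minimal sketch of a percentile bootstrap interval. It is not the authors' procedure: the text does not specify a bootstrap variant, and the estimator used here (`weighted_rate_ratio`, a ratio of weighted event rates per person-time that is only sensible under constant hazards) is a hypothetical stand-in for the nonparametric estimator. The data format, the simulated hazards, and all function names are assumptions for illustration.

```python
import math
import random

def weighted_rate_ratio(treat, control, weights):
    """Stand-in estimator (NOT the paper's): ratio of weighted events per person-time.

    Each observation is (time, cause); cause 0 means censored and gets weight 0.
    """
    def weighted_rate(group):
        person_time = sum(t for t, _ in group)
        return sum(weights.get(cause, 0.0) for _, cause in group) / person_time
    return weighted_rate(treat) / weighted_rate(control)

def percentile_bootstrap_ci(treat, control, weights, n_boot=500, alpha=0.05, seed=1):
    """Percentile bootstrap on the log scale, resampling patients within each group."""
    rng = random.Random(seed)
    log_estimates = []
    for _ in range(n_boot):
        boot_treat = rng.choices(treat, k=len(treat))
        boot_control = rng.choices(control, k=len(control))
        try:
            log_estimates.append(math.log(weighted_rate_ratio(boot_treat, boot_control, weights)))
        except (ValueError, ZeroDivisionError):
            continue  # resample without weighted events in one group: skip it
    log_estimates.sort()
    n = len(log_estimates)
    lower = log_estimates[int((alpha / 2) * n)]
    upper = log_estimates[int((1 - alpha / 2) * n) - 1]
    return math.exp(lower), math.exp(upper)

def simulate_group(n, hazards, censor_time, rng):
    """Competing events with constant hazards and administrative censoring."""
    causes = list(hazards)
    total = sum(hazards.values())
    data = []
    for _ in range(n):
        t = rng.expovariate(total)
        if t > censor_time:
            data.append((censor_time, 0))
        else:
            cause = rng.choices(causes, weights=[hazards[c] for c in causes])[0]
            data.append((t, cause))
    return data

rng = random.Random(2019)
weights = {1: 1.0, 2: 0.1}                       # hypothetical relevance weights
treat = simulate_group(200, {1: 0.25, 2: 0.50}, 2.0, rng)
control = simulate_group(200, {1: 0.20, 2: 1.00}, 2.0, rng)
estimate = weighted_rate_ratio(treat, control, weights)
lower, upper = percentile_bootstrap_ci(treat, control, weights)
print(estimate, lower, upper)
```

Resampling is done separately within each group so that the group sizes stay fixed; any other estimator of the weighted hazard ratio could be passed through the same resampling loop.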
Apart from investigating the performance of the point estimator and the related statistical test, we additionally provide step-by-step guidance on how to choose the relevance weights for the individual components in the planning stage. It is often criticized that the choice of relevance weights in a weighted effect measure is to a certain extent arbitrary. By applying our step-by-step guidance for the choice of weights, this criticism can be addressed. To be concrete, we propose to choose a weight of 1 for the clinically most relevant component and to choose weights smaller than or equal to 1 for all other components by judging how many events of a certain type would be considered as harmful as one event in the most relevant component. Using this approach for defining the weights, comparability to the unweighted approach is given and the most relevant event serves as a reference. When the shape of the different event time distributions is known in the planning stage, we also recommend looking at the plots of the weighted and unweighted event time distributions for different weight constellations to visually inspect the influence of the weight choice on the shape of the survival curves and on the treatment effect.
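One plausible reading of this weighting rule is that a component receives the weight 1/n if n of its events are judged as harmful as one event of the most relevant component. The sketch below encodes that reading; the function name and the component names are hypothetical, and the reciprocal mapping is an interpretation rather than a formula stated in the text.

```python
def relevance_weights(equivalent_events):
    """Map 'number of events judged as harmful as one reference event' to weights.

    equivalent_events: dict component -> n (n = 1 for the most relevant component).
    Assumption: weight = 1 / n, so the reference component gets weight 1.
    """
    counts = list(equivalent_events.values())
    if min(counts) < 1:
        raise ValueError("counts below 1 would give weights above 1")
    if 1 not in counts:
        raise ValueError("the most relevant component must have count 1")
    return {name: 1.0 / n for name, n in equivalent_events.items()}

# Hypothetical judgement: ten hospitalisations are considered as harmful as one death.
weights = relevance_weights({"death": 1, "hospitalisation": 10})
print(weights)  # {'death': 1.0, 'hospitalisation': 0.1}
```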
Conclusion
In conclusion, we recommend using the new nonparametric estimator along with the weight-based log-rank test to assess the weighted all-cause hazard ratio. When applying the weighting scheme proposed within our step-by-step guidance, the choice of the weights can be motivated with reasonable clinical knowledge. With the results from this work, the weighted all-cause hazard ratio therefore becomes a very attractive new effect measure for clinical trials with composite endpoints.
Availability of data and materials
Simulated data and R programs can be obtained from the authors upon request.
References
 1
Cox DR. Regression models and life-tables. J Royal Stat Soc Ser B (Methodol). 1972; 34(2):187–220.
 2
Lubsen J, Kirwan BA. Combined endpoints: can we use them? Stat Med. 2002; 21(19):2959–70.
 3
U.S. Department of Health and Human Services. Food and Drug Administration, Center for Drug Evaluation and Research (CDER), Center for Biologics Evaluation and Research (CBER), ICH. Guidance for Industry: E9 Statistical Principles for Clinical Trials. 1998. http://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm073137.pdf. Accessed 23 Aug 2017.
 4
Rauch G, Beyersmann J. Planning and evaluating clinical trials with composite time-to-first-event endpoints in a competing risk framework. Stat Med. 2013; 32(21):3595–608.
 5
Bethel MA, Holman R, Haffner SM, Califf RM, Huntsman-Labed A, Hua TA, Murray J. Determining the most appropriate components for a composite clinical trial outcome. Am Heart J. 2008; 156(4):633–40.
 6
Freemantle N, Calvert M. Composite and surrogate outcomes in randomised controlled trials. Am Heart J. 2007; 334(1):756–7.
 7
Freemantle N, Calvert M, Wood J, Eastaugh J, Griffin C. Composite outcomes in randomized trials: greater precision but with greater uncertainty? J Am Med Assoc. 2003; 289(19):756–7.
 8
Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen. General methods - version 5.0. 2017. https://www.iqwig.de/download/allgemeinemethoden_version50.pdf. Accessed 23 Aug 2017.
 9
Pocock SJ, Ariti CA, Collier TJ, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur Heart J. 2012; 33(2):176–82.
 10
Buyse M. Generalized pairwise comparisons of prioritized outcomes in the twosample problem. Stat Med. 2010; 29(30):3245–57.
 11
Péron J, Buyse M, Ozenne B, Roche L, Roy P. An extension of generalized pairwise comparisons for prioritized outcomes in the presence of censoring. Stat Methods Med Res. 2016; 27(4):1230–9.
 12
Lachin JM, Bebu I. Application of the Wei-Lachin multivariate one-directional test to multiple event-time outcomes. Clin Trials. 2015; 12(6):627–33.
 13
Bebu I, Lachin JM. Large sample inference of a win ratio analysis of a composite outcome based on prioritized outcomes. Biostatistics. 2016; 17(1):178–87.
 14
Rauch G, Jahn-Eimermacher A, Brannath W, Kieser M. Opportunities and challenges of combined effect measures based on prioritized outcomes. Stat Med. 2014; 33(7):1104–20.
 15
Rauch G, Kunzmann K, Kieser M, Wegscheider K, Koenig J, Eulenburg C. A weighted combined effect measure for the analysis of a composite time-to-first-event endpoint with components of different clinical relevance. Stat Med. 2018; 37(5):749–67.
 16
Lin RS, León LF. Estimation of treatment effects in weighted log-rank tests. Contemp Clin Trials Commun. 2017; 8(1):147–55.
 17
R Core Team. R: A language and environment for statistical computing. 2018. https://www.rproject.org/. 2017, Version 3.3.3.
 18
Matsumoto M, Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans Model Comput Simul. 1998; 8(1):3–30.
 19
Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005; 24(11):1713–23.
 20
Kleinbaum DG, Klein M. Survival Analysis, A Self-Learning Text, Third Edition. New York: Springer; 2012.
 21
Pletcher SD. Model fitting and hypothesis testing for age-specific mortality data. J Evolution Biol. 1999; 12(3):430–9.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional files
Additional file 1
The additional file contains further simulation results with the distributions described in this work but other weighting schemes. (PDF 172 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ozga, A., Rauch, G. Introducing a new estimator and test for the weighted all-cause hazard ratio. BMC Med Res Methodol 19, 118 (2019). https://0doiorg.brum.beds.ac.uk/10.1186/s1287401907651
DOI: https://0doiorg.brum.beds.ac.uk/10.1186/s1287401907651
Keywords
 Composite endpoint
 Weighted effect measure
 Weight-based log-rank test
 Simulation study