Skip to main content

Instrumental variable analysis in the context of dichotomous outcome and exposure with a numerical experiment in pharmacoepidemiology



In pharmacoepidemiology, the prescription preference-based instrumental variables (IV) are often used with linear models to solve the endogeneity due to unobserved confounders even when the outcome and the endogenous treatment are dichotomous variables. Using this instrumental variable, we proceed by Monte-Carlo simulations to compare the IV-based generalized method of moment (IV-GMM) and the two-stage residual inclusion (2SRI) method in this context.


We established the formula allowing us to compute the instrument’s strength and the confounding level in the context of logistic regression models. We then varied the instrument’s strength and the confounding level to cover a large range of scenarios in the simulation study. We also explore two prescription preference-based instruments.


We found that the 2SRI is less biased than the other methods and yields satisfactory confidence intervals. The proportion of previous patients of the same physician who were prescribed the treatment of interest displayed a good performance as a proxy of the physician’s preference instrument.


This work shows that when analysing real data with dichotomous outcome and exposure, appropriate 2SRI estimation could be used in presence of unmeasured confounding.

Peer Review reports


In observational studies, unobserved confounding may biase the estimation of target effect. Over the last decade, this issue has recieved a growing attention in the field of epidemiological studies attempting to assess adverse effects of drugs with a few works focusing on instrumental variable (IV) approaches. Instrumental variable estimation is a well known approach for assessing endogeneity in statistical modelling [1, 2]. Endogeneity often arises when a causal model is poorly specified therby introducing a structural bias in the estimation of its parameters. This may result from a measurement error in variables [3], an unobserved variable [4] or an inverse causality between the outcome and some regressors. The general IV method of estimation attempts to remove this bias by using structural equations which incorporate instrumental variables in the model. Several theoretical approaches have been developed to build estimators of parameters and to study the properties of these estimators in causal models with endogeneity (see [5]). A well-known example is the case of linear models in which IV estimation leads to estimators with good properties of convergence such as consistency discussed in [6]. The structural bias can be completely removed in this case.

In pharmacoepidemiology, we most often deal with binary covariables (drug exposure), binary responses (adverse event indicator) and confounding variables which are variables that are correlated with exposure and response. A nonlinear model should be the first choice in this context to match with the specific nature of the variables. However, to quantify the risk of the adverse effect of a treatment in the presence of unobserved confounding, researchers investigating the IV-based estimation often model the probability of dichotomous events as a linear function of covariables, thus ignoring the basic features of a probability. Terza and colleagues [7] investigated the influences of mispecification on the estimation when a linear IV model is used in an inherently nonlinear regression setting with endogeneity. A substantial bias was demonstrated in their results. In the context of pharmacoepidemiology modelling, endogeneity is often due to unobserved confounding and various nonlinear IV methods such as the Generalized Method of Moment (GMM;[8, 9]) or the two-stage residual inclusion (2SRI; [10]) can be used to solve this issue. However in a review of IV methods, Klungel and colleagues [11] claimed that the GMM estimator with the logistic regression model is not consistent for causal Odd Ratio (OR) estimation owing to the non-collapsibility of the OR. Consistency is also not guaranteed for the 2SRI when the regression models are nonlinear in both stages. For these nonlinear IV methods, theoretical results exist under very restricted assumptions which do not cover the possible frameworks of real data. Overall, in the context of binary outcome several simulation studies investigate mainly 2-stage IV methods with the first step being linear and the second step being logistic as in [12]. A few articles concern double probit models (Chapman et al. [13]). Very few address the comparison of GMM and 2-stage approaches and none study GMM, 2-stage double logistic using the prescription preference-based instrumental variables.

In a simulation study, we compare the IV-based GMM and the 2SRI methods to the conventional method which does not account for endogeneity. These comparisons are based on the estimation of the regression coefficient of the exposure varible in nonlinear logistic model. Our numerical comparison of the methods involves several scenarios with different confounding levels and different instrument strengths for which computation formulas are established in the context of dichotomous outcome and exposure. We recall the general formula of the covariance matrix for the two-step estimation methods and give the corresponding expression for two-stage nonlinear least squares method in the context of logistic regressions.

The paper is organized as follows: we specify the model and describe the methods of estimation that will be analysed. Then we describe the simulation design, the criteria for evaluating the performances of the methods and the results of our simulations. The final sections discuss the results and make some concluding remarks. Details on the computation of the covariance matrix of the 2SRI method, the instrument’s strength and the confounding level are to be found in the appendices, as well as a detailed description of the simulation model and supplementary results.



We consider a general model with dichotomous outcome and exposure that can be written as

$$ Y = F(\beta_{0} + T\beta_{t}+ X_{1}\beta_{1}+ X_{2}\beta_{2}+ X_{u}\beta_{u}) + e $$
$$ \mathbb{E}(T|Z,X_{1},X_{2},X_{u}) = r(\alpha_{0} + Z\alpha_{z}+ X_{1}\alpha_{1}+ X_{2}\alpha_{2}+ X_{u}) $$

with Y and T as binary outcome (event or not) and treatment (T1 or T2) respectively, X1,X2 some covariables and X u an unobserved confounder of the outcome and treatment. The function F(.)=r(.) denotes the logistic distribution function also known as expit(.): expit(.)=exp(.)/(1+exp(.)) and e denotes the error term. The parameter β=(β0,β t ,β1,β2) denotes the vector of unknown parameters to be estimated. Without a confounding variable, all observed regressors are exogenous. In this case, the true model is written

$$ Y = F(\beta_{0} + T\beta_{t}+ X_{1}\beta_{1}+ X_{2}\beta_{2}) + e $$

and conventional regression methods are suitable for estimating the parameters β. We will denote (3) the conventional model. If this model is adjusted to data in the presence of an unobserved confounder, the estimated coefficients would lead to a bias with a level depending on the confounding level. As the confounder X u is not independent of treatment, the residuals of the conventional model are associated with the treatment. This causes endogeneity so a single regression of the outcome on observed covariables will fail to estimate β t efficiently. A common strategy is to consider another regression model that links the endogenous variable with others. Equation (2) defines the auxiliary model that predicts treatment T as a function of covariables X1, X2, the confounder X u and another variable Z. Variable Z denotes the instrumental variable (or instrument) related to the treatment, i.e. a variable correlated with the treatment and which has no direct association with the outcome.

The bias due to the unobserved confounder can significantly be reduced by means of the two-stage regression model using a valid instrument. As defined by Johnston and colleagues [14] and Greenland [15], a valid instrument must not be correlated with an unobserved confounder or with the error term in the true model (1). Formally, we assume that the instrument Z meets the following assumptions:

  • \({\mathbb {C}ov}(Z,Y|T,X_{u},X_{1},X_{2})=0,\)

  • \({\mathbb {C}ov}(Z,T) \neq 0,\)

  • \({\mathbb {C}ov}(Z,X_{u}) = 0, {\mathbb {C}ov}(Z,X_{1}) = 0\) and \({\mathbb {C}ov}(Z,X_{2}) = 0\).

We also assume that the confounder X u is not associated with the covariables X1 and X2, that is \({\mathbb {C}ov}(X_{u},X_{1}) = 0\) and \({\mathbb {C}ov}(X_{u},X_{2}) = 0\). The main goal is to estimate β t , the treatment coefficient which is the basis of risk evaluation; however an estimation of β t is obtained in general by estimating vector β which is discussed below.

As already proved in the simple case of a linear model (for which the functions F and r are equal to identity in Eqs. (1) and (2)), a high association between the treatment and an instrument should improve the IV estimation of β. Finding a strong instrument is then a crucial step in all procedures of instrumental variable estimation.

In what follows, we first present some specific instrumental variables often used in pharmacoepidemiology, then discuss some IV estimators of β to obtain an estimate of β t before addressing the properties of these estimators.

Instrument in pharmacoepidemiology

An instrumental variable can be determined in many ways, provided that it meets the assumptions listed above. One of the problems is to find a valid instrument with a reasonable strength. The strength of an instrumental variable can be defined as resulting from the level of its association with the endogenous treatment. As such, it could be quantified by using the correlation coefficient between the treatment and the related instrument. In the wide range of pharmacoepidemiologic applications, we can summarize the various instrumental variables in three categories:

Geographical variation. Proximity to the care provider can positively influence access to treatment of a patient compared to others who live far away from health services. To account for this difference between patients, some researchers (see [16]) consider the distance between a patient and a care provider as an instrumental variable. Although this seems realistic as there is no direct association between this distance and the occurrence of disease, the presence or absence of health services can be associated with some socioeconomic characteristics. The latter are often considered as unmeasured confounders that call into question the suitability of using this instrument.

Calendar time. The use of calendar time as an instrument in pharmacoepidemiology often relies on the occurrence of an event that could change the attitude of the physician or patient regarding a treatment. This could be a change in guidelines for example or a change due to the arrival of a new drug on the market. The time from that event to the date of treatment defines the calendar time which clearly affects the outcome of the treatment since the change in physician or patient attitude will be more pronounced immediately after the event has occurred than later. An example of use of calendar time as an instrument can be found in [17].

Physician’s prescription preference. The most often used instrumental variables in pharmacoepidemiology are preference-based [1820]. The issue is to compare the effectiveness of two treatments T1 and T2 when the assignment of treatment to the patient is not randomized. This is the case in observational studies where the prescriber of the treatment (the physician) introduces an effect that influences the outcome via the prescribed treatment. This effect results in the instrumental variable that reflects the influence of care-providers on the patient-treatment relationship. Brookhart and colleagues [21] define this instrumental variable as the “Physician’s Preference” (PP) and propose to use the treatment prescribed by a physician to its previous patient as a proxy of this IV for his/her new patient.

The instrumental variable \(\left (Z_{i}^{*}\right)\) of the ith patient will then be the treatment prescribed to the previous patient of the same physician. As a physician’s preference could change over time, Abrahamowicz and colleagues [22] introduce a new procedure to detect the change point and build a new proxy of PP that includes not only the treatment prescribed to the previous patient but all the previous prescriptions since the change point.

Those instrumental variables and some others are presented in a more detailed form in [23], [24, 25] or in [26] with enlightening discussion on their validity. In this work, we carry out a simulation study to examine how the proxy Z of physician’s preference performs in the context of logistic regression. We also examine the physician’s preference-based IV in the continuous form (see [22]), i.e. the proportion (pr) of all previous patients of the same physician who were prescribed the treatment of interest. This corresponds to the empirical estimator of the probability for a physician to prefer the treatment of interest.

IV Estimation of β t

Estimating β t in model (1) with an unobserved confounder X u amounts to estimating vector β and taking the corresponding component of treatment T. Below, we present two methods that can provide consistent estimation of β and then that of β t : the two-stage residual inclusion (2SRI) method and the generalized method of moment (GMM).

The two-stage residual inclusion method is a modified version of the two-stage least squares (2SLS) method used to estimate the parameters in linear models with instrumental variables. As mentioned by Greene [5], the first stage of the 2SLS method predicts the endogenous variable (the treatment here) using the instruments and other covariables (Eq. (2)). In the second stage, the endogenous variable is just replaced by its prediction from the first stage. This method is called two-stage predictor substitution (2SPS) when the first stage is nonlinear. Unlike the 2SLS, the residuals of the first regression serve as a regressor in the second stage of 2SRI method. This method also generalizes to the nonlinear models i.e. when first and second stages are nonlinear. The rationale of this approach can be intrinsically related to the form of the true model: sometimes, the prediction equation of the outcome includes the error term of the auxiliary regression. An example is the case when the confounder is the only source of error in the auxiliary regression as considered in [27]. In a linear model, both 2SPS and 2SRI approaches are equivalent.

The GMM is an alternative method for obtaining a reliable estimator of parameter β in a model with an endogenous variable. It is based on the classical assumption

$$ \mathbb{E}(e|T,X_{1},X_{2},X_{u}) =0 $$

of the error term in Eq. (1). This assumption does not hold in general because the confounder X u is unobserved (i.e. \(\mathbb {E}(e|T,X_{1},X_{2}) \neq 0)\). In this case, the treatment is endogenous and its coefficient β t cannot be consistently estimated. One suppose in general that there exists some observable instruments w1 such that \(\mathbb {E}(e|w) = 0\) with w=(w1,X1,X2). Typically, w corresponds to the vector of exogenous and endogenous variables with endogenous regressors replaced by their corresponding instruments. Using the law of iterated expectation, the last condition implies

$$ \mathbb{E}(e.w) = 0. $$

The method of moment solves the empirical version of (5) in β, i.e. \(\frac {1}{n} \sum _{i = 1}^{n} e_{i}w_{i} = 0\) where n is the sample size, e i is the ith component of e and w i the ith row of w. In turn, the GMM minimizes the quadratic form

$$ q(\beta) = \left(\frac{1}{n} \sum_{i = 1}^{n} e_{i}w_{i} \right)^{'}\Omega \left(\frac{1}{n} \sum_{i = 1}^{n} e_{i}w_{i}\right) $$

where Ω denotes a weighting matrix. As discussed in [28], there are several choices for matrix Ω leading to different estimators of β. The optimal approach is to define Ω as the inverse of the asymptotic covariance matrix (depending on β) of the estimator. Some alternative procedures are also suggested by Hansen and colleagues [29].


The properties of the 2SRI are addressed by Terza in [27] when the residual also acting as unobserved confounder in the first-stage regression (at Eq. (2)) is additive. Under this assumption and using nonlinear least squares regression in each stage, they show the consistency of the estimator \(\hat {\beta }\) from the second stage. Since the confounder is not additive in the model (2), the first stage estimate of a 2SRI will not be consistent, nor will \(\hat {\beta }\). The residual from the first-stage is indeed an unknown function of an unobserved confounder. Then there is a bias that depends on the form of this unknown function when one applies the 2SRI method to the structural model at Eqs. (1) and (2). The derivation of the covariance matrix of 2SRI estimator follows a two-step regression covariance matrix of the form

$${} \begin{aligned} {\mathbb{V}ar}(\hat{\beta}) &=\! \left(\!A_{22}^{-1} S_{2} A_{22}^{-1^{\prime}}\!\right)\!\!/n +\! \left(\!A_{22}^{-1}A_{21} A_{11}^{-1} S_{1}A_{11}^{-1^{\prime}}A_{21}^{\prime} A_{22}^{-1^{\prime}}\!\right)\!\!/n\\ &\quad- \left(A_{22}^{-1} S_{21} A_{11}^{-1^{\prime}}A_{21}^{\prime} A_{22}^{-1^{\prime}}\right)/n, \end{aligned} $$

where the computation of matrices A and S is given in Appendix A.

The GMM is a well documented estimation procedure. Both in linear and nonlinear models, several results on the estimator have already been established. In the literature on econometric analysis, the nonlinear GMM with an instrumental variable has received particular attention. In the pioneering work by Amemiya [30], the author demonstrated the consistency and derived the asymptotic distribution of the nonlinear two-stage least squares estimator (NL2SLS).

This result provided an important insight into how to handle nonlinear models with endogeneity. Later, Hansen [31] showed the asymptotic properties (consistency and asymptotic distribution) of the GMM to be a kind of generalization of the NL2SLS. More recently, Cameron and Trivedi [28] reviewed the method and gave details on the computation of the estimator’s covariance matrix in some specific cases. Despite the different results on the GMM with an instrumental variable, its performances in terms of bias and variance depend on the validity and strength of the instrument and the nature of the variables in the model. The computation of the covariance matrix of GMM is given in [28] and implemented in dedicated softwares of which [32] is a good example.

Below, we investigate the performances of these methods with numerical experiments.

Simulation design and data generation

A numerical experiment was conducted to investigate and compare several methods of IV estimation in the context of a dichotomous outcome and exposure with endogeneity in pharmacoepidemiology. In this experiment, we cover a wide range of possible scenarios. We choose values of parameter α=(α0,α z ,α1,α2) corresponding to some values of correlation between the variable T=α0+PPα z +X1α1+X2α2+X u and the Physician’s prescribing Preference instrument PP. In fact, we keep α0, α1 and α2 fixed and only α z varies from a scenario to the other. The computation of this correlation is given in Appendix B. It somewhat reflects the strength of the instrument when the confounder and other covariables are kept fixed. For each value of the instrument’s strength, there are three levels of confounding measured by the standard deviation σ u of the confounding variable X u , σ u {0.5,1,1.5} which leads to a set of correlations between T and X u . We then have nine scenarios of strengths and confounding level.

For each scenario, we generate ns=1000 Monte Carlo samples of size n, n=10000,20000 and 30000. The number of patients per physician is kept fixed and equals 100; the confounder X u and covariates X1 and X2 are assumed to have the normal distributions N(0,σ u ), N(−2,1) and N(−3,1) respectively and the physician’s prescribing preference has the Bernoulli distribution B(0.7). We first simulate the covariables X1 and X2, the confounder X u and the physician’s prescribing preference which is the same for all the patients of the same physician. Using the already fixed values of parameters α, the probability p i that patient i will be prescribed the drug of interest is calculated by inverting the logit function, i.e. p i =F(α0+PP i α z +X1iα1+X2iα2+X ui ). Treatment T i of patient i is then generated as a Bernoulli realisation with parameter p i . The same procedure as for the treatment is used to simulate for each patient i, the corresponding outcome y i . We fixed the parameter α such that the proportion of exposed patients ranges between 2 and 6% and the prevalence of the event of interest is chosen to be smaller than 5% to reflect a real-life situation of a new treatment (not frequently prescribed) and rare adverse event. With the fixed value of β, we compute the probability F(β0+T i β t +X1iβ1+X2iβ2+X ui β u ) of the event for patient i and then simulate his outcome y i . We also explored a more balanced situation in term of exposure frenquencies (between 26 and 45%).

Finally, proxy Z of PP is the treatment given to the previous patient and the continuous instrument pr is the proportion of patients of the same physician who were previously prescribed the treatment of interest. More details on the simulation model and the data generating R code are given in Appendix D.

Estimation methods

For the true and conventional models which do not assume endogeneity, the classical one-step regression method without instrumental variable is used. The estimations are performed with the existing regression functions (glm and nls) implemented in R statistical software (R Development core team 2008) in the R package stats.

For the 2SRI method, a two-step regression is used following the procedure outlined in the section dedicated to IV estimation procedure. Recall that the covariance matrices of \(\hat {\beta }\) from the second step regression for both methods retrieved from the software results will not be valid since their calculation ignores the fact that some estimated parameters from the first-stage regression are included in the second stage. Then, we re-evaluated these covariance matrices using the sequential two-step estimation procedure taking into account the fact that an estimated variable is used in the second step. The computation of these covariance matrices is given in Appendix A.

For the GMM, the R package gmm proposed by Chaussé [32] is a very helpful tool for computating parameters and estimating covariance. The user needs to implement the sample version of the moment condition function at Eq. (5) and its gradient if possible; if not, a numerical approximation of the gradient function will be used by the gmm function to perform the estimation.

From these estimations, we calculate the asymptotic covariance matrices and the corresponding confidence intervals whose levels are evaluated below.

Evaluation criteria

To evaluate the performances of the various methods, we consider several criteria including the percentage of relative bias (rB in %) defined as

$$\text{rB} = 100*\frac{1}{ns}\sum_{j=1}^{ns}\left(\frac{\hat{\beta}_{t}^{(j)}}{\beta_{t}}-1\right);$$

the asymptotic standard deviation estimated by the square root of the Monte-Carlo mean of variance \(\bar {\hat {\sigma }}^{2} = \frac {1}{ns}\sum _{j=1}^{ns} \hat {\sigma }_{j}^{2},\) with \(\hat {\sigma }_{j}^{2}\) the asymptotic variance of \(\hat {\beta }_{t}^{(j)}\) and the Monte-Carlo estimator of the true variance \({\mathbb {V}ar}(\hat {\beta }_{t}) = \frac {1}{ns-1}\sum _{j=1}^{ns} \left (\hat {\beta }_{t}^{(j)}- \bar {\hat {\beta }_{t}}\right)^{2}\). We also consider the square root of the mean squares error rMSE given by

$$\text{rMSE} = \sqrt{\frac{1}{ns}\sum_{j=1}^{ns}\left(\hat{\beta}_{t}^{(j)}-\beta_{t} \right)^{2}},$$

and the lower and upper non-coverage probabilities (in %) defined as

$${Er}_{inf} = 100*\frac{1}{ns}\sum_{j=1}^{ns}1_{\left[ \beta_{t} < {IC}_{inf}^{(j)}\right]}$$


$${Er}_{sup} = 100*\frac{1}{ns}\sum_{j=1}^{ns}1_{\left[ \beta_{t} > {IC}_{sup}^{(j)}\right]} $$

where \( IC^{(j)} = \left [{IC}_{inf}^{(j)};{IC}_{sup}^{(j)}\right ]\) denotes the confidence interval of β t from the jth Monte-Carlo sample using the asymptotic distribution of \(\hat {\beta }_{t}^{(j)}\). The nonparametric bootstrap based estimate of variance and non-coverage probabilities are also investigated in these simulations and the results are analysed below. We complete all these criteria by the equivalent of the first-stage F-statistics in linear regression (see [33]) testing instrument exclusion in the treatment choice model.


Table 1 summarizes the performances of each method in terms of relative bias (rB), the standard deviation (sd), the square root of mean squares error (rMSE) and the non-coverage probabilities (pval=Er inf +Er sup ). It presents the results related to instrument pr. There were only slight differences between the results with instrument pr and those with Z, so we omitted the results related to instrument Z. For some samples, the GMM fails to converge owing to singularity problems in the covariance matrix. The estimations from these samples are simply removed (cases marked ‘ −’ in Table 1) for all methods. For the other scenarios, infinite variance estimate or outlier coefficient estimate may be observed; the corresponding samples were dropped for the calculation of the criteria. Table 2 shows the number of samples leading to an outlier estimation (rB>100% or infinite variance) among the 1000 simulated samples. Figure 1 complements the results in Table 1 by displaying the boxplot distributions of rB. Each series of letters a,b,c and d corresponds to the results related to an instrument strength with each letter corresponding to a method as detailed in the legend of Fig. 1. In Table 1 the sd values refer to the Monte-Carlo-based standard deviation. Except for the GMM estimator where outlier values for the asymptotic variance were observed, the Monte-Carlo-based standard deviation, the bootstrap-based estimate (not shown) and the asymptotic estimate were very close.

Fig. 1

Relative bias (rB) of the methods. a: True model b: conventional model; c : 2SRI with instrument pr; d: GMM with instrument pr. Low, Medium and High indicate the corresponding level of confounding and the instrument strength grows from a, b, c, d sequence to the next (from left to right)

Table 1 Performances of methods using instrument pr
Table 2 Number of samples among 1000 leading to outliers in GMM estimation

As expected, the relative bias shows that the estimation from the true and conventional models are insensitive to instrument strength but the confounding level affects the estimation in the conventional model: the relative bias increases with level of confounding. In the presence of a strong instrument, the 2SRI tends to improve the estimate when the level of confounding increases. This trend is reversed when the instrument is weak, i.e. the relative bias and the confounding level have the same direction of variation. The percentage of relative bias of the GMM does not seem too sensitive to instrument strength: it just changes slightly when the strength of the instrument grows. However, this bias increases with the magnitude of confounding, which shows the impact of endogeneity on this method (See Fig. 1).

For the standard deviation (sd) and the square root of the mean squares error (rMSE), the asymptotic results (n=30000) show that both criteria decrease when the level of confounding or instrument strength grows, this being the case for all methods except the GMM for which the rMSE decreases very slowly or remains almost constant in some cases. This trend confirms the already obverved low sensitivity of the GMM to the strength of the instruments used in this simulation. Even though the 2SRI has the larger sd than the other methods in all scenarios with an impact on rMSE in several cases, rMSE for 2SRI method seems improved with high confounding and a strong instrument.

Concerning the non coverage probabilities (pval), the true model estimation and the 2SRI displayed an estimated non-coverage probability around the nominal level of 5% in almost all scenarios. Their values ranged from 4 to 6% and reached 7% in rare cases. The non-coverage probability was very large for the other methods, even with large samples: the results showed an overestimation of the coefficient of treatment for the GMM and conventional approaches. Even at low confounding levels, the conventional method yields very poor coverage probabilities which is coherent with what was observed for the relative bias.

We observe that the metric of the instrument we use retains the same direction of variation with the F-statistics eequivalents (Tables 3 and 4 of Appendix C). Finally, Table 5 in Appendix D summarizes the performances of each method for more balanced exposure frequencies and large sample size (n=30000). In general, performances are less good in comparison with small exposure frequency situation. One could also note that numerical problems arise more often (see Table 6 of Appendix D). Nevertheless, 2SRI is better in term of relative bias and non coverage probability than the conventional method and GMM.

Overall, the results are satisfactory for the 2SRI approach which achieves a similar level of performance to the true model regarding estimation of the confidence interval in the imbalanced situation. We close this section by pointing out the strong numerical instability observed when computing the GMM estimator during these simulations. This could explain the modest performance displayed by the GMM in the simulation results.


In this paper, we focus on the effectiveness of regression coefficient estimation in a context of endogeneity, particularly the endogeneity due to unobserved confounding. We are interested in the coefficient of an endogenous treatment which is the basis of risk assessment in pharmacoepidemiology. Linear models are often used in this context to model the probability of dichotomous events (see [7]). Through a simulation study, we investigate the behavior of parameter estimation in nonlinear models specifically logistic regression using some IV-based methods that could potentially be used to overcome the endogeneity issue. The simulation study also made it possible to assess two preference-based instrumental variables in pharmacoepidemiology.

The results reported from the simulation study show that the 2SRI using nonlinear regression at each stage is an interesting alternative for estimating the coefficient of the endogenous treatment in a logistic regression model. It is very simple to implement and yields satisfactory results regarding the bias and the confidence interval estimate. It was found to yield the most accurate estimate of non-coverage probabilities and thus a more accurate estimate of confidence intervals among the IV-based methods that were compared. However, the conventional approach behaved better than the 2SRI in some cases, especially when the confounding level was weak. We believe that in these cases, the level of confounding is not sufficiently high to require the use of an instrument in the estimation. However, to our knowledge there is still no way to assess the level of unmeasured confounding. For the GMM, the estimation procedure was remarkably unstable. That instability may be attributable to the dichotomous nature of the variables (outcome and exposure) in the context of pharmacoepidemiology with the preference-based instrument. This situation makes the GMM approach is not to be recommended in this context unless another instrument has proved to behave satisfactorily with this method.

Concerning the instruments under investigation in this study, the proportion of all previous patients of the same physician who were prescribed the treatment of interest proved to be a good proxy of the physician’s preference instrument. This instrument was previously considered by Abrahamowicz and colleagues [22] for investigating the detection of a possible change point in the physician’s preference and their results also seemed satisfactory. This proxy of the physician’s preference instrument is thus a credible alternative to the well known other proxy based only on the single patient of the same physician who was prescribed the treatment of interest.

Even though this work throws light on the performances of IV estimators in the context of a nonlinear model with endogeneity, more work is needed to explore the behavior of these estimators in other contexts when the prevalence of exposure and /or outcome varies.


In observational studies, when assessing the effect of drug exposure on a dichotomous outcome, investigators could use appropriate 2SRI estimation to account for unmeasured confounding. This work showed that two logistic regressions as well as a physician’s preference proxy for IV yeald satisfactory results.

Appendix A: Asymptotic variance of the nonlinear 2SRI

Using the sequential two-step estimation procedure, the nonlinear 2SRI estimator minimizes the least squares criterion

$$ Q(\beta) = \frac{1}{2n}\sum_{i=1}^{n} \left(y_{i}-F_{\hat{\alpha}}\left(X_{i\hat{\alpha}}^{\prime}\beta\right)\right)^{2} $$

where \(\hat {\alpha }\) denotes the nonlinear least squares estimator of α from the first stage. If we set \(\eta _{i\beta } = X_{i\hat {\alpha }}^{\prime }\beta,\) the first-order condition in β is given by

$$ - \frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\hat{\alpha}}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta} } \left(y_{i}-F_{\hat{\alpha}}(\eta_{i\beta})\right) = 0. $$

Given that \(\hat {\alpha }\) is a consistent estimator of α0, the Taylor-lagrange expansion of (9) arround the true value (α0,β0) gives

$$\begin{aligned} 0 = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta}} \left(y_{i}-F_{\alpha}(\eta_{i\beta})\right)_{(\alpha_{c};\beta_{c})} +\\ \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial^{2} F_{\alpha}}{\partial \eta_{i\beta} \partial \eta_{i\beta}^{'} }(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\alpha}} \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}} \left(y_{i}-F_{\alpha}(\eta_{i\beta})\right) \right)_{(\alpha_{c};\beta_{c})} (\hat{\alpha}-\alpha_{0}) +\\ \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial^{2} \eta_{i\beta}}{\partial{\alpha} \partial{\beta^{\prime}} } \left(y_{i}-F_{\alpha}(\eta_{i\beta})\right) - \frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\alpha} } \frac{\partial \eta_{i\beta}}{\partial{\beta{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{c};\beta_{c})} (\hat{\alpha}-\alpha_{0}) +\\ \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial^{2} F_{\alpha}}{\partial \eta_{i\beta} \partial \eta_{i\beta}^{\prime} }(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta}} \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}} \left(y_{i}-F_{\alpha}(\eta_{i\beta})\right) - \frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta} } \frac{\partial \eta_{i\beta}}{\partial{\beta{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{c};\beta_{c})} (\hat{\beta}-\beta_{0}) \end{aligned} $$

for some (α c ;β c ) between \((\hat {\alpha };\hat {\beta })\) and (α0;β0). Under the assumption \({\mathbb {E}}\left (\frac {\partial {Q}}{\partial {\beta }}\right)_{\beta _0} = 0\) the terms involving the residuals \(\left (y_{i}-F_{\alpha }(\eta _{i\beta })\right)\) in the above expansion, except the first, all tend in probability to zero and we obtain

$$\begin{aligned} \sqrt{n}(\hat{\beta}-\beta_{0}) \approx \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta} } \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{0};\beta_{0})}^{-1} \left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta}} \left(y_{i}-F_{\alpha}(\eta_{i\beta})\right)\right)_{(\alpha_{0};\beta_{0})}\\ - \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta} } \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{0};\beta_{0})}^{-1} \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\alpha} } \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{0};\beta_{0})} (\hat{\alpha}-\alpha_{0}). \end{aligned} $$

The quantity \(\sqrt {n}(\hat {\beta }-\beta _{0})\) may then be written

$$\begin{aligned} \sqrt{n}(\hat{\beta}-\beta_{0}) &\approx \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta} } \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{0};\beta_{0})}^{-1} \left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta}} \left(y_{i}-F_{\alpha}(\eta_{i\beta})\right)\right)_{(\alpha_{0};\beta_{0})}\\ &\quad- \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta} } \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{0};\beta_{0})}^{-1} \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\alpha} } \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{0};\beta_{0})}\\ &\quad\times \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial r}{\partial \eta_{i\alpha}}(\eta_{i\alpha})\frac{\partial \eta_{i\alpha}}{\partial{\alpha} } \frac{\partial \eta_{i\alpha}}{\partial{\alpha^{\prime}}}\frac{\partial r}{\partial \eta_{i\alpha}}(\eta_{i\alpha}) \right)_{\alpha_0}^{-1} \left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{\partial r}{\partial \eta_{i\alpha}}(\eta_{i\alpha})\frac{\partial \eta_{i\alpha}}{\partial{\alpha}} \left(T_{i}-r(\eta_{i\alpha})\right)\right)_{\alpha_0} \end{aligned} $$

with \(\eta _{i\alpha } = w_{i}^{\prime }\alpha \). The latter is derived from the asymptotic approximation of \(\sqrt {n}(\hat {\alpha }-\alpha _{0})\) using a similar Taylor-Lagrange expansion for the first-stage nonlinear least squares regression. The previous expansion leads to the covariance matrix of \(\hat {\beta }\) of the form

$${} \begin{aligned} {\mathbb{V}ar}(\hat{\beta}) &= \frac{1}{n} \left(A_{22}^{-1} S_{2} A_{22}^{-1^{\prime}} \,+\, A_{22}^{-1}A_{21} A_{11}^{-1} S_{1}A_{11}^{-1^{\prime}}A_{21}^{\prime} A_{22}^{-1^{\prime}}\right.\\ &\left.\quad- A_{22}^{-1} S_{21} A_{11}^{-1^{\prime}}A_{21}^{\prime} A_{22}^{-1^{\prime}}\right). \end{aligned} $$

Under the assumption of independence between observations, the matrices involved in this covariance matrix are given by

$${} \begin{aligned} A_{11} &\,=\, p\lim\! \left(\frac{1}{n} \sum_{i=1}^{n} \frac{\partial r}{\partial \eta_{i\alpha}}(\eta_{i\alpha})\frac{\partial \eta_{i\alpha}}{\partial{\alpha} } \frac{\partial \eta_{i\alpha}}{\partial{\alpha^{\prime}}}\frac{\partial r}{\partial \eta_{i\alpha}}(\eta_{i\alpha}) \right)_{\alpha_{0}}\\ A_{21} &\,=\, p\lim\! \left(\frac{1}{n} \sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\alpha} } \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{0};\beta_{0})}\\ A_{22} &\,=\, p\lim\! \left(\frac{1}{n} \sum_{i=1}^{n} \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta} } \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\eta_{i\beta}) \right)_{(\alpha_{0};\beta_{0})}\\ S_{1} &\,=\, p\lim\! \left(\frac{1}{n} \sum_{i=1}^{n} \frac{\partial r}{\partial \eta_{i\alpha}}(\eta_{i\alpha})\frac{\partial \eta_{i\alpha}}{\partial{\alpha} } U_{i\alpha}^{2}\frac{\partial \eta_{i\alpha}}{\partial{\alpha^{\prime}}}\frac{\partial r}{\partial \eta_{i\alpha}}(\eta_{i\alpha}) \right)_{\alpha_0}\\ S_{2} &= p\lim\! \left(\!\frac{1}{n} \sum_{i=1}^{n}\! \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\!\eta_{i\beta})\frac{\partial \eta_{i\beta}}{\partial{\beta} } U_{i\beta}^{2} \frac{\partial \eta_{i\beta}}{\partial{\beta^{\prime}}}\frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\!\eta_{i\beta})\!\right)_{\!(\alpha_{0};\beta_{0})}\\ S_{21} &\,=\, p\lim\! \left(\!\frac{1}{n} \!\sum_{i=1}^{n}\! \frac{\partial F_{\alpha}}{\partial \eta_{i\beta}}(\!\eta_{i\beta}\!)\frac{\partial \eta_{i\beta}}{\partial{\beta} } U_{i\beta} U_{i\alpha} \frac{\partial \eta_{i\alpha}}{\partial{\alpha^{\prime}}}\frac{\partial r}{\partial \eta_{i\alpha}}(\!\eta_{i\alpha}\!)\!\!\right)_{\!(\alpha_{0};\beta_{0})} \end{aligned} $$

where \(p\lim \) denotes the limite in probability, U α =TP α , U β =yP β , with P α =r(wα) and P β =F α (Xβ).

An estimation of this covariance matrix can be obtained using the plug-in estimator of each matrix involved in its expression. Let \(P_{\hat {\alpha }} = r(w\hat {\alpha }), P_{\hat {\beta }} = F_{\hat {\alpha }}(X\hat {\beta }),\ U_{\hat {\alpha }} = T-P_{\hat {\alpha }}\) and \( U_{\hat {\beta }} = y-P_{\hat {\alpha }},\) then the corresponding estimator of matrices at equation (10) are such that

$$\begin{aligned} \hat{A}_{11} &= \frac{1}{n} \left[P_{\hat{\alpha}}(1-P_{\hat{\alpha}})w \right]^{\prime} [P_{\hat{\alpha}}(1-P_{\hat{\alpha}})w],\\ \hat{A}_{21} &= \frac{\hat{\beta}_{u}}{n} \left[P_{\hat{\beta}}^{2}\left(1-P_{\hat{\beta}}\right)^{2}X\right]^{\prime} [P_{\hat{\alpha}}(1-P_{\hat{\alpha}})w]\\ \hat{A}_{22} &= \frac{1}{n} \left[P_{\hat{\beta}}\left(1-P_{\hat{\beta}}\right)X\right]^{\prime} \left[P_{\hat{\beta}}\left(1-P_{\hat{\beta}}\right)X\right],\\ \hat{S}_{1} &= \frac{1}{n} \left[P_{\hat{\alpha}}(1-P_{\hat{\alpha}})U_{\hat{\alpha}}w\right]^{\prime} [P_{\hat{\alpha}}(1-P_{\hat{\alpha}})U_{\hat{\alpha}}w]\\ \hat{S}_{2} &= \frac{1}{n} \left[P_{\hat{\beta}}(1-P_{\hat{\beta}})U_{\hat{\beta}}X\right]^{\prime} \left[P_{\hat{\beta}}\left(1-P_{\hat{\beta}}\right)U_{\hat{\beta}}X\right],\\ \hat{S}_{21} &= \frac{1}{n} \left[P_{\hat{\beta}}\left(1-P_{\hat{\beta}}\right)U_{\hat{\beta}}X\right]^{\prime} [P_{\hat{\alpha}}(1-P_{\hat{\alpha}})U_{\hat{\alpha}}w]. \end{aligned} $$

Appendix B: Instrument strength and confounding level

We give bellow the computation of instrument strength and confounding level

  • Instrument strength

    The strength of an instrument results from the correlation between it and the corresponding endogenous variable. In the model considered here, the strength of an instrument Z is given by \({\mathbb {C}orr}(T,Z) = \frac {\mathbb {C}ov(T,Z)}{\sqrt {\mathbb {V}ar(T)}\sqrt {\mathbb {V}ar(Z)}}\). As the treatment has a causal link with other covariables Z,X1,X2 and X u , we have

    $${} \begin{aligned} {\mathbb{V}ar}(T) &= {\mathbb{V}ar}_{Z} [\!{\mathbb{E}}(T|Z,X_{1},X_{2},X_{u})]\\ &\quad+ {\mathbb{E}}_{Z}[{\mathbb{V}ar}(T|Z,X_{1},X_{2},X_{u})]. \end{aligned} $$

    If we consider only the explanatory effect of the instrument in treatment T and replace other covariables and the confounder by their average effect, we have \({\mathbb {V}ar}(T) = {\mathbb {V}ar}_{Z} \left (\frac {1}{1+A_Z}\right) + {\mathbb {E}}_{Z}\left (\frac {A_Z}{(1+A_{Z})^{2}}\right)\) with A Z =exp(−(α0+Zα z +μ1α1+μ2α2+μ u )),\(\mu _{1} = {\mathbb {E}}(X_{1}), \mu _{2} = {\mathbb {E}}(X_{2})\) and \(\mu _{u} = {\mathbb {E}}(X_{u})\). This variance may then be written

    $$\begin{array}{*{20}l} {}{\mathbb{V}ar}(T) &=& {\mathbb{E}}_{Z} \left(\!\frac{1}{1+A_Z}\!\right)^{2}\,-\, \left(\!{\mathbb{E}}_{Z} \left(\!\frac{1}{1+A_Z}\!\right)\!\right)^{2}\\ &&+ {\mathbb{E}}_{Z}\!\left(\!\frac{A_Z}{(1+A_{Z})^{2}}\!\right) \end{array} $$
    $$\begin{array}{*{20}l} &=& {\mathbb{E}}_{Z} \!\left(\!\frac{1+A_Z}{(1+A_{Z})^{2}}\!\right)\,-\, \left(\!{\mathbb{E}}_{Z}\! \left(\!\frac{1}{1+A_Z}\!\right)\!\right)^{2} \end{array} $$
    $$\begin{array}{*{20}l} &=& {\mathbb{E}}_{Z}\! \left(\!\frac{1}{1+A_Z}\!\right)\! -\! \left(\!{\mathbb{E}}_{Z}\! \left(\!\frac{1}{1+A_Z}\!\right)\!\right)^{2}. \end{array} $$

    Considering a dichotomous instrument Z having the Bernoulli distribution B(p), we have \({\mathbb {E}}_{Z}\left (\frac {1}{1+A_Z}\right) = \frac {p}{1+A_1} + \frac {1-p}{1+A_0},\) where A j ,j=0,1 is A Z with Z replaced by j. We finally obtain

    $${} \begin{aligned} {\mathbb{V}ar}(T) = \left(\!\frac{p}{1+A_1} + \frac{1-p}{1+A_0}\!\right)\!\left(\!1\,-\,\left(\!\frac{p}{1+A_1} + \frac{1-p}{1+A_0}\right)\!\right)\!. \end{aligned} $$

    We also have

    $$\begin{array}{*{20}l} {\mathbb{E}}(T) &=& {\mathbb{E}}_{Z}({\mathbb{E}}(T|Z,X_{1},X_{2},X_{u})) \end{array} $$
    $$\begin{array}{*{20}l} &=& {\mathbb{E}}_{Z}\left(\frac{1}{1+A_Z}\right) \end{array} $$
    $$\begin{array}{*{20}l} &=& \frac{p}{1+A_1} + \frac{1-p}{1+A_0} \end{array} $$

    and then \({\mathbb {E}}(Z){\mathbb {E}}(T) = \frac {p^{2}}{1+A_1} + \frac {p(1-p)}{1+A_0}\). Furthermore, \({\mathbb {E}}(ZT) = {\mathbb {E}}_{Z}({\mathbb {E}}(ZT|Z,X_{1},X_{2},X_{u})) = {\mathbb {E}}_{Z}(Z{\mathbb {E}}(T|Z,X_{1},X_{2},X_{u}))\) which leads to \({\mathbb {E}}(ZT) ={\mathbb {E}}_{Z}\left (\frac {Z}{1+A_Z}\right) = \frac {p}{1+A_1}\). The covariance between Z and T is then given by

    $$\begin{array}{*{20}l} {\mathbb{C}ov}(Z,T) &=& {\mathbb{E}}(ZT) - {\mathbb{E}}(Z){\mathbb{E}}(T) \end{array} $$
    $$\begin{array}{*{20}l} &=& \frac{p}{1+A_1} - \frac{p^{2}}{1+A_1} - \frac{p(1-p)}{1+A_0}\end{array} $$
    $$\begin{array}{*{20}l} &=& p(1-p)\left(\frac{1}{1+A_1} - \frac{1}{1+A_0}\right). \end{array} $$

    The correlation between Z and T is

    $$ {} {\mathbb{C}orr}(T,Z) = \frac{\left(\frac{1}{1+A_1}-\frac{1}{1+A_0}\right)\sqrt{p(1-p)}}{\sqrt{\left(\frac{p}{1+A_1}+\frac{1-p}{1+A_0}\right)\left(1-\left(\frac{p}{1+A_1}+\frac{1-p}{1+A_0}\right)\right)}}. $$

    Then for a given instrument Z with p fixed in (20), α0, α z ,α1 and α2 can be chosen to reach a value of A Z that leads to a desired value of \({\mathbb {C}orr}(T,Z)\).

    Another criterion that could be used to quantify an instrument’s strength is the correlation between T=α0+Zα z +X1α1+X2α2+X u and the instrument Z. Since \({\mathbb {C}ov}(Z,T^{*}) ={\mathbb {C}ov}(Z,\alpha _{0} + Z \alpha _{z}+ X_{1}\alpha _{1}+ X_{2}\alpha _{2}+ X_{u}) = \alpha _{z}{\mathbb {V}ar}(Z)\) and \( {\mathbb {V}ar}(T^{*}) = \alpha _{z}^{2}{\mathbb {V}ar}(Z) + \alpha _{1}^{2}{\mathbb {V}ar}(X_{1}) + \alpha _{2}^{2}{\mathbb {V}ar}(X_{2}) + {\mathbb {V}ar}(X_{u}),\) under the assumptions on these variables, we have

    $$ {\mathbb{C}orr}(Z,T^{*}) = \frac{\alpha_{z}\sqrt{p(1-p)}}{\left(\alpha_{z}^{2}p(1-p) + \alpha_{1}^{2} \sigma_{1}^{2} +\alpha_{2}^{2} \sigma_{2}^{2} + \sigma_{u}^{2} \right)^{1/2}} $$

    where \(\sigma _{i}^{2} ={\mathbb {V}ar}(X_{i}) \) and \(\sigma _{u}^{2} = {\mathbb {V}ar}(X_{u})\).

  • Confounding level

    A straightforward calculation as above leads to the following correlation between T and X u that expresses the level of confounding.

    $${} {\mathbb{C}orr}(X_{u},T^{*}) = \frac{\sigma_u}{\left(\alpha_{z}^{2}p(1-p) + \alpha_{1}^{2} \sigma_{1}^{2} +\alpha_{2}^{2} \sigma_{2}^{2} + \sigma_{u}^{2} \right)^{1/2}}. $$

Appendix C: F-statistics for each scenario

The following tables display the equivalent of the first-stage F-statistics in linear regression (see [33]) testing instrument exclusion in the treatment choice model.

Table 3 Monte-carlo mean of F-statistics in each scenario using the proportion of patients who received the same treatment as proxy of instrument
Table 4 Monte-carlo mean of F-statistics in each scenario using the treatment prescribed to the last patient as proxy of instrument
Table 5 Performances of methods using instrument pr

Appendix D: Description of simulation model and parameters values, results of the second scenario

For all scenarios, the model generating the binary outcome is the index function

$$Y_{i} = 1(Y_{i}^{*}-\varepsilon_{i}>0), $$

with \(Y_{i}^{*} = \beta _{0} + T_{i}\beta _{t}+ X_{1i}\beta _{1}+ X_{2i}\beta _{2}+ X_{ui}\beta _{u} \) and ε i the standart logistic distribution. T i is the observed binary treatment of the individual i,X1i and X2i some characteristics of patient i and X ui the unmeasured confounding factor. Besides β0 the intercept, β0,β t ,β1,β2 and β u are parameters related to T, X1, X2 and X u respectively. We fixed these parameters β0=−0.6, β t =3, β1=1, β2=1, and β u =1 to keep the prevalence of event less than 5%.

Table 6 Number of samples among 500 leading to outliers in GMM estimation

The treatment choice for the ith patient was generated from a Bernoulli model with success probability p i which depends on the patient’s characteristics X1N(−2,1) and X2N(−3,1), on a binary instrument Zb(0.7) and on the confounding factor X u N(0,σ u ). The probability p i is given by p i =F(α0+Z i α z +X1iα1+X2iα2+X ui ), where F denotes the logistic distribution function. The standard deviation σ u of the confounding factor takes values 0.5, 1 and 1.5 corresponding to Low, Medium and high level of confounding respectively. For a fixed level of confounding, we varied only α z value over {1,2,3} of which each element corresponds to an instrument strength: 1 for “Low” instrument, 2 for “Moderate” and 3 for “High” instrument. All other parameters in the treatment choice model remain constant (α0=0.2, α1=2 and α2=1.2). The data are generated using the R function sim2Logit2().

We investigated the performances of methods in the context of rare exposures (2 to 6%), and rare events (less than 5%). To check whether the results obtained remain valid in the context of higher exposure (near 50%), we design new simulations in which only the intercepts α0 and β0 are modified in the previous design. We held all other parameters constant and fixed α0=5 and β0 = −2.3. The prevalence of exposure ranged then between 26% and 45% whereas that of event is maintained lower than 6%. We present in Table 5 the results from this second scope over 500 Monte Carlo samples of size 30000. Table 6 displays the number of Monte Carlo samples with outliers.

This function allows to simulate a compound logistic model with covariates



Two-Stage Least Squares


Two-Stage Predictor Substitution


Two-Stage Residual Inclusion


Generalized Method of Moment


Institut national de la santé et de la recherche médicale


Instrumental variable


Odd Ratio


Physician Preference


non coverage probability


Relative Bias


root Mean Square Error


standard deviation


Université de Versailles Saint-Quentin-en-Yvelines


  1. 1

    Rassen JA, Schneeweiss S, Glynn RJ, Mittleman MA, Brookhart MA. Instrumental variable analysis for estimation of treatment effects with dichotomous outcomes. Am J Epidemiol. 2009; 169(3):273–84.

    Article  PubMed  Google Scholar 

  2. 2

    Gowrisankaran G, Town RJ. Estimating the quality of care in hospitals using instrumental variables. J Health Econ. 1999; 18(6):747–67.

    Article  PubMed  CAS  Google Scholar 

  3. 3

    Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. Monographs on Statistics and Applied Probability. London: Chapman & Hall; 1995.

    Google Scholar 

  4. 4

    Nestler S. Using instrumental variables to estimate the parameters in unconditional and conditional second-order latent growth models. Struct Equ Model A Multidiscip J. 2015; 22(3):461–73.

    Article  Google Scholar 

  5. 5

    Greene WH. Econometric Analysis, 7th edition, international edition. Boston: Pearson; 2012. pp. 259–89. Previous edition: 2008.

    Google Scholar 

  6. 6

    Bound J, Jaeger DA, Baker RM. Problems with instrumental variables estimation when the correlation between the instruments and the endogeneous explanatory variable is weak. J Am Stat Assoc. 1995; 90(430):443–50.

    Google Scholar 

  7. 7

    Terza JV, Bradford WD, Dismuke CE. The use of linear instrumental variables methods in health services research and health economics: A cautionary note. Health Serv Res. 2008; 43(3):1102–1120.

    Article  PubMed  PubMed Central  Google Scholar 

  8. 8

    Foster EM. Instrumental Variables for Logistic Regression: An Illustration. Soc Sci Res. 1997; 26(4):487–504. Accessed 02 Oct 2015.

    Article  Google Scholar 

  9. 9

    Terza JV. Estimation of policy effects using parametric nonlinear models: a contextual critique of the generalized method of moments. Health Serv Outcome Res Methodol. 2006; 6(3):177–98.

    Article  Google Scholar 

  10. 10

    Cai B, Small DS, Have TRT. Two-stage instrumental variable methods for estimating the causal odds ratio: Analysis of bias. Stat Med. 2011; 30(15):1809–1824.

    Article  PubMed  Google Scholar 

  11. 11

    Klungel OH, Martens EP, Psaty BM, Grobbee DE, Sullivan SD, Stricker BHC, Leufkens HGM, de Boer A. Methods to assess intended effects of drug treatment in observational studies are reviewed. J Clin Epidemiol. 2004; 57:1223–1231.

    Article  PubMed  Google Scholar 

  12. 12

    Palmer TM, Sterne JAC, Harbord RM, Lawlor DA, Sheehan NA, Meng S, Granell R, Smith GD, Didelez V. Instrumental variable estimation of causal risk ratios and causal odds ratios in mendelian randomization analyses. Am J Epidemiol. 2011; 173(12):1392.

    Article  PubMed  Google Scholar 

  13. 13

    Chapman CG, Brooks JM. Treatment effect estimation using nonlinear two-stage instrumental variable estimators: Another cautionary note. Health Serv Res. 2016; 51(6):2375–394.

    Article  PubMed  PubMed Central  Google Scholar 

  14. 14

    Johnston KM, Gustafson P, Levy AR, Grootendorst P. Use of instrumental variables in the analysis of generalized linear models in the presence of unmeasured confounding with applications to epidemiological research. Stat Med. 2008; 27(9):1539–1556.

    Article  PubMed  CAS  Google Scholar 

  15. 15

    Greenland S. An introduction to instrumental variables for epidemiologists. Int J Epidemiol. 2000; 29(4):722–9.

    Article  PubMed  CAS  Google Scholar 

  16. 16

    McClellan M, McNeil BJ, Newhouse JP. Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality?: Analysis using instrumental variables. JAMA. 1994; 272(11):859–66.

    Article  PubMed  CAS  Google Scholar 

  17. 17

    Cain LE, Cole SR, Greenland S, Brown TT, Chmiel JS, Kingsley L, Detels R. Effect of highly active antiretroviral therapy on incident aids using calendar period as an instrumental variable. Am J Epidemiol. 2009; 169(9):1124–1132.

    Article  PubMed  PubMed Central  Google Scholar 

  18. 18

    Brookhart MA, Schneeweiss S. Preference-based instrumental variable methods for the estimation of treatment effects: assessing validity and interpreting results. Int J Biostat. 2007; 3(1):14.

    Article  PubMed Central  Google Scholar 

  19. 19

    Brooks JM, Chrischilles EA, Scott SD, Chen-Hardee SS. Was breast conserving surgery underutilized for early stage breast cancer? instrumental variables evidence for stage ii patients from iowa. Health Serv Res. 2003; 38(6p1):1385–1402.

    Article  PubMed  PubMed Central  Google Scholar 

  20. 20

    Johnston SC. Combining ecological and individual variables to reduce confounding by indication:: Case study–subarachnoid hemorrhage treatment. J Clin Epidemiol. 2000; 53(12):1236–1241.

    Article  PubMed  CAS  Google Scholar 

  21. 21

    Brookhart MA, Wang PS, Solomon DH, Schneeweiss S. Evaluating short-term drug effects using a physician-specific prescribing preference as an instrumental variable. Epidemiology. 2006; 17(3):268–75.

    Article  PubMed  PubMed Central  Google Scholar 

  22. 22

    Abrahamowicz M, Beauchamp ME, Ionescu-Ittu R, Delaney JAC, Pilote L. Reducing the variance of the prescribing preference-based instrumental variable estimates of the treatment effect. Am J Epidemiol. 2011; 174(4):494–502.

    Article  PubMed  Google Scholar 

  23. 23

    Baiocchi M, Cheng J, Small DS. Instrumental variable methods for causal inference. Stat Med. 2014; 33(13):2297–340.

    Article  PubMed  PubMed Central  Google Scholar 

  24. 24

    Uddin MJ, Groenwold RHH, de Boer A, Gardarsdottir H, Martin E, Candore G, Belitser SV, Hoes AW, Roes KCB, Klungel OH. Instrumental variables analysis using multiple databases: an example of antidepressant use and risk of hip fracture. Pharmacoepidemiol Drug Saf. 2016; 25:122–31. PDS-14-0390.R2.

    Article  PubMed  CAS  Google Scholar 

  25. 25

    Uddin MJ, Groenwold RHH, de Boer A, Afonso ASM, Primatesta P, Becker C, Belitser SV, Hoes AW, Roes KCB, Klungel OH. Evaluating different physician’s prescribing preference based instrumental variables in two primary care databases: a study of inhaled long-acting beta2-agonist use and the risk of myocardial infarction. Pharmacoepidemiol Drug Saf. 2016; 25:132–41. PDS-14-0383.R2.

    Article  PubMed  CAS  Google Scholar 

  26. 26

    Brookhart MA, Rassen JA, Schneeweiss S. Instrumental variable methods in comparative safety and effectiveness research. Pharmacoepidemiol Drug Saf. 2010; 19(6):537–54.

    Article  PubMed  PubMed Central  Google Scholar 

  27. 27

    Terza JV, Basu A, Rathouz PJ. Two-Stage Residual Inclusion Estimation: Addressing Endogeneity in Health Econometric Modeling. J Health Econ. 2008; 27(3):531–43. Accessed 02 Oct 2015.

    Article  PubMed  Google Scholar 

  28. 28

    Cameron AC, Trivedi PK. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press; 2005. pp. 166–220.

    Google Scholar 

  29. 29

    Hansen LP, Heaton J, Yaron A. Finite-sample properties of some alternative gmm estimators. J Bus Econ Stat. 1996; 14(3):262–80.

    Google Scholar 

  30. 30

    Amemiya T. The nonlinear two-stage least-squares estimator. J Econ. 1974; 2(2):105–10.

    Article  Google Scholar 

  31. 31

    Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica. 1982; 50(4):1029–1054.

    Article  Google Scholar 

  32. 32

    Chausse P. Computing generalized method of moments and generalized empirical likelihood with r. J Stat Softw. 2010; 34(1):1–35.

    Google Scholar 

  33. 33

    Stock JH, Wright JH, Yogo M. A survey of weak instruments and weak identification in generalized method of moments. J Bus Econ Stat. 2002; 20(4):518–29.

    Article  Google Scholar 

Download references


This work was supported by grants from the Fondation pour la Recherche Médicale (“Iatrogénie des médicaments 2013”) and the Agence Nationale pour la Recherche (“Appel à projet générique 2015”). The funding bodies had no role in the contents and the writing of the manuscript.

Availability of data and materials

The R programm for simulation are available from the corresponding author.

Author information




BFK and PTB conceived the study. BFK performed the simulation study and drafted the manuscript. BFK, PTB and SE interpreted the results, read and approved the final manuscript.

Corresponding author

Correspondence to Babagnidé François Koladjo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Koladjo, B., Escolano, S. & Tubert-Bitter, P. Instrumental variable analysis in the context of dichotomous outcome and exposure with a numerical experiment in pharmacoepidemiology. BMC Med Res Methodol 18, 61 (2018).

Download citation


  • Instrumental variable
  • Nonlinear least squares
  • Logistic regression
  • Physician’s prescription preference
  • Pharmacoepidemiology
  • Observational studies
  • Simulation study