Meta-analytic estimation of measurement variability and assessment of its impact on decision-making: the case of perioperative haemoglobin concentration monitoring
BMC Medical Research Methodology volume 16, Article number: 7 (2016)
Abstract
Background
As a part of a larger Health Technology Assessment (HTA), the measurement error of a device used to monitor the hemoglobin concentration of a patient undergoing surgery, as well as its decision consequences, were to be estimated from published data.
Methods
A Bayesian hierarchical model of measurement error, allowing the meta-analytic estimation of both central and dispersion parameters (under the assumption of normality of measurement errors), is proposed and applied to published data; the resulting potential decision errors are deduced from this estimation. The same method is used to assess the impact of an initial calibration.
Results
The posterior distributions are summarized as mean ± sd (credible interval). The fitted model exhibits a modest mean expected error (0.24 ± 0.73 (−1.23 1.59) g/dL) and a large variability (mean absolute expected error 1.18 ± 0.92 (0.05 3.36) g/dL). The initial calibration modifies the bias (−0.20 ± 0.87 (−1.99 1.49) g/dL), but the variability remains almost as large (mean absolute expected error 1.05 ± 0.87 (0.04 3.21) g/dL). This entails a potential decision error (“false positive” or “false negative”) for about one patient out of seven.
Conclusions
The proposed hierarchical model allows the estimation of the variability from published aggregates, and allows the modeling of the consequences of this variability in terms of decision errors. For the device under assessment, these potential decision errors are clinically problematic.
Background
The CEDIT^{1} is a Health Technology Assessment (HTA) agency within the University Hospitals in Paris (APHP^{2}). Since 1982, it has been in charge of advising the senior management about the adoption and use of innovative medical technologies in APHP’s hospitals.
We had to assess, in a limited time frame, the possible impact of the introduction of a device^{3} monitoring the hemoglobin concentration of patients undergoing surgical intervention. This device produces a measurement (SpHb) of the current hemoglobin concentration by means of a sensor which is a variation of pulse oximetry sensors; this measurement is intended to replace the measurement (tHb) produced by a laboratory analyzer, thus avoiding the wait for laboratory results (which could be important in a surgical context) and the disruption of the laboratory workflow caused by unplanned requests.
Previous studies of this device in various clinical settings showed that its measurement errors were large but almost symmetric around 0. A recent meta-analysis [1] aggregated the results reported in 32 papers, 13 of which reported results of operating room use; the average mean error (bias) in this surgical subgroup was 0.4 g/dL, but the measurement error standard deviation was larger than 1 g/dL in 15 of the 16 measurement series reported by these 13 papers.
The authors report a bias whose confidence interval includes 0, but they state “We have not found any publications that provide statistical methods to quantify the uncertainty of SD in meta-analysis”. Therefore, their clinical conclusions rest on hypotheses about the possible standard deviation of the measurement errors, without estimating it. The authors complete their conclusion on the bias by warning that “the wide LOA [limits of agreement] mean clinicians should be cautious when making clinical decisions based on these devices”.
In order to assess the usability of this device, our HTA therefore required the assessment of decision error risks, hence the need to estimate not only the bias (which can be done by a variety of methods, see [2] for an example), but also the variability of the measurements used in this decision. In other words, the use of this device requires not only the assessment of a (possibly “significant”) bias (i.e. an average error whose confidence/credible interval does not contain 0), but also of its variability (e.g. by estimating its standard deviation). This allows us to estimate the probability of a potential clinical decision error.
However, as pointed out by [1], such methods for meta-analytic assessment of variability are almost nonexistent in the field (see Discussion), hence our proposal.
We also wanted to assess the impact of an initial calibration of the device (proposed by some authors in order to remove patient-specific systematic errors), which consists in subtracting from a given measurement SpHb an initial error SpHb_{0}−tHb_{0} obtained from initial calibrating measurements of SpHb and tHb: \({}_{c}\text{SpHb}=\text{SpHb}-\left(\text{SpHb}_{0}-\text{tHb}_{0}\right)\).
Therefore, we propose a Bayesian model that pools the information given in various papers about the distribution of measurement errors, and we use this estimation to model the clinical decision error risks of these two modes of use of the device.
Methods
Literature review
We repeated the published search strategy of [1] on the PubMed and Embase databases, and augmented this search by a manual search of the references marked as “Related to” by PubMed; we then obtained the full texts of a first selection of papers, whose “References” sections were used to complete the search. Our selection was driven by the following criteria:

The device whose operating characteristics were reported in the paper had to use the same operating principle as our target device.

The paper had to report clinical use during a surgical intervention.

The paper had to report an estimation of both mean and standard deviation of the differences of paired reference (tHb) and device-derived (SpHb) measurements made at the same time, or at least to quote some indicator (such as Bland & Altman’s LOA [3]) allowing these quantities to be reconstructed.
The selected papers were analyzed to extract and/or reconstruct sample sizes, observed point estimates of mean and standard deviation of each study population.
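As an illustration of the reconstruction step, the sketch below recovers a mean and a standard deviation from reported limits of agreement. It assumes the usual Bland & Altman convention, LOA = mean ± 1.96 SD; the function name `loa_to_mean_sd` and the example limits are hypothetical, and a paper using a different multiplier (e.g. 2) would need `z` adjusted accordingly.

```python
def loa_to_mean_sd(lower, upper, z=1.96):
    """Recover (mean, sd) of paired differences from limits of agreement.

    Assumes the limits were computed as mean - z*sd and mean + z*sd.
    """
    mean = (lower + upper) / 2.0       # the limits are symmetric around the bias
    sd = (upper - lower) / (2.0 * z)   # the full width of the interval is 2*z*sd
    return mean, sd


# Hypothetical limits of agreement, in g/dL (not taken from any source paper)
mean, sd = loa_to_mean_sd(-2.0, 2.8)
```

The same inversion would apply to any symmetric interval around the bias, provided the multiplier used by the source paper is known.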
Modeling
For the intended use case (monitoring of hemoglobin concentration in the operating room), the measurement given by reference methods is the only available reference, and the anesthesiologists’ methods are built against this measure. Therefore, we ignored its possible errors and chose to consider tHb as our standard.
In the selected papers, the same patient may have paired tHb/SpHb measurements on one or more occasions; we shall see (Table 1) that in most papers, these different occasions are merged in the same series, without information about intra- and inter-patient variabilities; other papers reported measurements made on different occasions separately, but without information on the possible correlation of measurement errors in the same patient.
Therefore, when a paper reported more than one series of measurement errors (i.e. set of assessments of this error made in the same circumstances on independent patients), these series were kept separate, and analyzed as independent: these series were usually characterized by a factor (e.g. operating phase) strongly linked to hemoglobin concentration, overwhelming the (weak) patient-related factors.
In other words, we ignored a possible “paper” level in our model.
Raw SpHb
We postulated that in each series i in the literature, the individual measurement errors e _{ i,j,k }=SpHb_{ i,j,k }−tHb_{ i,j,k } in patient j of the series i at occasion k are normally distributed (Eq. (1) below). We also postulated that the series-specific means μ _{ i } of measurement errors (i.e. the series-specific biases) are normally distributed in the (hypothetical) population of all possible repetitions of such studies, with a population-level mean μ _{ m } (overall bias) and a population-level standard deviation σ _{ m } (2); similarly, the series-specific standard deviations σ _{ i } are supposed to have a lognormal (μ _{ s },σ _{ s }) distribution in the population (3).
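Written out from the verbal description above, the three postulates can be stated as follows. This is a reconstruction from the text; in particular, writing the second argument of each normal distribution as a standard deviation (rather than a variance) is an assumption.

```latex
\begin{align}
e_{i,j,k} &\sim \mathcal{N}\left(\mu_{i},\ \sigma_{i}\right) \tag{1}\\
\mu_{i} &\sim \mathcal{N}\left(\mu_{m},\ \sigma_{m}\right) \tag{2}\\
\log \sigma_{i} &\sim \mathcal{N}\left(\mu_{s},\ \sigma_{s}\right) \tag{3}
\end{align}
```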
The postulate of normality of measurement errors (1) allows us to use two well-known results of the sampling theory from normal distributions to derive the likelihoods of the usual m and s estimators of μ and σ from a sample of size n:
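These two results are presumably the classical sampling distributions of the mean and of the variance of a normal sample. The form below is a reconstruction consistent with the later discussion (the first holds asymptotically under the central limit theorem, the second only for normal data), with the normal distribution again parametrized by its standard deviation as an assumption.

```latex
\begin{align}
m &\sim \mathcal{N}\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right) \tag{4}\\
\frac{(n-1)\,s^{2}}{\sigma^{2}} &\sim \chi^{2}_{n-1} \tag{5}
\end{align}
```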
(4) and (5) allow us to compute the likelihoods of the published series-level estimators m _{ i } and s _{ i } instead of requiring patient-level data e_{ i,j,k }.
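A minimal sketch of such an aggregate-data likelihood is given below. It assumes the classical results for normal samples (the sample mean is normal with standard deviation σ/√n, and (n−1)s²/σ² follows a χ² distribution with n−1 degrees of freedom) and uses only the Python standard library. The function names are hypothetical and this is not the Stan program of the Additional files.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log-density of a normal distribution parametrized by its sd."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def chi2_logpdf(x, k):
    """Log-density of a chi-square distribution with k degrees of freedom."""
    half_k = 0.5 * k
    return (half_k - 1.0) * math.log(x) - 0.5 * x - half_k * math.log(2.0) - math.lgamma(half_k)


def mean_loglik(m, n, mu, sigma):
    """Log-likelihood of a published sample mean m from n observations."""
    return normal_logpdf(m, mu, sigma / math.sqrt(n))


def sd_loglik(s, n, sigma):
    """Log-likelihood of a published sample sd s from n observations.

    q = (n-1) s^2 / sigma^2 is chi-square(n-1); the last term is the
    log-Jacobian dq/ds = 2 (n-1) s / sigma^2 of the change of variable.
    """
    q = (n - 1) * s * s / (sigma * sigma)
    return chi2_logpdf(q, n - 1) + math.log(2.0 * (n - 1) * s / (sigma * sigma))


def aggregate_loglik(m, s, n, mu, sigma):
    """Joint log-likelihood of (m, s); for normal samples they are independent."""
    return mean_loglik(m, n, mu, sigma) + sd_loglik(s, n, sigma)
```

Summing `aggregate_loglik` over the published series, with one (μ, σ) pair per series, would give the data term of the hierarchical model; the population-level distributions of these pairs then act as their priors.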
Calibrated SpHb
The error for occasion k in patient j in series i, e _{ i,j,k }, is defined by e _{ i,j,k }=SpHb_{ i,j,k }−tHb_{ i,j,k }. The error of _{ c }SpHb (“calibrated error”) _{ c } e _{ i,j,k } will be: \({}_{c}e_{i,j,k}=\text{SpHb}_{i,j,k}-\left(\text{SpHb}_{i,j,0}-\text{tHb}_{i,j,0}\right)-\text{tHb}_{i,j,k}=e_{i,j,k}-e_{i,j,0}\), where the index 0 denotes the calibration occasion.
Now, in each series i, we can decompose e _{ i,j,k } as the sum of a series-specific bias μ _{ i }, a patient-specific random effect f _{ i,j } distributed with mean 0 and variance \(\tau_{i}^{2}\), and an occasion-specific random residual g _{ i,j,k } distributed with mean 0 and variance \(\upsilon_{i,j}^{2}\).
Suppose further that these terms are independent and, for simplicity, homoscedastic in each series^{4} (i.e. for all patients j of the series i, \(\upsilon_{i,j}^{2}=\upsilon_{i}^{2}\)). Then, \(\forall i,\ \text{Var}\left(e_{i,j,k}\right)=\sigma_{i}^{2}=\text{Var}\left(\mu_{i}+f_{i,j}+g_{i,j,k}\right)=\tau_{i}^{2}+\upsilon_{i}^{2}\). However, the series-specific bias and the patient-specific effect are common to both terms of the calibrated error and cancel out: \({}_{c}e_{i,j,k}=e_{i,j,k}-e_{i,j,0}=g_{i,j,k}-g_{i,j,0}\).
Therefore, \(\text{Var}\left({}_{c}e_{i,j,k}\right)=2\upsilon_{i}^{2}\). The ratio of corrected to raw measurement standard errors is: \(\theta_{i}=\sqrt{2}\,\upsilon_{i}/\sqrt{\tau_{i}^{2}+\upsilon_{i}^{2}}\).
Under our assumptions, this ratio can take values between 0 (all error is patient-specific, with no residue, υ=0) and \(\sqrt {2}\) (all error is random, with no patient-specific component, τ=0). Both cases make sense in the current context.
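The bounds of this ratio can be checked by simulation. The sketch below draws, for each simulated patient, one patient effect and two independent residuals (calibration occasion and later occasion), then compares the empirical ratio of standard deviations with √2·υ/√(τ²+υ²). Normal distributions for both components, the function names and all numerical values are illustrative assumptions.

```python
import math
import random
import statistics


def simulate_ratio(tau, upsilon, mu=0.3, n_patients=20000, seed=1):
    """Empirical ratio sd(calibrated error) / sd(raw error)."""
    rng = random.Random(seed)
    raw, calibrated = [], []
    for _ in range(n_patients):
        f = rng.gauss(0.0, tau)                # patient-specific effect
        e0 = mu + f + rng.gauss(0.0, upsilon)  # raw error at the calibration occasion
        ek = mu + f + rng.gauss(0.0, upsilon)  # raw error at a later occasion
        raw.append(ek)
        calibrated.append(ek - e0)             # bias and patient effect cancel
    return statistics.pstdev(calibrated) / statistics.pstdev(raw)


def theoretical_ratio(tau, upsilon):
    """Ratio implied by the variance decomposition sigma^2 = tau^2 + upsilon^2."""
    return math.sqrt(2.0) * upsilon / math.sqrt(tau ** 2 + upsilon ** 2)
```

With `tau=0` the simulated ratio should be close to √2 (calibration only adds noise), and it shrinks towards 0 as the patient-specific component dominates.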
The definition of the calibrated error implies (6) that it is (positively) correlated to the raw error; therefore, their difference should be (negatively) correlated to the raw error, and so should be their means.
It is equivalent to estimate τ and υ or σ and θ. The latter allows, as we shall see, to model series with and without calibrated errors in the same way.
We model the impact of calibration as variations of the measurement error’s mean and standard deviation (modeled, as before, as being normally distributed):
We model the position parameters μ _{ i } and δ _{ i } of individual series as having a bivariate normal distribution; similarly, we model their (suitably transformed) spread parameters σ _{ i } and θ _{ i } as bivariate normally distributed:
and, as before, (7) allows us to use (4) and (5), mutatis mutandis, to compute the likelihoods from the published data.
From (10)–(11) and the properties of the multivariate normal distribution, it follows that the marginal distribution of μ _{ i } is given by (2) and that the marginal distribution of logσ _{ i } is given by (3); therefore, despite the appearances, (2)–(3) describe the same model as (10)–(11) when the calibrated data are unknown.
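For reference, the calibrated-error model described in this subsection can be sketched as follows. This is a reconstruction under assumptions: the shift δ is taken as additive, the ratio θ is taken on the log scale (one reading of "suitably transformed"), and the covariance matrices Σ_p and Σ_s are notational shorthand introduced here, built from the standard deviations and correlation coefficients named in the prior specification.

```latex
\begin{align}
{}_{c}e_{i,j,k} &\sim \mathcal{N}\left(\mu_{i}+\delta_{i},\ \theta_{i}\,\sigma_{i}\right) \tag{7}\\
\begin{pmatrix}\mu_{i}\\ \delta_{i}\end{pmatrix} &\sim \mathcal{N}_{2}\left(\begin{pmatrix}\mu_{m}\\ \mu_{\delta}\end{pmatrix},\ \Sigma_{p}\right),
\qquad \Sigma_{p}=\begin{pmatrix}\sigma_{m}^{2} & \rho_{p}\,\sigma_{m}\sigma_{\delta}\\ \rho_{p}\,\sigma_{m}\sigma_{\delta} & \sigma_{\delta}^{2}\end{pmatrix} \tag{10}\\
\begin{pmatrix}\log\sigma_{i}\\ \log\theta_{i}\end{pmatrix} &\sim \mathcal{N}_{2}\left(\begin{pmatrix}\mu_{ls}\\ \mu_{lt}\end{pmatrix},\ \Sigma_{s}\right),
\qquad \Sigma_{s}=\begin{pmatrix}\sigma_{ls}^{2} & \rho_{s}\,\sigma_{ls}\sigma_{lt}\\ \rho_{s}\,\sigma_{ls}\sigma_{lt} & \sigma_{lt}^{2}\end{pmatrix} \tag{11}
\end{align}
```

Under this form, the first components of (10) and (11) have the marginal distributions (2) and (3), which is the property used in the text.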
Model implementation and fitting
A Bayesian implementation of this model was fitted by MCMC methods, using the Stan [4] modeling language through the rstan [5] interface to R [6]. The model uses Eqs. (4) and (5) to compute the likelihood of the data and directly implements Eqs. (2) and (3) for series without calibrated SpHb and (8) to (11) for series with calibrated SpHb.
Using (1) and (7), we also sampled the relevant parameters of a new study and of a new observation within this study at each iteration of the MCMC sampling, thus obtaining a sample representative of the (predictive) distribution of measurement errors without being constrained by the particulars of any study. This simulation of the characteristics of the device in a new setting is the basis of our inferences on its performance.
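The predictive sampling step can be sketched as follows for raw SpHb. For each posterior draw of the population-level parameters, one new series (bias and standard deviation) and one new observation within it are simulated. The function name, the tuple layout and the numerical values are hypothetical placeholders; in practice each tuple would be one MCMC iteration from the fitted model.

```python
import math
import random


def simulate_new_observations(draws, seed=0):
    """One predictive measurement error per posterior draw.

    Each draw is a tuple (mu_m, sigma_m, mu_s, sigma_s) of population-level
    parameters: mean and sd of series biases, mean and sd of log series sds.
    """
    rng = random.Random(seed)
    errors = []
    for mu_m, sigma_m, mu_s, sigma_s in draws:
        mu_new = rng.gauss(mu_m, sigma_m)               # bias of a new series
        sigma_new = math.exp(rng.gauss(mu_s, sigma_s))  # sd of a new series (lognormal)
        errors.append(rng.gauss(mu_new, sigma_new))     # new observation in that series
    return errors


# Placeholder "posterior draws": illustrative values only, not fitted results
draws = [(0.2, 0.7, 0.2, 0.3)] * 5000
errors = simulate_new_observations(draws)
mean_abs_error = sum(abs(e) for e in errors) / len(errors)
```

Summaries such as the mean absolute error or the quantiles of `errors` then describe the device in a new setting rather than in any particular published study.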
Since our data (means and log-standard deviations of errors in the published series) were already more or less centered around 0 and scaled about 1, we followed [7, 8] and chose a Cauchy(0,3) density as a weakly informative prior distribution for the location parameters μ _{ m },μ _{ δ } and the transformed spread parameters μ _{ ls } and μ _{ lt }, a half Cauchy(0,3) T[0,] for the standard deviations σ _{ m },σ _{ δ },σ _{ ls } and σ _{ lt }, and a Uniform(−1,1) distribution for the correlation coefficients ρ _{ p } and ρ _{ s }. This choice gives a weakly informative prior distribution that is robust to a few outlying values without expressing unreasonable a priori beliefs in very large values of the parameters they model.
The resulting program is available as Additional file 1; it is also part of the noweb source of the present paper (see Additional file 2 for instructions).
The convergence of the MCMC chains was checked by visual assessment of the MCMC traces (see Additional file 3), the ratios of MCMC standard deviation to standard deviation for each parameter of the model (see Additional file 4) and the chain convergence indicator \(\widehat {\mathrm {R}}\) (see [9]). The quality of the model was assessed by placing each observed quantity in the a posteriori distribution of the parameter it estimates (see Additional file 5).
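For readers unfamiliar with the convergence indicator, the sketch below computes the classical (non-split) potential scale reduction factor from several chains of equal length. It is given for illustration only: the function name is hypothetical, and Stan reports a refined variant of this statistic, so values may differ slightly from those in the Additional files.

```python
import math
import random
import statistics


def potential_scale_reduction(chains):
    """Classical Gelman-Rubin R-hat for a list of equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average of the within-chain variances
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B: between-chain variance, scaled by the chain length
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance
    var_plus = (n - 1) / n * within + between / n
    return math.sqrt(var_plus / within)


rng = random.Random(42)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 0.0, 3.0, 3.0)]
```

Chains exploring the same distribution (`mixed`) give a value close to 1, whereas chains centered on different values (`stuck`) give a value well above 1.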
Diagnostic impact assessment
We used the bias and standard deviation values created during model parameter estimation to assess the impact of measurement errors in terms of decision errors. We postulated that the true values tHb of hemoglobin concentration were uniformly distributed on the [4 12] g/dL range.
Let f be the density of the measurement error E (whose realizations are the e _{ i,j,k } observations whose mean and standard deviation estimates are reported), and g the density of tHb (F and G being their respective distributions). The probability of observing a measurement SpHb lower than some threshold t (a “positive” reading in our case) is:
Similarly, the probability of a “true positive” is:
Since we modeled errors independent of “true” values tHb, these expressions simplify to:
The probability of a “positive” case being G(t) by definition, (14) and (15) are sufficient to compute the sensitivity, specificity and positive and negative predictive values.
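Under the independence assumption, the two simplified expressions presumably take the following convolution form (a reconstruction from the definitions of f, F, g and G above, writing SpHb = tHb + E). The sensitivity then follows by dividing the second probability by G(t), and the other indices by complementation.

```latex
\begin{align}
\Pr\left(\text{SpHb}<t\right) &= \int_{-\infty}^{+\infty} g(x)\,F(t-x)\,\mathrm{d}x \tag{14}\\
\Pr\left(\text{SpHb}<t,\ \text{tHb}<t\right) &= \int_{-\infty}^{t} g(x)\,F(t-x)\,\mathrm{d}x \tag{15}\\
\text{sensitivity} &= \frac{\Pr\left(\text{SpHb}<t,\ \text{tHb}<t\right)}{G(t)}
\end{align}
```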
The diagnostic impact of measurement errors depends on the distribution of the true values tHb. For reasons discussed below, we chose to assess this impact by postulating a uniform distribution of tHb on a range spanning the clinically useful range of threshold values. According to the literature, this range is about 6 to 10 g/dL [10–12]. Therefore, our impact assessment used a uniform distribution over the range from 4 to 12 g/dL.
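The computation can be sketched with a simple midpoint-rule integration, assuming a normal error independent of tHb and a uniform tHb on [4, 12] g/dL. The function names are hypothetical, and the bias and standard deviation used in the example are illustrative round values, not the fitted posterior.

```python
import math


def normal_cdf(x, mean, sd):
    """Distribution function of a normal error with given mean (bias) and sd."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def decision_probabilities(threshold, bias, sd, lo=4.0, hi=12.0, steps=4000):
    """Decision probabilities for a 'positive if SpHb < threshold' rule."""
    width = (hi - lo) / steps
    weight = width / (hi - lo)  # uniform density of tHb times the cell width
    tp = fp = fn = tn = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width                         # true value tHb (cell midpoint)
        p_positive = normal_cdf(threshold - x, bias, sd)   # P(x + E < threshold)
        if x < threshold:                                  # truly positive
            tp += weight * p_positive
            fn += weight * (1.0 - p_positive)
        else:                                              # truly negative
            fp += weight * p_positive
            tn += weight * (1.0 - p_positive)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "error": fp + fn,  # absolute probability of a potential decision error
    }


# Illustrative values: threshold 8 g/dL, no bias, error sd 1.5 g/dL
result = decision_probabilities(8.0, 0.0, 1.5)
```

Repeating this computation for each posterior predictive draw of (bias, sd) would give a distribution of the error probability rather than a single value.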
Results
The literature review led us to select 21 papers [13–33] reporting 34 distinct estimations of the mean and standard deviation of measurement error; among these, four papers [24, 27, 28, 32] report the characteristics of measurement error after initial calibration in five series. The data extracted from the literature are listed in Table 1.
Model fit
In the text, posterior distributions are summarized as mean ± sd (credible interval) unit; the bounds of the credible intervals are the 0.025 and 0.975 quantiles. The full set of summary statistics for the MCMC sample can be found in Additional file 4.
Analysis of raw SpHb measurement errors
The population-level results of the model fitting for raw SpHb measurement errors are depicted in Fig. 1 and summarized in Table 2; Table 3 summarizes predictive error results, i.e. bias and standard deviation in a new study (new setting), and mean error, squared error and absolute error for a new observation.
The overall mean error (bias) of raw SpHb has mean 0.23 ± 0.12 (−0.02 0.46) g/dL; the measurement error of raw SpHb is distributed around this mean with log-standard deviation 0.23 ± 0.04 (0.15 0.30) g/dL.
The mean expected bias (systematic error expected in a new study) is 0.24 ± 0.73 (−1.23 1.59) g/dL. The mean expected error (new measurement error in a new study) is 0.27 ± 1.47 (−2.56 3.26) g/dL, whereas the mean expected absolute error is 1.18 ± 0.92 (0.05 3.36) g/dL, and the root of the mean quadratic expected error is 1.50 g/dL.
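As a consistency check on these figures, the root mean quadratic error of any distribution equals √(mean² + sd²). Applied to the reported mean expected error and its standard deviation, this should reproduce the reported value up to rounding of the inputs (the variable names are illustrative).

```python
import math

mean_error = 0.27  # mean expected error, g/dL (reported above)
sd_error = 1.47    # its standard deviation, g/dL (reported above)

# E[X^2] = mean^2 + variance, hence the root mean quadratic error:
rmse = math.sqrt(mean_error ** 2 + sd_error ** 2)
```

The result agrees with the reported 1.50 g/dL to within the rounding of the two inputs.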
Impact of calibration
The population-level estimates of the impact of calibration are presented in Table 4 and Fig. 2 and the simulation-based estimates of the resulting measurement errors are presented in Table 5, which also reports the expected bias correction and expected ratio of raw and calibrated standard deviations (inflation/deflation factor).
One notes that, whereas the bias correction is almost systematically negative (−0.42 ± 0.20 (−0.83 0.02) g/dL), the impact of calibration on standard error and expected errors is modest (the mean expected absolute error is 1.05 ± 0.87 (0.04 3.21) g/dL, which is not much less than in the non-calibrated case), and calibration has a non-negligible probability of enlarging the standard error (for a new study, Pr(θ>1)≈ 0.102).
Estimation of clinical impact
The decisional impact of measurement errors of raw SpHb is summarized in Table 6 in terms of sensitivity, specificity, positive and negative predictive values (conditional probabilities) as well as accuracy and probability of a decision error (absolute probabilities); these results are illustrated in Fig. 3. Similarly, Table 7 and Fig. 4 summarize the diagnostic impact of measurement errors of calibrated SpHb. The resultant risks of decision errors and their credible regions are graphically compared in Fig. 5.
Discussion
Methods
Modeling
Study-level modeling
A mixed-model meta-analysis requires the estimation of one study-level parameter per data point, plus any population-level parameters necessary to the model (in our case, population-level mean and standard deviation or, in the case of _{ c }SpHb, differences with SpHb means and ratios to the SpHb standard deviation). This is true both for frequentist, ML-based estimation and for Bayesian model fitting. Therefore, published meta-analyses usually do not allow the assumptions underlying their estimations and inferences to be checked on the sole basis of published data.
In our case, Eqs. (4) and (5) are crucial. The former is uncontroversial: this result is known to be asymptotically true for any sample of independent and identically distributed observations drawn from a distribution for which the central limit theorem holds; its rate of convergence is known to be good enough for almost any “large” sample (N>30 is a common rule of thumb in applied statistics), and is often considered sufficient for “reasonably” distributed small samples.
The latter is valid only for i.i.d. samples of normally distributed variables. We are not aware of any general asymptotic results concerning the estimation of variability parameters. This scarcity of general results, already noted by [1], has also motivated a recent paper by Nakagawa et al. [34], in which the authors build tools for the meta-analytic estimation of variability; since the relevant tool for their question is the coefficient of variation, and ratios thereof, they derive the corresponding estimators and their properties.
Their work is based on an unbiased estimator of the log of the standard deviation σ:
This equality can be derived from the left-hand side of (5). The authors add: It is assumed that with a large sample size and sufficiently large value of σ, logσ is normally distributed with variance \(s^{2}_{\log \sigma }\). They support this assumption by referencing a 1987 paper by Raudenbush & Bryk, who indeed derive a large-sample theory for this case ([35], pp 250–1). Unfortunately, this paper also states that “First, the underlying data must be assumed normally distributed, an assumption which can be checked by conventional methods” (ibid., p. 252).
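To our understanding of [34], the estimator referred to here and its sampling variance take the following form, where s is the sample standard deviation and n the sample size. This is stated as an assumption and should be checked against that paper.

```latex
\begin{align}
\widehat{\log\sigma} &= \log s + \frac{1}{2\,(n-1)}, &
s^{2}_{\log\sigma} &= \frac{1}{2\,(n-1)}
\end{align}
```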
In other words, the validity of (5) depends on the accuracy and rate of convergence of log s − log σ to a normal distribution. We are not aware of any analytic or simulation convergence results for this quantity, but the convergence of a \({\chi ^{2}_{n}}\) distribution to a normal is known to be slow.
Since the individual data are unavailable by hypothesis in a meta-analytic context, the normality of the distribution of these data cannot be “checked”. The rate of asymptotic convergence to normality being unknown, the assumption of normality of individual data is a strong requirement for the validity of our modeling.
We are not aware of any other literature pertinent to the meta-analytic estimation of variability.
Population-level modeling
The modeling of means (Eq. 2) is the de facto standard in meta-analysis. The modeling of standard deviations (Eq. 3) is less so:
By analogy with the sampling distribution of the variance of identically sized normal samples, a gamma distribution was a “natural” candidate for this modeling. However, the interpretation of its parameters was delicate, and the elicitation of priors for these parameters even more so. Therefore, we chose to use a lognormal model of the population of standard deviations. The point of this choice was to get a parametrization allowing easy interpretation and easy prior elicitation.
We also modeled μ _{ i } independent of σ _{ i }; this assumption simplifies programming, and appears reasonable: in the original data, the correlation of biases and standard deviations is 0.013 (similarly, the correlation of mean _{ c }SpHb and their standard deviations is −0.53, with only 3 d.f.).
Modeling of calibrated SpHb
The rationale for modeling _{ c }SpHb as we did has been set out above. We could also have used a single model, using only (10) and (11) and treating the (hypothetical) values of _{ c }SpHb in studies not reporting it as missing data (supplementary parameters to the model). The results should be equivalent, but the programming would have been more awkward.
Prior distributions
We needed to give our hyperparameters a proper prior distribution, in order to get proper posterior distributions and to be able to use the log-posterior samples to estimate a Bayes factor. However, we had very little information on the distribution of our subject of interest before reading the relevant papers; therefore, we chose to use weakly informative priors. Centering them on 0 was uncontroversial. The difficulty was in the choice of shape and scale.
It has been noted that the common \({\mathcal {N}}(0,V^{2})\) distribution with some very large standard deviation V, often used as a “weakly informative” prior, assigns a prior probability of about 0.32 to absolute values larger than V. Choosing an unreasonably high value of V is hardly defensible in the face of the subject matter.
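The tail probabilities behind this argument are easy to compute in closed form. The sketch below gives the two-sided tail mass of a centered normal beyond a multiple of its standard deviation, and of a centered Cauchy beyond a given value; the function names are hypothetical.

```python
import math


def normal_two_sided_tail(k=1.0):
    """P(|X| > k*V) for X ~ N(0, V^2); the result does not depend on V."""
    return 1.0 - math.erf(k / math.sqrt(2.0))


def cauchy_two_sided_tail(x, scale):
    """P(|X| > x) for X ~ Cauchy(0, scale)."""
    return 1.0 - (2.0 / math.pi) * math.atan(x / scale)


p_normal = normal_two_sided_tail(1.0)        # mass beyond one standard deviation
p_cauchy_3 = cauchy_two_sided_tail(3.0, 3.0)   # mass beyond the scale of a Cauchy(0,3)
p_cauchy_30 = cauchy_two_sided_tail(30.0, 3.0) # mass beyond ten times the scale
```

A Cauchy(0,3) prior places half of its mass within ±3 and still leaves roughly 6 % beyond ±30, which is what makes it robust to occasional large values without requiring a very large scale.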
Our choice of priors was remotely inspired by the work of Gelman et al. [7, 8] and we borrowed their proposed functional form, except for correlation coefficients where a Uniform(−1,1) was a natural choice.
Clinical impact assessment
We chose to report the clinical performance of the device by the (absolute) probability of a (potential) decision error; this index seemed more clinically interpretable and usable than specificities and sensitivities (which can be traded off against one another by the choice of threshold), which are conditional probabilities.
Postulating the independence of values of measured variable and measurement error allowed us to use (14) and (15), which can be simply computed, at least with our choice of population distribution of the variable, with a standard numerical integration routine.
They can even be explicitly solved in some cases: for example, a normal density of values and a normal density of errors convolve to a normal density, which can be trivially used to compute the probability of errors. However, this model would not have reasonable medical support in our case.
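To make the normal case explicit: if the true values and the errors are independent and normally distributed, the reading is normal as well and the probability of a positive reading has a closed form. The symbols a and b (mean and standard deviation of tHb) are introduced here for illustration only, and Φ denotes the standard normal distribution function.

```latex
\begin{align}
\text{tHb} \sim \mathcal{N}\left(a,\ b^{2}\right),\quad E \sim \mathcal{N}\left(\mu,\ \sigma^{2}\right)
&\ \Longrightarrow\ \text{SpHb}=\text{tHb}+E \sim \mathcal{N}\left(a+\mu,\ b^{2}+\sigma^{2}\right)\\
\Pr\left(\text{SpHb}<t\right) &= \Phi\left(\frac{t-a-\mu}{\sqrt{b^{2}+\sigma^{2}}}\right)
\end{align}
```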
Without this postulate, the multiple integrals (12) and (13) are much more problematic to compute numerically, and a better solution would probably be to estimate them by one form or another of Monte Carlo integration.
We also had to choose a range of “useful” values to assess the potential clinical impact of the measurement errors.

It is obvious that real values quite far from the decision threshold do not contribute to false positive/negative (the probability of a large error is small).

On the other hand, a small region quite close to the decision threshold overstates the risk of false results. For example, with a symmetric error density, given a threshold T in a study i with average bias μ _{ i }, the rate of false negatives over a test region (T+μ _{ i }+ε T+μ _{ i }+2ε) would have a limit of \(\frac {1}{2}\) for ε→0^{+}.

Similarly, a strong mode would overstate the weight of the region surrounding that mode.
The anesthesiological literature shows that a “reasonable” region for transfusion decision threshold is [6 10] g/dL; the choice of a threshold in this range for a given patient depends on various domain- and patient-specific factors.
It was therefore necessary to cover this range (with an extension to “likely” regions), with no justification for choosing a mode. This led us to choose the [4 12] g/dL range.
A better choice would have been to model the distribution of tHb of measurements done for clinical reasons (i.e. excluding the systematic or calibration measurements). The source papers, however, did not document this information in any usable form.
Results
Raw SpHb
The posterior distribution of measurement errors in a new study is slightly asymmetric around 0 and gives a non-negligible probability to large measurement errors; one also notes that the mean absolute expected error is a large fraction of the clinically useful range of measurement.
Impact of calibration
One notes that the convergence of the MCMC estimation of _{ c }SpHb-related parameters is more difficult than for raw SpHb-related ones (smaller n_eff): this might be accounted for by the very small amount of available data; only five series after calibration have been published, which might be the absolute minimal sample size for estimating variability.
One also notes that the mean calibrated measurement error is negative. This might have a natural explanation: the calibration is made at the beginning of an intervention, when SpHb is, in general, normal for the patient, whereas the clinically useful measurements are done during intervention, when SpHb might have been lowered by surgical hemorrhage and subsequent perfusion. Several authors [17, 20, 21, 23, 26] have reported, with various levels of precision, a relation between (true) hemoglobin concentration and measurement error; this might explain why the correction, computed on a high-hemoglobin concentration basis, is insufficient to cancel the actual bias observed in low-hemoglobin concentration conditions.
One notes, however, that none of the papers reporting this value-error relationship gave enough detail to allow modeling; our analysis, which is compelled to ignore it, is therefore a simplification of the real situation (but probably not an oversimplification).
Use of trends
Several authors have reported using SpHb (or SpHb _{c}) in terms of trends over time, allowing them to assess the need for a reference measurement tHb rather than the need for transfusion; other authors suggested using these trends, but without reporting actual use. However, these reports were not precise and consistent enough to allow a modeling of this use without resorting to correspondence with the original authors for crucial details. Our time limits precluded such research.
Clinical impact
The probability of a “false report” (false positive or false negative) varies slightly over the range of clinically useful thresholds; however, this probability (13–14 % for raw SpHb, around 13 % for calibrated SpHb) remains clinically problematic: it would affect about one patient out of seven. However, the risk of “effective clinical decision error” is probably lower: the hemoglobin concentration is but one input in a complex decision process, whose analysis on the basis of published information is impossible.
The asymmetry of the diagnostic value curves around the midpoint of the clinically useful range (Figs. 3 and 4) is a consequence of the slight biases of raw SpHb and calibrated SpHb.
One should note (see Fig. 5) that the risks of “false reports” are much more uncertain for calibrated SpHb than for raw SpHb, a consequence (again) of the very low number of published studies on calibrated SpHb.
Study limitations
The present study has a number of restrictions that limit its significance:
Literature review
The allocated time for our review precluded an extensive search for gray literature; it also precluded querying the original authors for clarifications of their results. Limiting a meta-analysis to formally published data is known to reinforce imprecision.
Similarly, we did not conduct a formal bias risk assessment; this, however, was of lesser consequence for our goals.
Study design
We did not compare the proposed monitor and the reference method (laboratory measurement) to a common (hypothetical) “gold standard”; instead, we assessed the impact of substituting the monitoring for the reference method in terms of clinical decision differences. Since the reference method is the current “clinical gold standard”, it is supposed perfect for clinical purposes, and its possible false positives and false negatives are ignored.
Such a comparison, which might have been worthwhile if the proposed monitoring had a variability close to that of the reference method (various sources quote a mean absolute expected error in the 0.1–0.3 g/dL range), would need an assessment of the reference method, unavailable from published data.
Modeling
We did not even consider fitting a so-called “fixed effects” model, considering that the heterogeneity of published data was self-evident.
Lack of time precluded a sensitivity study of the impact of the shape of the study-level parameters distribution and most notably of the shape of the population-level parameters distribution. Our choices seem “reasonable”, but their impact has not been assessed. Further work should assess these impacts.
Similarly, the impact of a departure from the assumption of normality of measurement errors should be systematically assessed, both analytically and with simulation approaches.
Our goals in modeling were limited to the assessment of measurement error and its consequences in terms of decisional errors. In contrast, the authors of [1] created a multiple regression model allowing them to assess the impact of various covariates. This modeling, probably very interesting to anesthesiologists and physiologists, was outside our scope of assessing the practical usability of the device under examination.
Finally, we did not try to assess the reality of the impact of calibration in terms of “hypothesis testing” or “model comparison”: this question was not in our scope of interest.
Clinical impact estimation
Modeling the clinical consequences of “false reports” is a much more intricate problem, requiring the modeling of a large body of medical knowledge. This question was well outside the limited scope of the present study.
It should be noted that the main result in terms of clinical impact is an absolute probability of error rather than a conditional probability (such as sensitivity, specificity or predictive values).
Conclusions
This study has shown that:

Under the assumption of normality, a hierarchical model of variability can be built and used to estimate the variability of a phenomenon from published aggregate data, without recourse to individual data.

This estimation can be used to assess the decisional (binary) consequences of the variability of the phenomenon of interest.

The device of interest has been shown to have a mean absolute expected error of 1.18 ± 0.92 (0.05 3.36) g/dL, which is large when compared to the clinically useful range of measurements.

The mean measurement error (bias) is 0.24 ± 0.73 (−1.23 1.59) g/dL, whose 95 % credible interval contains 0, and which is negligible compared to the mean absolute expected error.

This measurement error entails a risk of decision error potentially impacting one patient out of seven, which is clinically problematic.

This risk of “false report” is therefore much less a consequence of the mean expected error (bias) than a consequence of the mean absolute error (variability of the measurement).

A calibration of the device using an initial reference measurement does not change this situation to any clinically relevant extent.
The proposed model, whose range of validity remains to be assessed, allows the risk of variability-bound decision errors of a measurement to be estimated from published aggregates; in the motivating example of hemoglobin concentration monitoring, this estimation shows that the clinical use of the device is problematic.
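The dominance of variability over bias noted above can be illustrated under the simplifying assumption of a single normally distributed error \(e \sim {\mathcal {N}}(\mu, \sigma ^{2})\). The identity below is the textbook mean of a folded normal distribution, with \(\Phi\) the standard normal distribution function; it is a sketch of the mechanism, not the hierarchical computation that produced the figures quoted in this paper.

\[ \mathrm{E}\,|e| \;=\; \sigma \sqrt{\frac{2}{\pi}}\, \exp\!\left(-\frac{\mu^{2}}{2\sigma^{2}}\right) \;+\; \mu \left[1 - 2\,\Phi\!\left(-\frac{\mu}{\sigma}\right)\right] \]

For \(|\mu| \ll \sigma\), a second-order expansion gives \(\mathrm{E}\,|e| \approx \sigma \sqrt{2/\pi}\,\left(1 + \mu^{2}/(2\sigma^{2})\right)\): the bias contributes only through the small term \(\mu^{2}/(2\sigma^{2})\), so the expected absolute error is governed almost entirely by \(\sigma\).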
Ethical approval and consent
The present paper illustrates the proposed model with an example using already-published data. The authors did not check the conformance of the original papers to the Declaration of Helsinki and relied on the checks made by the publishers of the original papers.
Standards of reporting
The present paper illustrates the proposed model with an example using already-published data; however, it does not aim to be a full-fledged systematic review of the motivating case. Consequently, the authors did not use the PRISMA checklist; this is discussed as one of the study limitations.
Endnotes
^{1} Comité d’Évaluation et de Diffusion des Innovations Technologiques de l’APHP.
^{2} Assistance Publique — Hôpitaux de Paris.
^{3} Masimo Radical-7, Masimo Corp., USA. This device uses an extension of plethysmography by evaluating skin reflectance at 12 different wavelengths.
^{4} The assumption of homoscedasticity of the residual errors allows for a simpler expression of the decomposition of the total error, but is not a necessary condition of validity; the assumption of independence is more crucial.
Abbreviations
 HTA:

Health Technology Assessment: “the systematic evaluation of the properties and effects of a health technology, addressing the direct and intended effects of this technology, as well as its indirect and unintended consequences, and aimed mainly at informing decision making regarding health technologies” [36]
 tHb:

Hemoglobin concentration as measured by the reference method
 SpHb:

Hemoglobin concentration as measured by the device of interest
 _{c}SpHb:

Hemoglobin concentration as measured by the device of interest and corrected for initial bias
 LOA:

Limits of agreement, a term widely used in methods comparison studies. See [3]
 \({\mathcal {N}}(\mu, \sigma ^{2})\) :

Denotes a normal density of mean μ and variance σ ^{2}
 MCMC:

Markov chain Monte Carlo
 ML:

Maximum Likelihood
 \({\mathcal {L}N}(\mu, \sigma ^{2})\) :

Denotes a lognormal density of location parameter μ and spread parameter σ ^{2}
 \({\mathcal {M}VN}(\boldsymbol {\mu },\boldsymbol {\Sigma })\) :

Denotes a multivariate normal density of mean vector μ and variance-covariance matrix Σ
 t _{ n } :

Denotes a standard Student’s t density with n degrees of freedom
 \({\chi ^{2}_{n}}\) :

Denotes a chi-squared density with n degrees of freedom
References
 1
Kim SH, Lilot M, Murphy LSL, Sidhu KS, Yu Z, Rinehart J, et al. Accuracy of continuous noninvasive hemoglobin monitoring: a systematic review and meta-analysis. Anesth Analg. 2014; 119(2):332–46. doi:ANE.0000000000000272.
 2
Williamson PR, Lancaster GA, Craig JV, Smyth RL. Meta-analysis of method comparison studies. Stat Med. 2002; 21(14):2013–25. doi:10.1002/sim.1158. Accessed 2015-03-10.
 3
Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet. 1986; 327(8476):307–10. doi:10.1016/S0140-6736(86)90837-8.
 4
Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, Version 2.6; 2015. http://mc-stan.org/. Accessed on January 8, 2016.
 5
Stan Development Team. RStan: the R Interface to Stan, Version 2.5.0; 2014. http://mc-stan.org/interfaces/rstan.html. Accessed on January 8, 2016.
 6
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2015. http://www.r-project.org/. Accessed on January 8, 2016.
 7
Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 2006; 1(3):515–33.
 8
Gelman A, Jakulin A, Pittau MG, Su YS. A weakly informative default prior distribution for logistic and other regression models. Ann Appl Stat. 2008; 2(4):1360–83. doi:10.1214/08-AOAS191. Accessed 2015-04-13.
 9
Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd ed. Boca Raton: Chapman and Hall/CRC; 2013.
 10
Carson JL, Carless PA, Hébert PC. Transfusion thresholds and other strategies for guiding allogeneic red blood cell transfusion. Cochrane Database Syst Rev. 2012; 4:CD002042. doi:10.1002/14651858.CD002042.pub3.
 11
Carson JL, Carless PA, Hébert PC. Outcomes using lower vs higher hemoglobin thresholds for red blood cell transfusion. JAMA. 2013; 309(1):83–4. doi:10.1001/jama.2012.50429.
 12
Transfusion de globules rouges homologues  Anesthésie réanimation chirurgie. Haute Autorité de Santé: Argumentaire scientifique. 2014. http://www.has-sante.fr/portail/upload/docs/application/pdf/201502/transfusion_de_globules_rouges_homologues_. anesthesie_reanimation_chirurgie_urgence__argumentaire.pdf. Accessed 2015-04-06.
 13
Berkow L, Rotolo S, Mirski E. Continuous noninvasive hemoglobin monitoring during complex spine surgery. Anesth Analg. 2011; 113(6):1396–402. doi:10.1213/ANE.0b013e318230b425.
 14
Causey MW, Miller S, Foster A, Beekley A, Zenger D, Martin M. Validation of noninvasive hemoglobin measurements using the Masimo Radical-7 SpHb Station. Am J Surg. 2011; 201(5):592–8. doi:10.1016/j.amjsurg.2011.01.020.
 15
Lamhaut L, Apriotesei R, Combes X, Lejay M, Carli P, Vivien B. Comparison of the accuracy of noninvasive hemoglobin monitoring by spectrophotometry (SpHb) and HemoCue® with automated laboratory hemoglobin measurement. Anesthesiology. 2011; 115(3):548–54. doi:10.1097/ALN.0b013e3182270c22.
 16
Miller RD, Ward TA, Shiboski SC, Cohen NH. A comparison of three methods of hemoglobin monitoring in patients undergoing spine surgery. Anesth Analg. 2011; 112(4):858–63. doi:10.1213/ANE.0b013e31820eecd1.
 17
Applegate RL, Barr SJ, Collier CE, Rook JL, Mangus DB, Allard MW. Evaluation of pulse co-oximetry in patients undergoing abdominal or pelvic surgery. Anesthesiology. 2012; 116(1):65–72. doi:10.1097/ALN.0b013e31823d774f.
 18
Butwick A, Hilton G, Carvalho B. Noninvasive haemoglobin measurement in patients undergoing elective Caesarean section. Br J Anaesth. 2012; 108(2):271–7. doi:10.1093/bja/aer373.
 19
Colquhoun DA, Forkin KT, Durieux ME, Thiele RH. Ability of the Masimo pulse CO-Oximeter to detect changes in hemoglobin. J Clin Monit Comput. 2012; 26(2):69–73. doi:10.1007/s10877-012-9335-3.
 20
Park YH, Lee JH, Song HG, Byon HJ, Kim HS, Kim JT. The accuracy of noninvasive hemoglobin monitoring using the Radical-7 pulse CO-Oximeter in children undergoing neurosurgery. Anesth Analg. 2012; 115(6):1302–7. doi:10.1213/ANE.0b013e31826b7e38.
 21
Vos JJ, Kalmar AF, Struys MMRF, Porte RJ, Wietasch JKG, Scheeren TWL, et al. Accuracy of noninvasive measurement of haemoglobin concentration by pulse co-oximetry during steady-state and dynamic conditions in liver surgery. Br J Anaesth. 2012; 109(4):522–8. doi:10.1093/bja/aes234.
 22
Dewhirst E, Naguib A, Winch P, Rice J, Galantowicz M, McConnell P, et al. Accuracy of noninvasive and continuous hemoglobin measurement by pulse co-oximetry during preoperative phlebotomy. J Intensive Care Med. 2013; 29(4):238–42. doi:10.1177/0885066613485355.
 23
Giraud B, Frasca D, Debaene B, Mimoz O. Comparison of haemoglobin measurement methods in the operating theatre. Br J Anaesth. 2013; 111(6):946–54. doi:10.1093/bja/aet252.
 24
Isosu T, Obara S, Hosono A, Ohashi S, Nakano Y, Imaizumi T, et al. Validation of continuous and noninvasive hemoglobin monitoring by pulse CO-oximetry in Japanese surgical patients. J Clin Monit Comput. 2013; 27(1):55–60. doi:10.1007/s10877-012-9397-2.
 25
Sjöstrand F, Rodhe P, Berglund E, Lundström N, Svensen C. The use of a noninvasive hemoglobin monitor for volume kinetic analysis in an emergency room setting. Anesth Analg. 2013; 116(2):337–42. doi:10.1213/ANE.0b013e318277dee3.
 26
Kim SH, Choi JM, Kim HJ, Choi SS, Choi IC. Continuous noninvasive hemoglobin measurement is useful in patients undergoing double-jaw surgery. J Oral Maxillofac Surg: Official J Am Assoc Oral Maxillofacial Surgeons. 2014; 72(9):1813–9. doi:10.1016/j.joms.2014.03.011.
 27
Miyashita R, Hirata N, Sugino S, Mimura M, Yamakage M. Improved noninvasive total haemoglobin measurements after in-vivo adjustment. Anaesthesia. 2014; 69(7):752–6. doi:10.1111/anae.12681.
 28
Patino M, Schultz L, Hossain M, Moeller J, Mahmoud M, Gunter J, et al. Trending and accuracy of noninvasive hemoglobin monitoring in pediatric perioperative patients. Anesth Analg. 2014; 119(4):920–5. doi:ANE.0000000000000369.
 29
Toyoda D, Yasumura R, Fukuda M, Ochiai R, Kotake Y. Evaluation of multiwave pulse total-hemoglobinometer during general anesthesia. J Anesthesia. 2014; 28(3):463–6. doi:10.1007/s00540-013-1730-5.
 30
Yamaura K, Nanishi N, Higashi M, Hoka S. Effects of thermoregulatory vasoconstriction on pulse hemoglobin measurements using a co-oximeter in patients undergoing surgery. J Clinical Anesthesia. 2014; 26(8):643–7. doi:10.1016/j.jclinane.2014.04.012.
 31
Awada WN, Mohmoued MF, Radwan TM, Hussien GZ, Elkady HW. Continuous and noninvasive hemoglobin monitoring reduces red blood cell transfusion during neurosurgery: a prospective cohort study. J Clin Monit Comput. 2015. doi:10.1007/s10877-015-9660-4.
 32
Frasca D, Mounios H, Giraud B, Boisson M, Debaene B, Mimoz O. Continuous monitoring of haemoglobin concentration after in-vivo adjustment in patients undergoing surgery with blood loss. Anaesthesia. 2015. doi:10.1111/anae.13028.
 33
Saito J, Kitayama M, Oishi M, Kudo T, Sawada M, Hashimoto H, et al. The accuracy of noninvasively continuous total hemoglobin measurement by pulse CO-Oximetry undergoing acute normovolemic hemodilution and reinfusion of autologous blood. J Anesthesia. 2015; 29(1):29–34. doi:10.1007/s00540-014-1863-1.
 34
Nakagawa S, Poulin R, Mengersen K, Reinhold K, Engqvist L, Lagisz M, Senior AM. Meta-analysis of variation: ecological and evolutionary applications and beyond. Methods Ecol Evol. 2015; 6(2):143–52. doi:10.1111/2041-210X.12309.
 35
Raudenbush SW, Bryk AS. Examining Correlates of Diversity. J Educ Stat. 1987; 12(3):241–69. doi:10.2307/1164686. Accessed 2015-04-13.
 36
HtaGlossary.net  health technology assessment (HTA). http://htaglossary.net/health+technology+assessment+%28HTA%29.
Acknowledgements
The authors warmly thank Pr Bruno Falissard (Hôpital Paul-Brousse (APHP) and Unité INSERM U669), who declined authorship of this paper notwithstanding the importance of his contributions, criticisms and suggestions.
The authors also wish to thank Dr Shinichi Nakagawa, Associate Professor of Behavioural Ecology, University of Otago, who kindly sent us a copy of his paper, which was unobtainable through the French public university library system.
The authors wish to thank Dr Ian R White, MRC Biostatistics Unit, and Pr Sylwia Bujkiewicz, Lecturer in Biostatistics, University of Leicester, reviewers of earlier versions of the present paper, for their insightful and helpful comments and suggestions which, among other enhancements, corrected a serious deficiency of a previous version of the presented model.
In memoriam Jean Charpentier, Aug 9, 1930 — Sep 7, 2015.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Starting from a larger assessment study led by EC and AB, EC designed the study, extracted the analyzed data, conceived, refined and ran the probabilistic model, proposed its medical implications and drafted successive versions of the paper; EC, VL and BF searched for and collected the primary data, validated data extraction, and criticized the model, the results and their interpretation; AB and LG contributed to reframing the methods and results in the larger HTA context, and provided much-needed medical criticism and, therefore, a large part of the discussion. However, the authors wish to underscore that this paper, like any scientific paper, results from teamwork, where the questions and criticisms of some authors were as important in the development of the paper as the answers from other authors; therefore, the authors do not wish to separate their respective contributions. All authors read and approved the final manuscript.
Additional files
Additional file 1
Stan implementation of the meta-analytic model. (STAN 7.66 kb)
Additional file 2
(Unix) text file: how to reproduce this paper (incl. software requirements). (TXT 1.70 kb)
Additional file 3
MCMC traces (post-warmup). (PDF 2170.88 kb)
Additional file 4
Summary of the posterior distribution of all parameters in the model. (PDF 41.3 kb)
Additional file 5
Boxplots of study-level parameter distributions against the observed data they model. (PDF 27.2 kb)
Additional file 6
noweb (knitr) source of this paper. Includes data, R and Stan code, bibliographic database. (RNW 182 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Charpentier, E., Looten, V., Fahlgren, B. et al. Meta-analytic estimation of measurement variability and assessment of its impact on decision-making: the case of perioperative haemoglobin concentration monitoring. BMC Med Res Methodol 16, 7 (2016). doi:10.1186/s12874-016-0107-5
Keywords
 Methods
 Meta-analysis as topic
 Observer variation
 Reproducibility of results
 Predictive value of tests
 Meta-analysis
 Monitoring, intraoperative
 Monitoring, physiologic/methods
 Biological markers/blood
 Hemoglobinometry
 Oximetry
 AMS Subject Classification: Primary 62F15, secondary 92B15