## Don't forget the Poisson distribution

David Muscatello, NSW Department of Health

16 June 2003

This paper is a timely reminder that a greater range of issues than just statistical ones needs to be addressed when analysing administrative or registry data collections. Often the main error or bias is measurement error. Data analysts are often ill-equipped to deal with these issues because statistical training has not kept pace with the increasing availability of large administrative databases collected by health departments and other organisations. These data collections offer an often limitless number of potential analyses, but it can be unclear what statistical methods can be appropriately applied. Often the counts are so large in such collections that practical variation is more important than random variation.

I disagree with one aspect of the paper, however. The example presented on page 8 of the final draft states that because there was complete enumeration, there is no random process. This neglects the extremely useful statistical methods developed around the Poisson distribution. The Poisson distribution can generally be applied to counts of outcomes that occur over time, such as counts of disease occurring in a population. It allows for the random variation in the number of occurrences of an event from one time interval to the next. The incidence of the disease presented in the example could be described as an estimate of the mean of a Poisson distribution.

Using the formulae on page 154 of Armitage P, Berry G, and Matthews JNS, Statistical Methods in Medical Research, Fourth Edition, an approximate 95% confidence interval for the count of 8,650 is 8,467 to 8,833. This represents the random variation in the count that might be observed when other one-year intervals are studied in that population.
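The comment does not reproduce the formula from Armitage et al., so the following is only a sketch under an assumption: that the interval comes from the usual large-count normal approximation to the Poisson, count ± 1.96 × sqrt(count). The function name `poisson_normal_ci` is illustrative and not taken from the source.

```python
import math

def poisson_normal_ci(count, z=1.96):
    """Approximate CI for a Poisson count via the normal approximation.

    Assumption: variance equals the mean (Poisson), so the standard
    error of the count is sqrt(count). Reasonable only for large counts.
    """
    half_width = z * math.sqrt(count)
    return count - half_width, count + half_width

lower, upper = poisson_normal_ci(8650)
print(f"{lower:.1f} to {upper:.1f}")  # 8467.7 to 8832.3
```

Rounding these limits outward to whole cases gives 8,467 to 8,833, which agrees with the interval quoted above. An exact Poisson interval would differ by about one or two cases at this count size.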

## Competing interests

None

## Re: Don't forget the Poisson distribution

Carl Phillips, University of Texas

20 August 2003

I agree with most of David Muscatello's comments about our paper, and appreciate his commentary and interest in the analysis. I would like to offer a clarification on the point about the Poisson distribution.

I agree that when a complete enumeration of one year's data (or any other unbiased subset of all possible data) is used to estimate an average rate, then -- if the stochastic process governing the distribution of rates for subpopulations is known -- a confidence interval for the average rate can be calculated. For example, as suggested, if we believed that disease rates followed a Poisson distribution across subpopulations, then we could do the suggested calculation. The most common application of this, of course, can be found in what we normally do when we believe we have a random subset of all possible data. Indeed (in case the pun in the previous sentence was too subtle to make the point), for large enough numbers, the Poisson is approximated by the binomial distribution, which in turn is approximated by the normal, and so we are simply recreating the usual calculation for a (sufficiently large) random sample from a (sufficiently larger) population.

However, in the example given in the paper, it should be noted that our hypothetical researchers are not making a claim about the average rate, just the rate for that year. Thus no random process exists; the probability distribution of all possible incidence rates for the one-year period has collapsed into a particular value, and all that is left in the example is measurement error. This may seem like a mere semantic point, but it can also be seen as an example of how the quality of an answer is highly dependent on what question is being asked: If we are trying to extrapolate from one year's data to an average over time, this stochastic error is part of the answer; if we are trying to enumerate one year's data, it is not.

It should also be noted that, though we do not specify what disease we are talking about in that example, many disease incidence rates do not follow a Poisson distribution (and thus also cannot be approximated by a normal distribution). Diseases that are contagious, whose incidence is dominated by outbreaks (e.g., some foodborne diseases if person-time is sufficiently small), or are substantially affected by common exposures (e.g., weather) may have distributions that are much flatter than Poisson.
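One way to see how a distribution can be "much flatter than Poisson" is to let the underlying yearly mean itself vary, for instance because of outbreaks or weather. The sketch below uses the law of total variance for a Poisson count whose mean fluctuates between years. The 5% coefficient of variation is purely hypothetical and does not come from the paper or the comments, and the function name is illustrative.

```python
import math

def mixed_poisson_sd(mean, cv_of_rate):
    """SD of a count whose Poisson mean varies from year to year.

    Law of total variance: Var = mean + (cv_of_rate * mean) ** 2.
    With cv_of_rate = 0 this reduces to the pure Poisson SD, sqrt(mean).
    """
    return math.sqrt(mean + (cv_of_rate * mean) ** 2)

mean_count = 8650
poisson_sd = mixed_poisson_sd(mean_count, 0.0)          # pure Poisson
overdispersed_sd = mixed_poisson_sd(mean_count, 0.05)   # hypothetical 5% CV

print(f"Poisson SD:       {poisson_sd:.1f}")
print(f"Overdispersed SD: {overdispersed_sd:.1f}")
print(f"Ratio:            {overdispersed_sd / poisson_sd:.1f}")
```

Under this assumed amount of year-to-year variation, the spread of counts is several times wider than the Poisson spread, so a Poisson-based interval would be far too narrow.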

Finally, to reemphasize one of the main points from the paper (and from C.V. Phillips, "Quantifying and Reporting Uncertainty from Systematic Errors," Epidemiology 14(4):459-466, 2003), even if stochastic error can be quantified, it often dramatically understates total error. If we did believe that a Poisson process described the variation in incidence rates across years and were interested in estimating the average rate rather than just that year's rate, the calculated stochastic error would be on the order of 1 part in 50. This compares to the estimated measurement error of about 1 part in 10 (on each side of what would be the point estimate). Perhaps the stochastic error is large enough to not qualify as "inconsequential," but it is still far from the most important source of error.

Thanks again to the commentator and other readers for interest in the article.
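As a rough check on the comparison of stochastic and measurement error made in this reply, the sketch below assumes that "1 part in 50" refers to the half-width of the approximate 95% Poisson interval relative to the count of 8,650, and treats the measurement error as a relative error of 0.10 on each side. Both readings are assumptions, and the variable names are illustrative.

```python
import math

count = 8650
z = 1.96  # normal quantile for an approximate 95% interval

# Relative half-width of the Poisson interval: z * sqrt(count) / count
stochastic_rel = z / math.sqrt(count)

# Measurement error quoted as about 1 part in 10 on each side
measurement_rel = 0.10

print(f"Stochastic error:  about 1 part in {1 / stochastic_rel:.0f}")
print(f"Measurement error: about 1 part in {1 / measurement_rel:.0f}")
print(f"Measurement / stochastic: {measurement_rel / stochastic_rel:.1f}")
```

On these assumptions the stochastic error is roughly 1 part in 47, consistent with "on the order of 1 part in 50", and the measurement error is several times larger.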

## Competing interests

Listed in paper authorship.