# Measuring the prevalence of disease with imperfect diagnostic tests Photo credit to Gerd Altmann

In this article, we will explain how to measure the prevalence of a disease with an imperfect test (sensitivity and/or specificity less than 100%). We will assume that the reader understands the information in the previous articles on binary classification.

## Setting up the problem

When we measure a group of 100 people without a disease using a test with 95% specificity, we expect 95 true negatives and 5 false positives. In the same way a coin flipped 10 times is does not always give 5 heads and 5 tails, the situation above does not alway give 95 true negatives and 5 false positives. There is a probability distribution function describing the frequency of deviation from the expected value. To make this a little easier to interpret we are going to start by examining probability mass functions, probability functions for discrete events.

Consider the a weighted coin flip with probability $$p$$ of coming up heads. The probability of $$n$$ heads in $$x$$ flips is the product of the probability of those individual events, $$p^{n}\times (1-p)^{x-n}$$, times the number of combinations those events may occur in, $$\frac{x!}{n!(x-n)!}$$. The identical math applies to a medical diagnostic test applied to only positive patients. For a sensitivity $$S_n$$, we can compute the probability of obtaining $$n$$ number of positive results over $$x$$ positive patients.

$$P\left(n_\text{pos},x_\text{tests on pos people},S_n\right) = (S_n)^{n} (1-S_n)^{x-n} \frac{x!}{n!(x-n)!}$$

[Caption] Interactive plot of the probability mass function of weighted coin flips. Adjust the weight of the coin by dragging the red dot on the x-axis. Adjust the number of coin flips with the red dot on the right y-axis. Zoom in by highlighting regions of the graph, double click to restore. The width of this distribution is wider than most people intuit. Once the graph becomes a line (large sample sizes), the absolute value of the graph numbers will not be accurate if the graph is zoomed in so that only a portion of the bell curve is shown.

When we measure a population, we do not know who is positive and negative (that is why we have a test). All of the positive people form a distribution depending on the sensitivity and all of the negative people form a distribution depending on specificity. Notation is tricky here. $$P_\text{pos}$$ is the probability function of positive results for the patients who are really positive. $$P_\text{neg}$$ is the probability function of positive results for the patients who are really negative.

\begin{align} P_\text{pos}(n) &=P\left(n_\text{pos},x_\text{tests on pos people},S_n\right) = (S_n)^{n} (1-S_n)^{x-n} \frac{x!}{n!(x-n)!}\\ P_\text{neg}(n) &= P\left(n_\text{pos},x_\text{tests on neg people},(1-S_p)\right) = (1-S_p)^{n} (S_p)^{x-n} \frac{x!}{n!(x-n)!} \end{align}

To combine these distributions we have to take a sum and a product. The probability of any two independent events occuring is the product of their individual probabilities. The probability of $$n$$ positive tests is the sum of the probabilities of all situations which lead to $$n$$ positive test results.

$$P_\text{total}\left( n_\text{positives}\right) = \sum_{m = 0...n} P_\text{pos}\left( m \right) \times P_\text{neg}\left( n-m \right)$$ [Caption] Plot of the probability mass function for a group of 1000 patients with a 10% prevalence, $$S_n =$$ 90%, and $$S_p =$$ 95%. This probability function is a result of combining the probability functions from the negative and postive groups separately. Intuitively this makes some sense. We have 100 positive patients and a 90% sensitivity so we expect 90 positive results there. We have 900 negative patients and a 95% specificity so we expect 45 positive results from them. Together, we should expect about 135 positive results. However, we obtain exactly 135 less than 6% of the time because the common results are distributed between 120-150.

The graph above shows the probability mass function for the number of positive test results from a sample of 1000 patients and a 10% disease prevalence using a test with 90% and 95% sensitivity and specificity, respectively. It is important to notice if we run this experiment we will arrive at one measured prevalence (x out of 1000 patients will test positive). It is overwhelmingly likely that we will obtain 120-150 positive test results. However, unless we immediately retest the samples with a perfect test, we never find how many of the postive test results were true positive and how many were false.

This test (90%/95% sensitivity/specificity) cannot exactly measure the prevalence of this population but it can put some bounds on it. If the prevalence is 10%, we know there is a 98% chance our test returns 119-152 positive results. If the results are outside that range, we would conclude the prevalence is not 10%.

## Working with large samples

$$P_\text{pos} \left(n_\text{pos},x_\text{tests on pos people},S_n\right) = (S_n)^{n} (1-S_n)^{x-n} \frac{x!}{n!(x-n)!}$$

This equation produces computational problems for large $$n_\text{pos}$$ and $$x_\text{tests (on pos people)}$$. The first two factors are extremely small and the ratio of factorials is extremely large. To avoid this limitation of computer power, take the logarithm of both sides. Remember Stirling's approximation, $$\log(a!) \approx a\log(a)-a$$ for large $$a$$. The application of Stirling's approximation to combinatorics is a crucial part of statistical mechanics. As such, Stirling's approximation appears in the early chapters of every statistical mechanics textbook (or at least every book I am familiar with) [1,2,3].

\begin{align} P_\text{pos} \left(n_\text{pos},x_\text{tests},S_n\right) &= (S_n)^{n} (1-S_n)^{x-n} \frac{x!}{n!(x-n)!}\\ \log \bigg ( P_\text{pos} \left(n_\text{pos},x_\text{tests},S_n\right)\bigg) &= \log \left( (S_n)^{n} (1-S_n)^{x-n} \frac{x!}{n!(x-n)!}\right) \\ \log \bigg ( P_\text{pos} \left(n_\text{pos},x_\text{tests},S_n\right)\bigg) &= n\log(S_n) + (x-n)\log(1-S_n) +\log(x!) -\log(n!) -\log\big( (x-n)!\big) \\ \log \bigg ( P_\text{pos} \left(n_\text{pos},x_\text{tests},S_n\right)\bigg) &= n\log(S_n) + (x-n)\log(1-S_n) +x\log(x)-x -n\log(n)+n -(x-n)\log(x-n)+(x-n) \\ \log \bigg ( P_\text{pos} \left(n_\text{pos},x_\text{tests},S_n\right)\bigg) &= n\log(S_n) + (x-n)\log(1-S_n) +x\log(x) -n\log(n) -(x-n)\log(x-n) \\ P_\text{pos} \left(n_\text{pos},x_\text{tests},S_n\right) &= e^{n\log(S_n) + (x-n)\log(1-S_n) +x\log(x) -n\log(n) -(x-n)\log(x-n)}\\ \end{align}

The bottom is equation is far less limiting in modern computers. With my desktop and Python compiler, I can use the non-Stirling formula with samples no larger than 10k. With Stirling's approximation, I can solve for probabilities with samples in the millions without difficulty.

## Working in reverse

Unfortunately, up to this point we have been viewing the problem backwards relative to our intuition and that can cause subtle but important mistakes. We solved the problem given the prevalence is x, what are the odds the measured prevalence is b? We want to solve the problem given the measured prevalence is b, what are the odds the real prevalence is x? Our solution here is similar to p-values in this way. With p-values and here, every informed discussion must emphasize the following:

$$P\left( \text{observation}|\text{hypothesis}\right) \neq P\left( \text{hypothesis}|\text{observation}\right)$$

Part II of this discussion covers statistical inference, the methods used to estimate probability distributions based on experiments.

## References

 D. A. McQuarrie, "Introduction and Review," in Statistical Thermodynamics, pp. 1—34, 2000.

 K. A. Dill, and S. Bromberg, "Entropy and the Boltzmann Law," in Molecular Driving Forces, pp. 81—92, 2011.

 M. S. Shell, "Equilibrium and Entropy," in Thermodynamics and Statistical Mechanics: An Integrated Approach, pp. 6—20, 2015.