


|

Introduction to Statistics for Pathologists
with Emphasis on Evaluating Laboratory Tests
Robin T. Vollmer, MD, MS

Click here to download handout in 1-per-page format (431KB)

Click here to download handout in 6-per-page format (258KB)

Probability
Without probability theory there could not have been any of the techniques of statistical analyses. And
for the most part without p values, which are probabilities, and more specifically without small p values
there can be no publications. Thus, the beginning and the end must deal with probability.

So what is probability? There are at least 3 notions of it: one a little base, one based in labor and
detail, and one lofty and abstract. The beginning of probability theory had to do with gambling, more
specifically questions raised by a 17th century gambler to his friend, a French mathematician.
When we speak of probability as the chance of some event, we are historically correct, and modern
probability still deals with issues of chance and gambling. The second notion deals with scientific
experimentation. If in a series of experiments we observe how often a particular event occurs and divide
that number by the total number of experiments, then this ratio is an estimate of the probability of the
event. In both notions the probabilities comprise numbers between 0 and 1, 0 for an event that never may
occur and 1 for an event that is certain to occur. Part of the third notion of probability, then, is a
function whose range is a real number on the interval from 0 to 1.

The third notion involves set theory and abstract mathematics. Here, a probability space is a triple
(S,B,P), and you cannot speak of the probability function P, without also referring to two sets: S and
B. S is the sample space or a set whose elements comprise the outcomes of an experiment, and B is a set
of subsets of S and with particular features. First, each member of B is called an event. In addition,
B must include the null event φ. In other words, B must include the event with probability of 0.
Second, B must be closed under complementation and countable unions. In other words, if an event D+ is a
member of B, then not D+ or D0 must also be a member of B. And if D1 and D2 are members of B, then the
set {D1 or D2} must belong to B. In mathematics, these features make B a σ field and (S,B) a
topological space. Finally, P is defined so that P(S) = 1.

An example will help make these ideas less abstract. S is a set whose elements comprise patients in a
study. D+ is the subset of the patients in S who have a particular disease, so we make D+ an event in B.
Because those without the disease must also be included in B, the subset D0 is in B. Now suppose that
patients in S comprise a collection with two grades of the disease (1 and 2). If D1 and D2 are events in
B, then the event either D1 or D2 (i.e. either grade 1 or grade 2) must also belong to B. Then, with
these building blocks completed we can speak of the probability of a patient having D+, D0, D1 or D2.

What do these mathematical details buy? They lead to some interesting properties of probability spaces,
some of which have direct clinical application, and as illustrated above they lead to p values upon which
we depend. The summary properties of probability are:

| P(S) =1 | (1) |
| P(φ) =0 | (2) |
| P(D+) = 1- P(D0) | (3) |
| P(D1 or D2) = P(D1) + P(D2) - P(D1 & D2) | (4) |

Two events, D1 and D2, are mutually exclusive if the probability of observing both is zero, i.e. if P(D1
& D2) = 0.

Two Events, D (Disease) and T (Test) and Conditional Probabilities
Let us begin with the sample space of patients. Now consider two events. The first is D+, the presence
of a specific disease. The second is T+, the presence of a positive test. The probability of observing
D+, i.e. P(D+), is the prevalence of the disease. The probability of observing a positive test is P(T+)
or the prevalence of a positive test.

The probability of observing one event D+ conditional upon a second event T+ also being observed is
P(D+/T+), which is called a conditional probability. In medicine this particular conditional probability
is the positive predictive value of the positive test T for the disease D. The mathematics of probability
indicate that P(D+/T+) is calculated as:

 
A second way to think of this is that P(D+/T+) becomes the new P for a sample space when all patients
have T+, because P(T+/T+) = 1. A third way to view P(D+/T+) is as part of a product. Specifically, the
probability of simultaneous occurrence of two events D+ & T+ is the product of P(D+/T+) and P(T+):

| P(D+ & T+) = P(D+/T+) * P(T+) | (6) |

The two events, D+ and T+ are said to be statistically independent when P(D+/T+) = P(D+) (or equally when
P(T+/D+) = P(T+), and in this circumstance:

| P(D+ & T+) = P(D+) * P(T+) | (7) |

This is the only true notion of statistical independence. More commonly the term independent is misused,
as in, for example, a statement like, "tumor stage and grade provided independent prognostic information
in the Cox model of survival time". This use of independent is most often incorrect, because in fact
such prognostic variables are often closely related to one another, that is, equation 7 does not hold.
For example, stage and tumor grade are seldom independent of one another, because the conditional
probability of observing high stage given high grade is not equal to the probability of high stage, i.e.
for most tumors

| P(High Stage / High Grade) ≠ P(High Stage) | (8) |

Instead, for most tumors

| P(High Stage / High Grade) > P(High Stage) | (9) |

Instead of independent information, what these two variables can provide in multivariable models is
additive prognostic information.

Random Variables, Distributions and Distribution Functions
Random variables come into play when the events are numerical or when they can be mathematically mapped
into numbers. For example, in the original sample space the elements might comprise patients (or their
biopsies), but the data we use are numerical measurements like tumor grade, tumor size, and time to tumor
recurrence. To see this more explicitly consider a typical clinical data set with the patients in rows
and with their specific random variable observations detailed in columns as follows:

| Patient | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
| 1 | 1 | 5.0 | 1 | 10.0 | 25. | 2 | 0 |
| 2 | 1 | 4.5 | 1 | 9.5 | 30. | 1 | 1 |
| 3 | 0 | 6.0 | 1 | 9.0 | 45 | 4 | 1 |

and so forth.

Here, the original sample space comprises patients 1, 2, 3, etc., and the x1-x7 are numerical
measurements, which for the most part can be considered random variables. (In regression analyses we
sometimes will consider them to be fixed measurements relatively insensitive to issues of chance and
probability.) Key random variables for many medical studies are the outcomes like the presence or
absence of a binary event (coded numerically as 0 or 1) or the time to some binary event such as death.

Random variables can be usefully classified into two types: discrete and continuous. Discrete random
variables take only specific numerical values, such as 0 versus 1 (a binary variable) or 1, 2, 3, 4 and 5
(a categorical variable), although they may include an infinite number of values (e.g. 0, 1, 2,
∞
). An example of a discrete random variable is tumor grade.

The Calculus of Probability
The distribution function is the representation of probability for a random variable x. It is often
symbolized abstractly as F(x), and when it refers to a specific value for x, like x=a, then it is
symbolized as F(a). F(a) is defined as the probability that x takes a value less than or equal to a. In
other words, F(a) is the probability of a specific interval [-∞ ,a] on the real line:


The probability density f(x) is related to F(x) through calculus. For example, the probability density
f(x) at a point is a differential defined mathematically as

 
Thus, f(x) is the differential of F, because

 
If x is a continuous variable, then:

 
with the limits of the integration occurring from - ∞ to a. If x is a discrete random variable that
can take only certain values, then integration becomes a sum at the discrete values of x, and

 
It is useful to think of f(x) as a histogram. For example, the figure below shows the observed f(x) for
Gleason score in a series of 100 prostate needle biopsies (data from Humphrey).


By contrast continuous random variables can take any value on the real line from -∞ to + ∞.
An example of the probability density for values of serum PSA in men without cancer is illustrated below.
The data of Babaian et al. were used and fitted with the exponential distribution to result in
exp(-PSA/2.28)/2.28. (Strictly speaking because PSA can only take values greater than or equal to zero,
the distribution for PSA was arbitrarily set at 0 for PSA values less than 0, and only positive values
are shown in the plot.)


This distribution demonstrates that for most men with negative biopsies, PSA is less than 5 ng/ml.

For many types of random variables, x, the f(x) comprise familiar mathematical functions. For example,
the binomial distribution often works for a binary x. For continuous x's there are the normal,
chi-square, exponential, F, t and gamma distributions.

Statistics
Typical data for a clinical-pathologic correlation comprise a series of patients (the original sample
space) and their associated numerical data (the random variables), and rather than collecting all
possible patients we stop with a sample of size n. If there is just a single outcome random variable, x,
then such data would consist of the series: x1, x2,
, xn of numerical values. If the underlying
distribution for x is f(x), then the sample we have collected is considered a random sample if, and only
if, the values x1, x2,
, xn can be considered statistically independent from one another, that is, if
the probability distribution of collecting this series is simply the product of observing each individual
xi value. In other words,

| f(entire series) = f(x1) * f(x2) *
* f(xn) | (15) |

This product is called a likelihood function, and it plays an important role in estimation of statistics
as well as in test of hypotheses. Although we have assumed the patients and their random variable
values, x1, x2, x3, etc. are statistically independent, it is easy to imagine how this might not be the
case. For example, consider a multi-institutional clinical trial which records tumor grade as a random
variable. Patients from one institution might have their tumors graded systematically different from
another, simply because of grading preferences by institutional pathologists. Such patients would not
then comprise a random sample with respect to the random variable grade of tumor.

One way to summarize the series {xi} is to calculate its mean. The mean is an example of a statistic,
which is a real valued function on the series {xi}. The mean is also the expected value of the random
variable, x, and its formula is:

 
The sample variance is the expected value of (x mean)2 and is a measure of the dispersion of x around
the mean. Its formula is:

 
The standard deviation is the square root of the variance. Another useful descriptive statistic is the
median, which gives the midpoint of the range of x.

What mathematical statisticians have done over the past 100 or so years is to develop many practical
results based on random variables and statistics. More recently, applied statisticians and software
engineers have developed strong software packages which make many of these results and tools available to
any of us with a desktop computer and data. As long as the fundamental rules regarding probability,
random variables, and sampling are satisfied, we can then reach conclusions based on the analysis of the
data.

Reference Limits for a Clinical Random Variable, x
Historically, we have referred to "normal" limits, and by this we meant the limits of x expected in
normal persons. Now, however, it is less certain what comprises a normal population, or even if such a
grouped result is an appropriate reference for clinical decisions. Nevertheless, the expected values of
x for some reference group comprise a popular result, which clinicians often demand. Potential reference
groups used include students, blood donors, laboratory technologists, hospitalized patients without the
disease for which the test is used, etc. The use of any or all of these groups is problematic but
dealing with such issues lies outside of statistics.

The orthodox reference question is: what is the range of values of x expected for 95% of the reference
population? We seek a lower value xl and an upper value xu such that:


and


In other words the probability of observing a value of x between xl and xu should be 0.95, or


The parametric approach to determining xl and xu is to first examine the frequency distribution of x in
the reference population. If this distribution matches f(x) for a normal probability density, then

| xl = mean 2* standard deviation | (21) |

and

| xu = mean + 2* standard deviation | (22) |

If the frequency distribution of log(x) in the reference population matches f(x) for a normal probability
density, then

| xl = exp{ mean(log(x)) 2* standard deviation(log(x)) } | (23) |

and

| xu = exp{ mean(log(x)) + 2* standard deviation(log(x)) } | (24) |

If the frequency distribution of x in the reference population matches f(x) for an exponential
probability density, then


and


For example, consider Morgan et al.'s 1996 PSA data collected on 3475 men without prostate cancer and
1783 men with prostate cancer. PSA was given in unit values. Here is the frequency distribution of
those without cancer.


Clearly, this distribution of PSA does not appear to be normal. Perhaps it is log-normal? We can test
this visually by examining the distribution of log(PSA), and the following is what we get


Once again, it does not appear to be normal. The following shows how the frequency distribution of PSA
in this data matches an exponential distribution (line on the plot).


This looks like a good match, so we conclude that PSA in men without prostate cancer appears to follow an
exponential probability density. The 95% reference limits we get then are xl =0 and xu = 6.1 ng/ml (i.e.
three times the mean of this data). Whenever we are dealing with a clinical test that always gives
values > 0.0, we should consider the exponential probability density. It may be more appropriate than
either the normal or log-normal density functions.

The non-parametric approach to reference limits is simple. Suppose we have 100 patients in the reference
sample. We put their values of x in order from smallest to the largest. Then we count off the first
2.5% the x value at this point is xl, which in our case would be midway between the 2nd and
3rd persons values. Then we count from the bottom to the 97.5% (or equally 2.5% from the
highest), and the value here would be xu, which in our case would be midway between the 97th
and 98th persons' values.

The problem with reference limits is that they ignore important issues. For example, they deal just with
the reference population, not the distribution of x in the diseased population. In other words, they
deal with the conditional probability P(x outside range | reference pt.), but they do not address the conditional probability P(x | diseased pt.). For example, in the Morgan et al. data reference limits deal with just the upper
distribution in the following plots, but not the lower.


Clearly, the lower distribution is important to the interpretation of PSA, especially in that overlap
zone between 2 and 10 ng/ml, and in this regard reference limits deal with just half the problem.

Although reference limits can be adjusted for other variables that may be important, such as age and race
for PSA, they do this awkwardly. Consider, for example, how the 95th upper limits can vary in
the Morgan et al. PSA data because of age category and race.

| Age | Race | 95th Percentile (ng/ml) |
|
| 40-49 | Black | 2.4 |
| 40-49 | White | 2.1 |
|
| 50-59 | Black | 6.5 |
| 50-59 | White | 3.6 |
|
| 60-69 | Black | 11.3 |
| 60-69 | Black | 4.3 |

Clearly, the 95% confidence limits depend on age and race.


Clinical Conditional Probabilities (Dealing with the Diseased Population)

Whereas reference limits ignore the diseased population (except to exclude them), population conditional
probabilities like the sensitivity, the specificity, the positive predictive and the negative predictive
values of a test deal directly with the disease. If the diseased group is symbolized as D+, the
non-diseased as D0 and if those with a positive test are T+ and those with a negative test as T0, then
these values can be defined in terms of conditional probabilities as follows:

| Sensitivity = P(T+/D+) | (27) |
| Specificity = P(T0/D0) | (28) |
| Positive Predictive Value (PPV) = P(D+/T+) | (29) |
| Negative Predictive Value (NPV) = P(D0/T0) | (30) |

Both sensitivity and specificity are of special interest to pathologists, because they emphasize the test
result as an outcome. They also suggest a natural experiment, namely the collection of diseased, D+, and
non-diseased, D0, groups and the subsequent measurement of the test T on both populations. By contrast,
PPV and NPV emphasize the presence or absence of disease as the outcomes, so that these two measures are
of greater interest to clinicians and their patients. PPV and NPV also suggest an alternative
experiment, namely, the collection of those with and without a positive test with subsequent assessment
of both groups for the disease. A third, and perhaps best way to do the experiment is to prospectively
collect a group of patients and assess them simultaneously for the test and the disease.

Regardless of which experimental design is used, the results can be tabulated as a 2x2 table as follows:

| T+ | T0 |
| D+ | True Positive | False Negative |
| D0 | False Positive | True Negative |

Henceforth we will symbolize these entries as TP, FN, FP and TN, and the sum of the entries (i.e. total
number of patients) will be n. An example comes from Babaian et al's 1992 work on serum PSA in prostate
cancer. Serum PSA was assayed by monoclonal antibody in 404 prospectively collected men who also
underwent needle biopsy of the prostate. Babaian used the cutpoint of > 4 ng/ml as a positive PSA
(i.e. T+) and presence of cancer in the biopsy as D+. His 2x2 table results are as follows:

| PSA>4 | PSA<=4 |
| CA | 48 | 21 |
| Benign | 58 | 277 |

By using the relative frequency of events as estimates for probabilities, we find that the sensitivity
for this test is 48/(48+21) or 0.70. The specificity is 277/(58+277) or 0.83. The PPV is 48/(48+58) or
0.45, and the NPV is 277/(21+277) or 0.93.

The first issue to consider is whether or not the disease D+ and test T+ are statistically independent of
one another. If they are independent, then the test is worthless for diagnosing the disease. From the
rules of pathology, we know that two events, D+ and T+, are independent if and only if:

| P(D+ & T+) = P(D+) * P(T+) | (31) |

Let begin with this assumptionit is called the Null hypothesisand then test to see if it is reasonable.
We estimate P(D+) as (TP + FN)/n and estimate P(T+) as (TP + FP)/n. Then we compare the TP with the
number expected with the Null hypothesis, which is n*P(D+)*P(T+). For the TP entry of Babaian's table,
the expected number from the Null is 18.1, and the observed is 48. These appear quite different. If we
do the same comparison for the remaining 3 entries in the table and then sum the square of the
differences between observed number and the expected number with the Null hypothesis, the result is a
statistic that follows the chi-square distribution as long as the entries in the table exceed 5. For
Babain's data, the chi-square was 78, and this implied a p value of 0.0. Consequently, we can safely
reject the Null that prostate cancer and PSA are statistically independent from one another.

Test of the Null Hypothesis
The above was an example of statistical testing using the Null hypothesis (or for short just the Null).
Whereas what we often seek is to discover whether a key variable is significantly related to an outcome,
most statistical tests begin by assuming that there is no such relationship. The Null can take a variety
of forms. For example, the Null might be that the means of a random variable from two populations are
the same. In survival analysis the Null might be that the median survival time for the two populations
is the same. In a regression analysis the Null would be that the coefficient relating the dependent
variable to the independent variable is zero. Mathematical statisticians have solved the details for
large numbers of tests of the Null, and programmers have incorporated the results into software, but the
details of the Null can be hidden from general users of the programs. In general, the logic proceeds as
follows. First, some specific assumptions are made about the random variables and their statistical
distributions including the likelihood function, and it is also assumed that the observations collected
on different individuals are statistically independent from one another. From this beginning, we
additionally assume the Null hypothesis and then often use the likelihood function to estimate the value
of one or more statistics, s, from the data. The software then provides the predicted probability of
observing key values for s, given the Null hypothesis. That predicted probability is called the p value.
Specifically, if x is the value we have obtained for s, then for a two-sided test:

| p value = P( |s| >= a / Null) | (32) |

In other words, the p value is the conditional probability that the absolute value of s will exceed or
equal a given that the Null hypothesis holds. If the p value is small, we reject the Null.

The classical approach formulates the ideas of the Null hypothesis into a truth table where the Null may
or may not be true and where we make a binary decision of accepting or rejecting the Null after observing
a value of s.

| Null is True | Null is False |
| Accept Null | No Error | Type II Error |
| Reject Null | Type I Error | No Error |

In this table are two errors, Type I and Type II. If s has taken a value in some critical region like s
>= a or s <= -a, then we reject the Null hypothesis. A Type I error occurs when the Null is true
but the value of s has fallen into the critical region, so that the probability of making this error has
the same form as the p value, i.e. P( |s| >= a / Null). Because no one likes making errors, in
general the value a is selected so that the probability of a Type I error is small, typically <0.05 or
<0.01, and this chosen level is symbolized as α. With this attitude neither the exact value of
the statistic s nor the magnitude of its calculated p value count, that is, the continuous nature of the
information provided by the calculated p value is ignored.

By contrast a Type II error occurs when the Null hypothesis is false, and the value of the statistic we
measure leads us to accept the Null hypothesis. The probability of making a Type II error is symbolized
as β, and 1-β is called the power of the test. To estimate either β or the power requires
some knowledge about alternatives to the Null. If the alternative to the Null is a simple one like s
takes a particular value outside the critical region, then usually β and the power can be
determined. However, more often the alternatives to the Null are not simple. For example, the Null
might be that a coefficient takes the value of zero, and the alternative would be that the coefficient
has any one of a number of non-zero values. In general for a fixed sample size the probability of making
Type I and Type II errors are inversely related to one another, that is, as α is chosen to be
smaller, β will rise and vice versa. The only way to have both α and β small is with an
adequate sample size. This leads to and supports the common sense notion that with large sample sizes,
one can expect lower p values and lower probability of making Type II errors.

Determining an adequate sample size for simple statistical tests involving, for example, two binary
random variables is relatively straightforward with published formulas and tables readily available.
However, finding the appropriate sample size for more complex statistical models with multiple variables
is not straightforward, so that one must rely on a variety of assumptions and sometimes sophisticated
software. A few guidelines, nevertheless, can be given. 50 patients comprise a small study that will
allow for analysis of several variables, and this size study may provide important preliminary details
that allow further estimates of sample size and power to be calculated. If all 50 have hard outcomes,
then such a study might be more powerful than one with 100 patients and missing data or censored
patients. Studies with 100 to 500 patients will allow significant multivariable analyses as long as
there are at least 10 to 15 outcomes for each independent variable in the analysis (Peduzzi, 1996).
Whereas these rough rules should work for most studies in pathology, grant applications, clinical trials
or more formal study proposals expect better estimates of sample size and power. To achieve such
estimates often requires the help of a professional statistician and specialized software, and sometimes
the estimates of sample size are greater than what is needed.

Finally, one additional generalization about tests of hypotheses and power seems to apply. Parametric
tests that rely on a specific distribution for the data, such as the normal or chi-square distributions,
are generally more powerful for any given level of α than their non-parametric counterparts.

P < 0.05?
A p value less than 0.05 has become nearly synonymous with statistical significance. In other words,
with a p<0.05 we commonly reject the Null, because the chance that this result is due to random
fluctuation within the Null is 1 in 20. Others may prefer an α of 0.01. But what about p<0.06,
which implies that there was a 1 in 17 chance of the result occurring due to randomness? In fact, no
choice of α is justified by either mathematical or statistical theory. Rather, the choice is
arbitrary, as is any cutpoint in the continuous p. Here we face the antithesis between the discrete and
the continuous. The binary decision of accepting or rejecting the Null is related to the continuous p
value through a relatively arbitrarily chosen α level. I suggest that instead of being trapped in
this dilemma we should step away from the classical paradigm of accepting vs. rejecting the Null and
consider all the information provided by the magnitude and setting of the p value.

In general p values result from a statistic, like a chi-square, t value or an F, and each has its own
distribution. Large values of chi-square, t or F correspond to low p values, and the magnitude of the
statistic relates directly to the strength of the association between the variables of interest. Thus
the p values and their corresponding statistics provide more information than simply whether or not to
reject the Null. They tell us the strength of association with outcome and the order of importance of
explanatory variables. For example, in a regression analysis, where we examine how a key outcome depends
on one or more explanatory variables, the variables most strongly related to outcome will be the ones
with the largest statistic and the lowest p values. These will also be the ones which explain more of
the variance in the data.

The values of the statistics and their p values should also be considered in the context of the study. P
values of 0.01 to 0.04 for variables in a study of 400 patients with many hard outcomes may not be very
important, especially if they are expensive to observe and are poorly reproduced from one study to the
next. When such variables have been selected from a larger pool by elimination based on the data, their
statistics may be inflated due to what Harrell describes as phantom degrees of freedom, and it is not
unusual for such variables not to be validated in repeat studies. By contrast variables in the same
study with p values of < 0.0001 are likely to be validated in new studies, although often with larger
p values. At the same time other variables with p values larger than 0.05 may be important. For
example, a commonly known and cheaply observed variable like gender, which may have a p value of 0.06
0.10 could be important, especially if one wants to control for gender while testing the importance of
other explanatory variables.

ROC Curves
In addition to the chi-square test for independence between a disease D and a Test T, the sensitivity and
specificity derived from the 2 x 2 table lead to a graphical test for the importance of the test. This
is the ROC curve, which is a plot of sensitivity on the vertical axis versus 1-specificity on the
horizontal axis. To repeat, sensitivity is the conditional probability P(T+/D+), i.e. the probability of
a positive test in a patient with the disease. Specificity is P(T0/D0), i.e. the probability of a
negative test in a patient without the disease. Because specificity is a conditional probability, it
follows the rules of probability. Thus, 1 specificity is the probability of a false positive test or
P(T+/D0). Specifically,

| 1 P(T0/D0) = P(T+/D0) | (33) |

Because the sensitivity is the true positive probability or P(T+/D+), what the ROC curve does is to
compare the true positive probability (i.e. sensitivity) against the false positive probability. An
example follows.


On the plot are two studies regarding the use of serum PSA and the presence of prostate cancer on needle
biopsy. The curved line shows the results obtained by Catalona et al., and the points are from the study
by Brawer et al. Each bend in the line for Catalona's data and each point for Brawer's data represent
different cutpoints used for defining a positive value of PSA. The straight line shows where P(T+/D+)
would equal P(T+/D0). Any points falling along this line indicate when the true positive rate equals the
false positive rate, a situation for which the test would be worthless, because a positive result would
not change the odds for disease. For example, using the rules of conditional probabilities and
straightforward algebra demonstrates that for points on the straight line:

| Sensitivity = 1 Specificity | (34) |

| or P(T+/D+) = P(T+/D0) | (35) |

 

 
Dividing both sides by P(T+) and algebraically rearranging the terms leads to:

 
This result implies that

| Odds of Disease Given T+ = Odds of Disease Regardless of T | (39) |

In other words, when the sensitivity equals 1 specificity, the test provides no useful information
about the presence of D. All this flows from the definition of conditional probabilities and simple
algebra.

By contrast an ideal test should have a sensitivity of 1 and a false positive rate of 0. In other words
for an ideal diagnostic test, P(T+/D+) should equal to 1, and P(T+/D0) should equal to 0. Thus, the point
for the ideal test should be in the upper left corner of the ROC curve. Furthermore, the farther the
observed points appear from the straight line and the closer they appear to the upper left corner, the
better the test is. Another way of saying this is that an ideal test should have a large area under the
ROC curve.

The way the D0 population is identified is important and affects the result of P(T+/D0) and consequently
the appearance of the ROC. For example, in both Catalona's and Brawer's studies, the D0 groups were
defined as those with negative prostate biopsies or who did not have biopsies because of low values of
PSA or normal digital rectal exam.

Morgan et al. defined the D0 group to include men who did not have biopsies but who were thought unlikely
to have prostate cancer because of clinical and follow-up information. An ROC plot comparing the results
of Catalona et al's study for which all underwent biopsy of the prostate with those of Morgan et al.
follows. The line represents Catalona et al.'s ROC, and the points represent Morgan et al.'s study, and
these two ROC plots demonstrate that PSA is a better test for a large population without indication for
biopsy than it is for a population with indications for biopsy. The plots also demonstrate the
importance of careful choice of D0 populations.


In addition to demonstrating how close, or far, a test's cutpoints may lie from the non-ideal line, the
ROC curve can be used to compare the results of different tests or different observers. In the above
figure the results for Catalona's group are reasonably close to those for Brawer's group suggesting that
their results are comparable. The next figure shows the comparison between the ROC curve for PSA (line)
versus that for PSA density (points) which is PSA divided by the prostate volume as measured by
ultrasound.


The fact that the ROC curves appear close suggests that PSA and PSA density provide about the same
diagnostic information.

Finally, the next ROC compares the diagnostic ability of several cytopathologists to detect atypical
cells in urine cytology specimens. The data come from VanderPol et al.


In this ROC plot the points represent four different cytopathologists using different thresholds for an
abnormal urine cytology specimen, and the gold standard for the study was obtained from cystoscopic exam,
but not biopsy. Because some points lie close to the non-ideal line, urine cytology is not a helpful
diagnostic test, at least for some thresholds and some cytopathologists. In summary, ROC curves provide
a useful visual tool for comparing tests, for comparing different studies of a particular test, or for
comparing the diagnostic ability of different pathologists.

Bayes' Rule
Whereas those evaluating a new diagnostic test often dwell on the sensitivity and specificity of the
test, of more practical importance for patients and clinicians is the PPV of the test, i.e. P(D+/T+), or
the NPV, i.e. P(D0/T0). These are the conditional probabilities that can tell them the chance of
having, or not having, the disease, given a particular test result. Bayes' rule allows us to relate
several of these conditional probabilities to one another and can be written for the specific sets D and
T as:

 
P(D+) is the prevalence of the disease, so that the positive predictive value, P(D+/T+) can be seen as:

 
In practice, the easiest part of this formula to estimate is the sensitivity of the test. All one need
do is collect a number of patients with disease and see how many have a positive test. More difficult to
know are the specificity and prevalence of the disease. To estimate specificity and prevalence, one must
collect larger numbers of patients who do not have the disease, or at least collect a sufficient number
without the disease and for whom the test might be applied. Estimates of both specificity and prevalence
reflect the make-up of the D0 sample. Consequently, the PPV for a test may have different values for
studies with different D0 groups, and failure to collect sufficient numbers and types of D0 patients can
also yield misleading results. What probably is the trickiest situation is when the published positive
predictive value came from a study with one D0 group and is now applied to some broader, or different D0
population. For example, consider the sensitivity and specificity of serum PSA for detecting prostate
cancer. Early studies published the sensitivity and specificity of various cutpoints of PSA on patients,
all of whom underwent needle biopsies of their prostate. Consequently, the D0 sample comprised men who
had some clinical indication for the biopsy, such as an enlarged prostate, obstructive symptoms or
palpable abnormality. Now, PSA is used as a screening test on men, who are mostly asymptomatic and will
not undergo biopsy.

An Application of Bayes Rule to the Diagnosis of Prostate Cancer
Consider the positive predictive value of prostate cancer, given that serum PSA exceeds a certain
threshold value of x ng/ml. Here, P(D+ | T+) is written as P(Ca | PSA > x), i.e. the probability of
prostate cancer (Ca) given that PSA exceeds x. Bayes rule tells us that this can be written as:

 
where B9 symbolizes the benign population. Recognizing P(PSA > x | Ca) as the sensitivity and P(PSA
> x | B9) as the false positive test probability (FP) and dividing both the numerator and denominator
of equation 42 by the product in the numerator simplifies the result to:

 
I consolidated published results from the literature for age and race specific sensitivities and
specificities of serum PSA (over 2700 men with prostate cancer and over 99,000 without prostate cancer),
fitted these values with distribution functions (exponential for FP and a combination of gamma and
exponential for sensitivity), and obtained an age and race specific value of P(Ca) from the SEER data.
Then I used these results in equation 43 to estimate the positive predictive value of prostate cancer.
Here, are the plotted results for black men of several age groups.


The plots demonstrate that the positive predictive values rises with cutpoint in serum PSA as expected,
that the lines for several age groups are overlapping, but that the positive predictive value is highest
for oldest men in the low values of PSA and highest for young men for the highest values of PSA. These
differences at each end of the plot reflect the importance of P(Ca).

Odds, Odds Ratios, and Other Ratios of Probabilities
In epidemiology ratios of probabilities, especially ratios of conditional probabilities, are popular.
The odds for having the disease D is calculated as:

| Odds = P(D+)/ (1-P(D+) ) | (44) |

The odds ratio of a disease with or without the positive test T is:

 
Another odds ratio comes from the comparison made in equation 41. A good clinical test occurs when the
odds of disease given the test result is much higher (or lower) than the odds of disease without use of
the test. This odds ratio is given as:

 
Notice that the probability P(D+) in the denominator does not utilize the test result, and this is
sometimes called the a priori probability of the disease. Using Bayes' theorem and algebra, this odds
ratio can be rewritten as:

 
The relative risk of having the disease for those with and without a positive test is defined as:

 
Whereas the above emphasize the disease as an outcome, the likelihood ratio emphasizes the test and is
given by:

 
(This likelihood ratio differs from another one used in general testing of hypothesis in statistics, so
that it may be better to symbolize it as likelihood ratio (lab)that is, as in for a laboratory test, T.)

In Babaian's 2x2 table on PSA, the odds ratio A was ~(.453/.547)/(.07/.93) or 10.9, the relative
risk was ~.453/(1-.93) or 6.43 and the likelihood ratio (lab) was 0.696/(1-.827) or 4.02. (The
~ marks above indicate approximations because of truncation of the fractions given.)

Overview of Common Statistical Tests
The approach that follows would probably make professional statisticians cringe, because it prescribes a
simplified approach to data analysis which can be complex. Nevertheless, I have found that this approach
helpful, because it provides a way to categorize data analyses according to the types of random variables
present. I recommend the reference texts for further important considerations. To begin, we are
interested in one outcome, or dependent, random variable, which we symbolize as y, and our questions in
analysis concern how y depends on one or more explanatory random variables, which we symbolize as x. The
analysis we choose depends on the key characteristics of y and x as summarized below.

| Key Characteristics of Random Variables |
| Continuous |
| | Infinite |
| | Always positive, like time. |
| Discrete |
| | Binary like 0 vs. 1 |
| | Categorical like 1, 2, 3, etc. |
| Outcomes Paired ? |
| Single vs. Multiple x variables |

The approach we take depends upon these key characteristics of the dependent and explanatory variables,
and finally one needs to consider tests based on a parametric model versus a non-parametric or
semi-parametric model.

| y | x | Paired | Multiple x | Parametric | Test or Model |
| binary | binary | no | no | no | Chi-square |
| binary | binary | yes | no | no | McNemar Chi-square |
| binary | binary | no | no | no | Fisher Exact |
| binary | binary | no | 3 | no | Fleiss Test of equal proportions |
| binary | binary | no | 3 | no | Mantel-Haenszel |
| categorical | categorical | no | no | no | Chi-square |
| continuous | binary | yes | no | yes | Paired t test |
| continuous | binary | no | no | yes | t test |
| continuous | categorical | no | no | yes | One Way ANOVA |
| continuous | binary | no | no | no | Wilcoxon |
| continuous | categorical | no | no | no | Kruskal-Wallis |
| continuous | continuous | yes | no | no | Correlation |
| continuous | any | yes | any | yes | Linear Regression |
| variable | variable | yes | any | yes | GLM |
| binary | any | yes | any | no | Logistic |
| Failure time | categorical | yes | no | no | Log-rank |
| Failure time | any | yes | any | semi | Cox Model |
| Failure time | any | yes | any | yes | Exponential, Weibull |

Chi-Square Tests
The Chi-square test of statistical independence between two binary or categorical y and x variables has
been the workhorse of medical statistics for decades. The Null for this test assumes that y and x are
statistically independent, that is, that P(y/x) = P(y) and P(x/y) = P(x). Then the software estimates
the probabilities P(y) and P(x) from the pooled data and without regard to each other. Next, the test
compares the number of observed results for each category of y and x with that expected from the pooled
estimates of P(y) and P(x). Specifically, it forms ratios comprising the squared difference between
observed and expected numbers divided by the expected number as follows.

(no. observed no. expected)2/no. expected

If there are r categories or possible values of y and c possible categories for x, then the total number
of possible categories using both variables is r * c. If the numbers of observations in each
combined category exceeds 5, then the sum of these ratios over all categories of y and x has an
approximate chi-square distribution with (r-1) * (c-1) degrees of freedom, so that the
statistic for this test follows a chi-square distribution. When the estimated chi-square is sufficiently
large, then the deviations of observed from expected numbers are high, and the Null is rejected. For
example, the 2x2 tabular data from Babaian et al. on PSA and a positive needle biopsy of the prostate
yielded a chi-square value of 78 and a p value close to zero. Thus the Null of statistical independence
between a PSA>4 ng/ml and presence of cancer in the biopsy could be rejected.

When the numbers in each combined category are smaller than 5, then the statistic does not follow a
chi-square distribution, and one must resort to an alternative test such as the Fisher exact test, which
relies on the geometric distribution.

Sometimes the categorical observations of y and x variables are paired. For example, if one were to
evaluate two immunohistochemical stains on a set of tumors, one tumor from each patient. In this study
the staining results would be paired, and in this circumstance one should not use the routine Chi-square
test of independence. Instead, the McNemar variant of the chi-square test is more appropriate, and its
chi-square statistic relies on just the discordant results for each pair of staining results.

Another variant of the chi-square test is required to test when one wants to test if the observed
proportion with a positive result is the same across several studies. For example, the following table
shows the observed probability that a patient had cancer of the prostate given a PSA value >= 4 ng/ml
across four different studies (Babaian et al, Catalona et al, Brawer et al, and Morgan et al). The
probability is listed as ppv, and the total number of patients is listed as n.

| Study | n | ppv |
| Babaian | 404 | 0.45 |
| Catalona | 750 | 0.34 |
| Brawer | 227 | 0.34 |
| Morgan | 5258 | 0.76 |

Whereas the ppv for the first 3 studies appear reasonable close to one another, that for the last study
is approximately twice as high. Are these results significantly different from one another? The test of
equality of proportions can provide an answer. First, a weigted estimate of the overall proportion
positive is calculated. Then this estimate with its derived variance is used to once again calculate the
difference between observed and expected values as above, and the result is another chi-square statistic.
In S-PLUS this test is done with the call to prop.test, and for these data it yielded a chi-square
statistic of 358 (p~0) for the Null that the observed proportions were the same. Thus, we can
conclude that there were significant differences between these 4 studies, and in fact the design for the
first 3 differed from that of the 4th. Whereas, the first 3 assayed serum PSA in men who
underwent biopsy of the prostate, the 4th collected PSA data from 2 populations, one selected
because they had a positive biopsy for prostate cancer and the second or control group because they had
no evidence for prostate cancer in follow-up. In the 4th study all of the patients in the
first group had biopsies of the prostate, but few in the second group were biopsied. In other words the
non-cancer groups in the first 3 studies included many men with BPH; whereas, the non-cancer group in the
4th study included many who had no abnormality of their prostate. The differences between the
2nd and 3rd studies and the fourth can also be viewed in the prior ROC curves,
which demonstrate a higher sensitivity for the 4th study at any given point of 1
specificity.

Finally, the Mantel-Haenszel chi-square test is done to test for statistical independence between y and x
when there is a third confounding variable present. The third variable could be the presence of another
disease, a drug or that the observations came from different institutions or overall categories. For
example, Morgan et al published the frequencies of patients with prostate cancer (y variable) versus PSA
levels >= 4 ng/ml (x variable), stratified by 8 combined categories of age and race. A
Mantel-Haenszel test for the relationship between presence of cancer and PSA controlled for these 8
different categories produced a chi-square value of 2519 (p~0), which of course shows that there was
a significant relationship between PSA and presence of cancer.

t Tests
When one wants to test whether the means of a particular random variable, y, are the same in two separate
populations, the t test is appropriate, so long as y is normally distributed and its variance is the same
in the two populations. Consider the follow example. Serum PSA was measured on men prior to
prostatectomy, and at prostatectomy the men were sorted into 2 groups: the first without extra-capsular
tumor and the second with extra-capsular tumor. The question to consider is whether these two groups had
the same mean of PSA. Because PSA was not normally distributed, we used the natural logarithm
transformation, and the following plot shows that log(PSA) in these 2 groups was approximately normally
distributed (the top is for those without extra-capsular tumor and the bottom is for those with
extra-capsular tumor).


Although these graphs do not demonstrate a large difference in PSA, the t test showed that the means for
the 2 groups were significantly different (t = -2.2, p = 0.02). The t test can also be used for paired
data.

Once commonly used, the t test has now become less popular, and this is largely because modern Null
hypotheses often address outcomes other than the difference between 2 means and involve multiple
explanatory variables. Furthermore, in the past it was common to see medical reports include multiple
pair-wise comparisons of random variables using a t test for each pair and without regard to how this
changes the alpha for Type I errors, but now this practice is less common. Still the t test can provide
a useful way to relate a continuous variable to a binary one (i.e. 2 groups), provided that the variable
is approximately normally distributed.

One Way Analysis of Variance (AOV)
AOV provides a way to test the Null that means of a normally distributed random variable y are the same
across several levels of a categorical variable x. For example, the following boxplot shows the
distribution of Log(PSA) for several Gleason Scores. Visually, it looks as if there is little difference
in PSA in the different Gleason Scores, and the one way aov confirmed that there was no reason to reject
the Null (F = 0.4, p = 0.5). (Note, however, that the several have documented that when PSA is divided
by tumor volume, this ratio decreases with Gleason grade.)


Wilcoxon Test
The Wilcoxon test is a non-parametric one, but otherwise it deals with the same data situation as the t
test. In other words, it is appropriate for the Null that a continuous variable is the same for 2 groups
of patients, and it can deal with paired or non-paired data, although the software requires that this
information be provided. If we symbolize the variable from the 1st group as y1 and the
variable from the 2nd group as y2, then what the Wilcoxon test does first is to rank the
values of y1 and y2 collectively in one virtual row. Then it forms the sum, Ry2, of the ranks of y2.
When Ry2 is either too large or too small, the Null is rejected. This test is also equivalent to the
Mann-Whitley test based on the calculation of a Uy2 statistic, which provides the number of times the y2
variable is larger than the y1 variable. In the PSA data used above for the t test, the Wilcoxon test
for the Null that the PSA values were the same for those with and without extra-capsular tumor produced a
z statistic of -1.9 and a p value of 0.057. Note that this p value is larger than what the t test
produced and that no logarithm transformation was required.

Kruskal-Wallis Test
Just as the Wilcoxon test is the non-parametric counterpart to the t test, the Kruskal-Wallis test is the
non-parametric counterpart to the aov. This test is for the Null that the values of a continuous y
variable in several categories of an x variable are the same. Like the Wilcoxon test, the Kruskal-Wallis
test orders the values of y along a single virtual row and then sums the ranks for each category of x. It
then computes an H statistic based on the sum of squared values of these ranks divided by the number of
patients in each x category. If k is the number of categories of x, then H has an approximate chi-square
distribution with k-1 degrees of freedom, so that the final test statistic is a chi-square. In the PSA
data used above for the aov above, the Kruskal-Wallis test for the Null that the PSA values were the same
for the different Gleason scores produced a chi-square statistic of 2.7 and a p value of 0.75, so that
once again the Null cannot be rejected.

Correlation
Correlation deals with how closely one continuous variable, y, is related to a second continuous
variable, x. The most commonly used measure of correlation is the Pearson correlation coefficient, r,
which can vary from -1 (negative, perfect correlation) to +1 (positive, perfect correlation). When r =
0, there is no correlation. The Pearson correlation coefficient is closely related to linear regression
and is calculated by relating values of y and x to their respective means. The Spearman correlation
coefficient uses the same algorithm but is performed on the ranks of y and x rather than on the raw data,
and it also has the same range and interpretation. Consider the following plot relating pre-operative
serum PSA to prostate tumor volume as measured by morphometric technique in the prostatectomy specimen
(Humphrey et al data).


Clearly, this plot demonstrates that the relationship between PSA and tumor volume is noisy. The Pearson
correlation coefficient was 0.55, and the Spearman coefficient was 0.23. The square of the Pearson
correlation coefficient estimates the fraction of the deviation explained by the regression relationship,
and for this example it was just 30%.

Linear Regression
Linear regression analysis seeks to model the value of a continuous variable, y, in terms of one or more
x variables, so that y can be expressed by the equation:

| y = α + β1*x1 + β 2*x2 + β 3*x3 +
| (50) |

Here, α is an intercept coefficient and the β are weighting coefficients to be multiplied times
the values of the respective x variables. The y variable can be a transformed value of y such as log(y),
and the x variables can also be transformed. If we symbolize the right hand side of the above equation
as f(α,β1, β2, β3,
), then the linear regression algorithm estimates the values of
α,β1, β2, β3,
by using calculus and linear algebra to minimize the square of the
deviations between y and f(α,β1, β2, β3,
) over all the patient observations. In
other words, the algorithm estimates coefficients so that

| Sum of Squared Deviations = Σ [yi fi(α,β1, β2, β3,
)]2 | (51) |

is minimized across all the patients (indexed here by i). If the residual errors are normally
distributed, then t statistics can be used to test the Nulls that the individual α,β1, β2,
β3,
etc. are equal to 0. If the p value for any of these coefficients is low, then one rejects the
Null and concludes that that particular x is significantly related to y.

For example, consider once again Humphrey et al.'s data relating tumor volume in the prostatectomy
specimen to pre-operatively measured variables, including serum PSA, and which was illustrated in the
last plot. Linear regression analysis demonstrated that factors in addition to PSA were important. PSA
density and Gleason score did not relate to tumor volume (p>0.10), but the 4 pre-operatively measured
variables in the following table combined to significantly relate to post-operatively measured tumor
volume in the whole prostate.

| Variable | Coefficient | t | p value |
| PSA | 0.202 | 7.0 | 0.000 |
| Tumor Length | 0.124 | 2.7 | 0.0076 |
| Exp(no. Pos. Cores) | 0.048 | 6.0 | 0.000 |
| Age | -0.0082 | -2.0 | 0.049 |

The number of positive cores is abbreviated "no. Pos. Cores", and the reason the exponential
transformation was used was because a preliminary plot of tumor volume versus the number of positive
cores suggested an exponential relationship. The intercept parameter α was not significantly
different from zero and so was dropped. The following histogram demonstrates that its residuals
approximated a normal distribution, so that it appeared safe to rely on normal theory for the results.


This time, the combination of these variables explained 76% of the variation in the data, but this still
left too much noise to encourage its use in a clinical setting, as demonstrated in the following plot.
Here, tumor volume predicted by the model appears on the horizontal axis, and observed tumor volume
appears on the vertical axis.


Because there is so much residual noise in the plot, it demonstrates that models and variables with low p
values and high levels of significance may not be sufficient to produce a clinically useful model.

The General Linear Model (GLM)
The GLM is analogous to the linear model except that GLM does not assume that the residuals are normally
distributed. Nevertheless, GLM does assume that the variance of the dependent random variable, y, is
constant. The GLM has the same linear form as in linear regression:

| Y = α + β1*x1 + β 2*x2 + β 3*x3 +
| (52) |

However, Y now represents a transformation of the originally observed y variable, and the y variables may
be either continuous or categorical.

Instead of obtaining a least squares fit to the collected data of {y, x1, x2,
}, the GLM obtains
estimates of α and β coefficients that maximize a likelihood function, L, or its natural
logarithm ln(L). L is the sample distribution function and provides the probability of observing a
particular sample of patients, their outcomes y, and their x1, x2, x3,
values, given particular values
of the α and β coefficients (see equation 33). What GLM does then is to choose values of
α and β so that this sample probability is maximized. The logic follows the argument that if
we observed the collected sample, then its probability, L, must be relatively high. Consequently, we
select values of α and β to maximize L. The exact form of L depends on the nature of y. GLM
obtains its solutions for α and β coefficients through an iterative least squares fitting
procedure.

The Null for the GLM is one without using any of the x variables (or equally all their β are zero),
and the likelihood function for this Null is symbolized as Lo. To test the importance of one or more x
variables we compare the final maximized L using the variables to the initial Lo without the variables.
It turns out that -2*ln(Lo/L) has a chi-square distribution with the number of degrees of freedom equal
to the number of x variables used in the model, and the result is termed the "likelihood ratio test" for
testing the significance of one or more of the x variables, because it depends on the ratio Lo/L. The
next model discussed, the logistic model, provides an example. Others can be found in the McCullach and
Nelder text.

The Logistic Model
The logistic model deals with a dependent random variable, y, which is binary, that is, either 0 or 1,
and the transformation of y which is considered linear is the natural logarithm of the odds as follows:

| Log{odds (y=1) } = α + β1*x1 + β 2*x2 + β 3*x3 +
| (53) |

Because the odds (y=1) is defined in terms of probability P(y=1) as:

| Odds (y=1) = P/(1 P), | (54) |

the probability P can be written as:

| P(y=1) = 1 / {1+exp(-E) } | (55) |

with E given as:

| E = α + β1*x1 + β 2*x2 + β 3*x3 +
| (56) |

The logistic model can be used to test Null hypotheses that the x variables are not related to P(y=1) as
well as to model the relationship between P(y=1) and the x variables; however, Peduzzi et al. suggest
that for the logistic analysis to produce reliable results, the data should include at least 10 events
for each x variable with event being defined as the smaller number of those with either y=1 or y=0.

Using the Logistic Model to Compare Two Diagnostic Tests
The following example uses the logistic model to test whether serum levels of Troponin I add diagnostic
information to serum levels of CK-MB for the diagnosis of acute myocardial infarction (AMI). The data
come from a report by Chang et al. (1998) and comprise 110 patients. They considered both serum tests
as binary results, that is, as either negative or positive, and they reported the following table of
results that included sensitivity, specificity, PPV and NPV as percentages.

| Marker | Sensitivity | Specificity | PPV | NPV |
| CK-MB | 92.7 | 89.9 | 84.4 | 95 |
| Troponin | 90.2 | 95.7 | 92.5 | 94.3 |
| Both | 100 | 88.4 | 83.7 | 100 |

Although these results suggest that using both markers increases sensitivity and NPV, the entries of this
table do not provide us a statistical test of the importance of each marker for the diagnosis of AMI.
Nor do they tell how much weight to give to each of these 2 tests. A logistic regression analysis on
their reported data, however, provided greater information and produced the following table of results.

| Marker | Coefficient | p value |
| CK-MB | 3.47 | ~0.000 |
| Troponin | 4.18 | 2.2x10-7 |

The very low p values tell us that both CK-MB and Troponin I provide significant information about the
diagnosis of AMI and that Troponin adds information to what CK-MB provides. The magnitude of the
coefficients suggests that both markers are weighted approximately equally for their impact on the
diagnosis of AMI. Furthermore, the coefficients could lead to a diagnostic algorithm for applying to new
patients. For example, these results suggest that the probability of AMI (i.e. P(AMI) ) should be given
by:

| P(y=1) = 1 / {1+exp(-E) } | (57) |

with E given as:

| E = α + 3.41*CK-MB + 4.18*Troponin | (58) |

α would need to be adjusted to reflect the local overall prevalence of AMI in the local population
of patients for which cardiac markers are found to be negative. For example, if that prevalence for
those with negative CK-MB and Troponin I in the local emergency room is 0.02, then α would be
calculated as:

| α = ln (.02/.98) = - 3.89 | (59) |

Of course, before using such an approach, one would first need to validate it with a new set of study
patients.


Survival Analysis
Whereas logistic regression analysis deals with a single binary outcome, survival analysis deals with two
outcomes: a binary failure event like death and the time to the occurrence of that event. The data then
consists of the following categories:

Failure Event is 1 if it occurred at the last observed time, otherwise it is 0

Time of last observation, T

Explanatory variables, x1, x2,

The most commonly studied failure event in medicine is death, but any other binary failure event
connected with time can also be studied, and alternatives to death include tumor recurrence, metastasis,
and diagnosis of malignancy. In industrial settings the event might be time to failure of critical
equipment or a strike. Furthermore, the events need not be restricted to what we ordinarily consider
failures. For example, one could examine the event of cure and analyze the time to cure. Similarly, the
usual T variable is time, but it could be another positive, continuously increasing variable.

We assume, for the moment, that each patient is observed continuously for times less than the last
observed time. If the patient failed at the last observed time, then the value of the event is 1, and
the patient is said to be uncensored. If the patient has not failed at the last time, then the value of
event is 0, and the patient is said to be censored at the last time, or more specifically,
right-censored. One of the great strengths of survival analysis is its ability to deal with censored
patients, but there is also a cost. In general, the key results depend on and weight the uncensored
patients, so that data with many censored observations provide limited results. For example, Concato and
Peduzzi et al. warn that there should be at least 10 uncensored patients per explanatory x variable.

An example of survival data is the following table, which comes from recently published study by Vollmer
et al. regarding the serum marker CYFRA 21-1 in patients with advanced stage non-small cell lung cancer.
The purpose of this pilot study was to test whether a drop in serum CYFRA after the first round of
chemotherapy would be associated with subsequent survivalif so, then the change in CYFRA might be an
early measure of response to treatment. Three lines of the data are given as follows:

| Stage | Dead | T (months) |
| IIIB | 1 | 18.9 |
| IV | 1 | 3.1 |
| IV | 0 | 12.2 |

Here, the variable Dead was the binary event of observed death at the last time of observation, so that
entries of "1" here mean that the patient was followed to death. An entry of "0" means that the patient
was alive at last follow-up, and T gives the time of follow-up in months. 38 of their reported patients
had been observed until death and so were uncensored. Potential explanatory variables included stage,
clinical response to chemotherapy, initial serum level of CYFRA and the change in CYFRA after the first
round of chemotherapy.

Survival analysis addresses the survival probability, S(t), which is defined as the probability that the
time of failure, T, will exceed t. Specifically,


The Kaplan-Meier method derives estimates of S(t) for each t, and the resulting plot gives the estimated
S(t) versus t. For example, the following Kaplan-Meier plot provides S(t) vs. t for the CYFRA pilot
study.


The estimate of S(t) is the dark, stairstep solid line and the lighter dashed lines provide upper and
lower 95% confidence limits for the estimate. Characteristically, S(t) is 1.0 at t=0, and for increasing
values of t, S(t) becomes progressively smaller. The vertical steps down are time points when one or
more deaths occurred; and the vertical bars along the graph show time points when patients were censored,
that is, when they were not followed any further. In general such censored patients are used for the
denominator of the probability but never for the numerator. In this particular plot we see that after 20
months of follow-up the expected survival for these patients with advanced stage non-small cell carcinoma
was zero, although a few were still alive. Also characteristic is the way the confidence limits increase
as time increases, because for later times the number of observed events (i.e. uncensored patients) often
becomes small.

What much of survival analysis does is to test whether certain sub-groups of patients, or their
explanatory variables, account for different survival probabilities. For example, in the study of serum
CYFRA in advanced non-small cell carcinoma, we wanted to see if changes in this serum marker after the
first round of chemotherapy related to subsequent survival. If this happened, then such changes could
constitute an early measure of response to treatment, that is, one which is known before all courses of
the treatment are completed. The following four Kaplan-Meier plots demonstrate respectively how clinical
stage (upper left), clinical response (upper right), initial level of CYFRA (lower left), and drop in
CYFRA after the first treatment (lower right) affect survival.


These 4 curves demonstrate that whereas clinical stage and clinical response tend to produce overlapping
Kaplan-Meier curves, the initial level of serum CYFRA and a drop in CYFRA of at least 27% after the first
round of treatment tend to produce survival curves that separate from one another.

These 4 Kaplan-Meier curves represent a graphical univariate analysis of survival time. The most popular
univariate statistical test that matches this graphical analysis is the log-rank test. For example, in
the above 4 Kaplan-Meier plots, the patients were broken into 2 or 3 groups according to, respectively,
clinical stage, clinical response, initial serum level of CYFRA, and magnitude of drop in CYFRA after
first treatment. The log-rank test compares the observed survival for each of these groups against that
predicted from the entire data without the groups (e.g. from the previous Kaplan-Meier plot for this
study), and as before this comparison, summed over groups, results in a chi-square statistic. For the
above 4 univariate comparisons the log-rank tests revealed p values of: > 0.2, 0.27, 0.17, and 0.21.
Thus, by univariate analysis none of these 4 variables related to survival time.

The Calculus of Survival Probability and Hazard Function
The shapes of Kaplan-Meier survivial curves suggest an exponential type of function for S(t) like:


where h is called the hazard function. For example, the following graph shows the plots of 3 S(t) curves
with the 3 h values of 0.04, 0.085 and 0.20.


The middle plot shows a survival curve very close to the observed Kaplan-Meier curve for the CYFRA study,
and its value of h is 0.085. The top curve has a lower value of h of 0.04, and the bottom curve has a
higher value of h of 0.20. These 3 curves demonstrate a general phenomenon for survival curves. Namely,
lower hazard values imply higher survival curves and longer survival times, and higher hazard values
imply lower survival curves and shorter survival times.

Regardless of whether S(t) follows an exponential distribution like equation 61, the hazard function can
be defined as a conditional probability, namely, the instantaneous probability of failure in a small time
interval about t, given no failure up to time t. Specifically,

 
Thus, h is a differential. The numerator is the probability density of 1 S(t) given survival to time
t, so that using the definitions of S(t), a differential in calculus and conditional probabilities, h(t)
can be rewritten as:

 
Integration of both sides implies that the survival probability in general relates to the hazard function
as follows:

 
with u being a variable of integration and with the integration limits occurring from 0 to t.

The hazard function and the above details relating h to S(t) are important, when we want to perform an
analysis of survival time that utilizes or tests multiple explanatory variables, although the software
packages take care of the details of calculus for us. Probably the most popular method of multivariate
survival analysis is the Cox model.

The Cox Model
Cox introduced his method in 1972, and now over 30 years later it is used uncountable times each year.
The Cox model is designed to analyze survival time and to relate it to multiple explanatory variables,
symbolized once again as x1, x2,
, etc. The model assumes that the hazard function for any patient is
proportional to an unspecified baseline hazard, h0(t). Thus, the hazard, h(t), can be given by the
following equation:

| h(t) = h0(t) * exp( b1*x1 + b2*x2 + b3*x3 +
) | (65) |

The x1, x2, x3, etc. provide the values of the explanatory variables for the patient, and the b1, b2, b3,
etc. are fixed coefficients, whose values are estimated by the Cox model software. Because the baseline
hazard function, h0, is left unspecified, the Cox model is considered semi-parametric.

An important assumption of the routine Cox model is that neither the coefficients, b1, b2, b3, etc., nor
the values of the explanatory variables, x1, x2, x3, etc., change with time. Thus, the exponential term
in equation 65 can be removed from inside the integral and S(t) can be rewritten as:

 
Notice also that equation 63 can also be written using the natural logarithm as:

| log(h/h0) = b1*x1 + b2*x2 + b3*x3 +
| (67) |

Although this looks like a linear regression equation, the exponential functions here and in the
expression for survival distributions imply that this equation cannot be solved using straightforward
least squares techniques. Instead, the Cox model obtains solutions for b1, b2, b3, etc. through an
iterative process that maximizes a partial likelihood function. In practice this process has been found
to be robust. Furthermore, with modern desk-top computers, the analysis executes in seconds. In
addition to estimating the values of b1, b2, b3, etc., the Cox algorithm also estimates their standard
errors, obtains p values for the Null hypotheses that their values are 0, and estimates the overall model
likelihood ratio.

To illustrate, consider once again the CYFRA pilot study. Whereas, the four previously examined
variables yielded relatively high p values by the log-rank test, the univariate approach examined just
one variable at a time while leaving the other three uncontrolled. The following table demonstrates how
these 4 variables related to hazard (and thus via equation 62 to S(t) ).

| Variable | Coefficient(b) | se(b) | p value |
| Stage IIIa | ------ | ----- | 0.14 |
| Stage IIIb | ------ | ----- | 0.16 |
| Clinical Response | ------ | ----- | 0.18 |
| CYFRA Level | 0.61 | 0.15 | 0.00003 |
| >27% Drop in CYFRA | -1.14 | 0.41 | 0.005 |

(The coefficients and standard errors are omitted for the non-significant variables.) In other words
after controlling for the varying mass of tumor with the surrogate variable of initial serum level of
CYFRA, the study found that those whose CYFRA level dropped > 27% after the first round of
chemotherapy experienced a significant improvement in survival. These results implied that chemotherapy
was helping if CYFRA dropped by > 27%, but the results also implied that those who did not experience
such a drop were at a disadvantage.

One way to utilize the results of a Cox analysis with more than one significant explanatory variable is
to form a composite hazard score (HS). For example, the results of the CYFRA study suggest the following
HS.

| HS = 0.61*log(CYFRA) 1.14* R | (68) |

where CYFRA is the serum level before the first treatment and where R is a dummy variable that is 1 if
there was > 27% drop in CYFRA after the first treatment and otherwise 0, and where the coefficients of
the Cox model are multiplied times these variables. Thus, HS comprises information about the mass of
tumor as well as about how the tumor initially responds to chemotherapy. Using the cutpoint of HS=0.284,
the following Kaplan-Meier plot shows the expected survival for those with higher HS (lower curve) vs.
those with lower HS (upper curve).


For the patients in the group with lower HS, continuing the chemotherapy to additional courses seems
prudent, because they appear to benefit. However, for the patients in the higher HS group, one should
consider either discontinuing or changing the chemotherapy.

Testing the Assumptions of the Cox Model
One of the assumptions made in the routine Cox model is that the proportionality in the hazard does not
change with time, that is, neither the coefficients nor the explanatory variables change with time in
equation 65. Thus, it is prudent to see if this assumption holds. In S-PLUS this can be done both
visually as well as with a chi-square statistic using the cox.zph function. Nevertheless, Therneau and
Grambsch suggest that the routine Cox model often yields adequate models even when its assumptions do not
strictly hold. If the assumptions are violated more significantly, then one may have to use a Cox model
with time-dependent covariates, an alternative time scale, or one of several parametric models such as
the exponential or Weibull models.

Portability of Cox Model
If one has developed a Cox model using large numbers of patients and has validated it with new data, then
it is reasonable to use the model to predict outcomes in new patients. The proportional hazard model
implies that the survival probability for patient no. 1, i.e. S1, relates to the baseline survival
function S0 as follows:


where h1 is the hazard for patient no. 1, h0 is the baseline hazard and (h1/h0) is the exponent to which
S0 is raised. However, the same relationship also holds for any other patient, arbitrarily designated
here as patient no. 2, so that


But equation 68 can be re-written as:

| S2 = S0(h1/h0)*(h2/h1) | (71) |

implying that


Equation 70 says that if one knows the survival for patient no. 1, then the survival for patient no. 2
can be calculated from the ratio of their hazards. This ratio can be evaluated using equation 65 as
follows. If there are 3 prognostic variables with values of y1, y2, y3 in patient 2 and x1, x2, and x3
in patient 1, then

| log(h2/h1) = b1*(y1 - x1) + b2*(y2 - x2) + b3*(y3 - x3) | (73) |

Thus, the Cox model allows one to estimate survival probability, S2, in a new patient so long as the
coefficients b1, b2 and b3 are known, S1 is known and the 3 key variables are known in the original as
well as in the new patient.

The following example illustrates this approach. In 1985 Soong published details of a Cox model for
malignant melanoma (see Soong, 1985). For those of pathologic stage I and who had not had elective lymph
node dissection, he reported that three variables were important to survival: tumor thickness,
ulceration and tumor location (axial vs. other). He gave the coefficients for these 3 variables, and he
also supplied 4 survival curves for 4 prototypic patients, whose values of thickness, ulceration and
tumor location were also reported.

In 1986 Phillips et al. reported their observations on 21 cases of minimal deviation melanoma. Their
patients tumor thickness ranged from 1.6 to 10.4 mm, and the follow-up ranged from 18 to 96 months. A
Kaplan-Meier survival curve of their data is given below.


The question of interest is whether the observed long survival times for these 21 patients with minimal
deviation melanoma also occur for patients with ordinary melanoma. To evaluate this question, we used
Soong's model, because both sets of data included the tumor thickness, presence of ulceration and
location of tumor. Soong's reported coefficients for thickness, ulceration and axial tumor location were
respectively 0.4844, 0.8524 and 0.5004. We then matched each of Phillips' 21 patients to one of Soong's
prototypic patients with survival curves so that gender and age were close. We calculated the ratio
h2/h1 using equation 71 and the observed values of tumor thickness, ulceration and tumor location for
both the new patient and the prototypic patient. Finally, we calculated the expected probability of
surviving to the observed last time of follow-up for each of the minimal deviation melanoma patients
using equation 70. All this analysis assumed that the 21 patients had ordinary melanoma.

The result was that the probability of surviving to the time of follow-up in most of these patients was
high assuming their survival was like ordinary melanoma. The mean probability of surviving to the
observed time was 0.65 (range from 0.2 to 0.88), and 16 of the survival times (76%) had better than even
odds of being observed if the tumors were ordinary melanoma. Just 3 of survival times were less probable
for ordinary melanoma, because their survival probabilities were 0.20, 0.27 and 0.32. Thus, the majority
of reported patients with minimal deviation melanoma had a survival time much like ordinary melanoma, and
this result does not support the hypothesis that minimal deviation melanoma has a better survival time
than ordinary melanoma.

Laboratory Test Turnaround Time (TAT)
Clearly, one of the key variables we monitor in the laboratory is the turnaround time (TAT), that is, the
time lapse between receiving a specimen and reporting its results. Like survival time, TAT is always
greater or equal to zero. Unlike the situation in survival time, we view the reporting of a test as a
positive event, not as an event of failure. Nevertheless, the approaches used to study survival times
can be applied to TAT if we simply view the time of reporting the test as the end of its "life" in the
laboratory.

Consider an example of TAT for surgical pathology cases during a two month period of time. The
Kaplan-Meier plot for this data appears below:


One can then study and plot the TAT by method, or by pathologist. The following is a Kaplan-Meier plot
of TAT in surgical pathology for two different pathologists:


One had a mean TAT of 1.25 days, and the other had a mean TAT of 1.5 days, a difference which was
significant (p=0.00006 by log-rank test) but which was of little importance in magnitude.

Analysis of Lymph Nodes in Colon and Rectal Carcinoma
It has now been well established that patients with lymph node metastasis in colorectal carcinoma do
worse and benefit significantly from chemotherapy. In this regard pathologists know that it is important
to find as many nodes as possible in excised specimens of colon and rectum, because the more nodes one
examines, the more likely some will be found to be positive. This is simple logic, and this notion is
supported by recent observations that among those patients with N0 stage (i.e. all nodes were found to be
negative) those with fewer nodes examined survived a shorter period of time. The result implies that
when few nodes are found, there is a significant chance that metastases are missed. In other words, the
stage assignment of N0 in such patients is misleading. Although the number of 13 nodes has been
repeatedly recommended, we know that sometimes there are not 13 nodes present, no matter how carefully we
dissect the specimen. In fact, the number of lymph nodes found not only depends upon the pathologist.
It clearly also depends on how much tissue the surgeon removes. The question to consider is whether a
probability analysis can help us recognize N0 patients who are at risk of having metastases that were not
found.

The Poisson probability function provides a theoretical model that is appropriate for counting positive
lymph nodes in colorectal specimens. Specifically, it gives us an equation for the probability P of
finding x nodes positive when n total were examined. This equation is given as:

 
Here, x! stands for x factorial. α gives the
underlying probability that any node is positive for tumor, and it can be estimated from the relative
frequency of nodes that are positive. Undoubtedly, α reflects individualistic factors involving the
patient and most likely their T stage of tumor. For example, using 157 patients with colorectal
carcinoma and lymph nodal metastases, we found that the estimated value of α was significantly
related to T stage (p=1.4x10-9). The values of α for T stages T1, T2, and > T2 were respectively
0.091, 0.125, and 0.304. The following plot illustrates how the Poisson function matches our intuitive
understanding because it demonstrates that the probability of finding at least one node positive for
tumor rises with the number of lymph nodes examined.


If α > 0, then the patient is presumed to have metastases, whether or not we find them. If, by
contrast, α = 0, then the probability of finding metastases is 0, regardless of how many lymph nodes
we examine. Thus, α provides the underlying, hidden state of nodal metastases for any given
patient. Although , α is hidden, we can observe the number of positive nodes, x, and the total
number of examined nodes, n, and we can also estimate α as the ratio of x to n.

Suppose that for a given patient, α > 0, that is, there are some lymph node metastases present.
The Poisson model predicts that the probability of finding no metastases in this situation, that is, the
probability of finding a false N0 state decreases as the total number of examined nodes increases. This
result fits our intuitive understanding and is illustrated in the following plot.


In these ways the Poisson probability function seems appropriate for understanding some aspects of lymph
nodal in colorectal carcinoma; however, the next issue to consider is how to use the Poisson model when
we have discovered no nodes positive after examining n nodes. Here, the Bayes' theorem can help.

In this situation, the Bayes theorem will be used to estimate the conditional probability that the
patient has nodal metastases when, in fact, none were found. Specifically, we use the Bayes theorem here
to calculate the conditional probability that α>0 (remembering that when α>0 metastases
are present), when x = 0. The Bayes theorem is then written as:

 
P(Mets) is the a priori probability of lymph nodal metastases, which can be estimated from prior studies
of colorectal carcinoma, or from a current set of data. The Poisson formula (equation 74) provides
P(x=0|α>0), and the Poisson equation also indicates that P(x=0|α=0) is 1.0. In this way the
Bayes formula can be simplified to:

 
We used a study set of 213 patients with colorectal carcinoma and negative lymph nodes to test the
importance of the Bayes formula for P(Mets|x=0). We estimated the odds of no metastases from our data
according to T stage and found that these odds were respectively 5, 3.24 and 0.916 for T stages T1, T2
and > T2. We used the T stage specific values for α as before, and then calculated values of
P(Mets|x=0) for each patient. These values ranged from 0.0 to 0.45 with a median of 0.10. A Cox
survival model analysis demonstrated that the calculated P(Mets|x=0) was significantly related to overall
survival (p=0.0006) and that T stage contributed no further prognostic information (p>0.8). The
following plot demonstrates that those with P(Mets|x=0) > 0.16 had a shorter survival time (median 5
years vs. 8 years for the others).


In this fashion the combination of the Poisson model and the Bayes model provide a tool that can
segregate those with apparent N0 status into two groups: one with lower and one with higher risk for
earlier death.

References
Probabilities
- Billingsley P. Probability and measure. New York. John Wiley & Sons. 1995.

- Friestedt B, Gray L. A modern approach to probability theory. 3rd Ed. Boston. Birkauser. 1997.

- Vecchio Tj. Predictive value of a single diagnostic test in unselected populations. NEJM 1966;274:1171-1173.

- Galen RS, Gambino SR. Beyond normality. The predictive value and efficiency of medical diagnosis. New York. John Wiley & Sons. 1975.

- Beck JR, Shultz EK. The use of the relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 1986;110:13-20.
General Statistics
- Casella G, Berger RL. Statistical inference, 2nd Ed. Pacific Grove, CA. Duxbury. 2002.

- McCullagh P, Nelder JA. Generalized linear models. 2nd Ed. London. Chapman and Hall. 1989.

- Venables WN, Ripley BD. Modern applied statistics with S-PLUS. 3rd Ed. New York. Springer. 1999.
Logistic Model
- Hosmer DW, Lemeshow S. Applied logistic regression. 2nd Ed. New York. John Wiley & Sons. 2000.

- Harrell FE, Jr. Regression modeling strategies. With applications to linear models, logistic regression, and survival analysis. New York. Springer. 2001.

- Peduzzi P, Concato J, Kemper E, et al. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373-1379.
Survival Analysis
- Cox DR, Oakes D. Analysis of survival data. London. Chapman and Hall. 1984.

- Therneau TM, Grambsch PM. Modeling survival data. Extending the Cox model. New York. Springer. 2000.

- Hosmer DW, Lemeshow S. Applied survival analysis. New York. John Wiley & Sons. 1999.

- Harrell FE, Jr. Regression modeling strategies. With applications to linear models, logistic regression, and survival analysis. New York. Springer. 2001.

- Concato J, Peduzzi P, Holford TR, Feinstein AR. Importance of events per independent variable in proportional hazards analysis. I. Background, goals, and general strategy. J Clin Epidemiol 1995;48:1495-1501.

- Peduzzi P, Concato J, Holford TR, Feinstein AR. Importance of events per independent variable in proportional hazards analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol 1995;48:1503-1510.
Data Used in Analyses
- Babaian RJ, Mettlin C, Kane R, et al. The relationship of prostate-specific antigen to digital rectal examination and transrectal ultrasonography. Cancer 1992;69:1195-1200.

- Catalona WJ, Hudson MA, Scardino PT, et al. Selection of optimal prostate specific antigen cutoffs for early detection of prostate cancer: receiver operating characteristic curves. J Urol 1994;152:2037-2042.

- Brawer MK, Aramburu EAG, Chen GL, et al. The inability of prostate specific antigen to enhance the predictive value of prostate specific antigen in the diagnosis of prostatic carcinoma. J Urol 1993;150:369-373.

- Morgan TO, Jacobsen SJ, McCarthy WF, et al. Age-specific reference ranges for serum prostate-specific antigen in Black men. NEJM 1996;335:304-310.

- Van der Poel HG, Boon ME, van Statum P, et al. Conventional bladder wash cytology performed by four experts versus quantitative image analysis. Mod Pathol 1997;10:976-982.

- Humphrey PA, Baty J, Keetch D. Relationship between serum prostate specific antigen, needle biopsy findings, and histopathologic features of prostatic carcinoma in radical prostatectomy tissues. Cancer 1995;75:1842-1849.

- Chang C-C, Ip MPC, Hsu RM, Vrobel T. Evaluation of a proposed panel of cardiac markers for the diagnosis of acute myocardial infarction in patients with atraumatic chest pain. Arch Pathol Lab Med. 1998;122:320-324.

- Vollmer RT, Govindan R, Graziano SL, et al. Serum CYFRA 21-1 in advanced stage non-small cell lung cancer: an early measure of response. Clin. Cancer Research 2003;9:1728-1733.

- Soong S-J. A computerized mathematical model and scoring system for predicting outcome in melanoma patients. in Balch CM, Milton GW. Cutaneous melanoma. Philadelphia. JB Lippincott Co. 1985. chapt. 20. pp 353-371.

- Phillips ME, Margolis RJ, Merot Y, et al. The spectrum of minimal deviation melanoma: a clinicopathologic study of 21 patients. Hum Pathol 1986;17:796-806.
|

|
|