—  SHORT COURSE #62  —

Introduction to Statistics for Pathologists
with Emphasis on Evaluating Laboratory Tests
Robin T. Vollmer, MD, MS


Click here to download handout in 1-per-page format (431KB)

Click here to download handout in 6-per-page format (258KB)


Probability
Without probability theory there could not have been any of the techniques of statistical analyses. And for the most part without p values, which are probabilities, and more specifically without small p values there can be no publications. Thus, the beginning and the end must deal with probability.

So what is probability? There are at least 3 notions of it: one a little base, one based in labor and detail, and one lofty and abstract. The beginning of probability theory had to do with gambling, more specifically questions raised by a 17th century gambler to his friend, a French mathematician. When we speak of probability as the chance of some event, we are historically correct, and modern probability still deals with issues of chance and gambling. The second notion deals with scientific experimentation. If in a series of experiments we observe how often a particular event occurs and divide that number by the total number of experiments, then this ratio is an estimate of the probability of the event. In both notions the probabilities comprise numbers between 0 and 1, 0 for an event that never may occur and 1 for an event that is certain to occur. Part of the third notion of probability, then, is a function whose range is a real number on the interval from 0 to 1.

The third notion involves set theory and abstract mathematics. Here, a probability space is a triple (S,B,P), and you cannot speak of the probability function P, without also referring to two sets: S and B. S is the sample space or a set whose elements comprise the outcomes of an experiment, and B is a set of subsets of S and with particular features. First, each member of B is called an event. In addition, B must include the null event φ. In other words, B must include the event with probability of 0. Second, B must be closed under complementation and countable unions. In other words, if an event D+ is a member of B, then not D+ or D0 must also be a member of B. And if D1 and D2 are members of B, then the set {D1 or D2} must belong to B. In mathematics, these features make B a σ field and (S,B) a topological space. Finally, P is defined so that P(S) = 1.

An example will help make these ideas less abstract. S is a set whose elements comprise patients in a study. D+ is the subset of the patients in S who have a particular disease, so we make D+ an event in B. Because those without the disease must also be included in B, the subset D0 is in B. Now suppose that patients in S comprise a collection with two grades of the disease (1 and 2). If D1 and D2 are events in B, then the event either D1 or D2 (i.e. either grade 1 or grade 2) must also belong to B. Then, with these building blocks completed we can speak of the probability of a patient having D+, D0, D1 or D2.

What do these mathematical details buy? They lead to some interesting properties of probability spaces, some of which have direct clinical application, and as illustrated above they lead to p values upon which we depend. The summary properties of probability are:
P(S) =1 (1)
P(φ) =0 (2)
P(D+) = 1- P(D0) (3)
P(D1 or D2) = P(D1) + P(D2) - P(D1 & D2) (4)

Two events, D1 and D2, are mutually exclusive if the probability of observing both is zero, i.e. if P(D1 & D2) = 0.

Two Events, D (Disease) and T (Test) and Conditional Probabilities
Let us begin with the sample space of patients. Now consider two events. The first is D+, the presence of a specific disease. The second is T+, the presence of a positive test. The probability of observing D+, i.e. P(D+), is the prevalence of the disease. The probability of observing a positive test is P(T+) or the prevalence of a positive test.

The probability of observing one event D+ conditional upon a second event T+ also being observed is P(D+/T+), which is called a conditional probability. In medicine this particular conditional probability is the positive predictive value of the positive test T for the disease D. The mathematics of probability indicate that P(D+/T+) is calculated as:


A second way to think of this is that P(D+/T+) becomes the new P for a sample space when all patients have T+, because P(T+/T+) = 1. A third way to view P(D+/T+) is as part of a product. Specifically, the probability of simultaneous occurrence of two events D+ & T+ is the product of P(D+/T+) and P(T+):

P(D+ & T+) = P(D+/T+) * P(T+)(6)


The two events, D+ and T+ are said to be statistically independent when P(D+/T+) = P(D+) (or equally when P(T+/D+) = P(T+), and in this circumstance:

P(D+ & T+) = P(D+) * P(T+)(7)


This is the only true notion of statistical independence. More commonly the term independent is misused, as in, for example, a statement like, "tumor stage and grade provided independent prognostic information in the Cox model of survival time". This use of independent is most often incorrect, because in fact such prognostic variables are often closely related to one another, that is, equation 7 does not hold. For example, stage and tumor grade are seldom independent of one another, because the conditional probability of observing high stage given high grade is not equal to the probability of high stage, i.e. for most tumors

P(High Stage / High Grade) ≠ P(High Stage)(8)


Instead, for most tumors

P(High Stage / High Grade) > P(High Stage)(9)


Instead of independent information, what these two variables can provide in multivariable models is additive prognostic information.

Random Variables, Distributions and Distribution Functions
Random variables come into play when the events are numerical or when they can be mathematically mapped into numbers. For example, in the original sample space the elements might comprise patients (or their biopsies), but the data we use are numerical measurements like tumor grade, tumor size, and time to tumor recurrence. To see this more explicitly consider a typical clinical data set with the patients in rows and with their specific random variable observations detailed in columns as follows:

Patient x1 x2 x3 x4 x5 x6 x7
1 1 5.0 1 10.0 25. 20
2 1 4.5 1 9.5 30. 1 1
3 0 6.0 1 9.0 45 4 1


and so forth.

Here, the original sample space comprises patients 1, 2, 3, etc., and the x1-x7 are numerical measurements, which for the most part can be considered random variables. (In regression analyses we sometimes will consider them to be fixed measurements relatively insensitive to issues of chance and probability.) Key random variables for many medical studies are the outcomes like the presence or absence of a binary event (coded numerically as 0 or 1) or the time to some binary event such as death.

Random variables can be usefully classified into two types: discrete and continuous. Discrete random variables take only specific numerical values, such as 0 versus 1 (a binary variable) or 1, 2, 3, 4 and 5 (a categorical variable), although they may include an infinite number of values (e.g. 0, 1, 2, … ∞ ). An example of a discrete random variable is tumor grade.

The Calculus of Probability
The distribution function is the representation of probability for a random variable x. It is often symbolized abstractly as F(x), and when it refers to a specific value for x, like x=a, then it is symbolized as F(a). F(a) is defined as the probability that x takes a value less than or equal to a. In other words, F(a) is the probability of a specific interval [-∞ ,a] on the real line:

F(a) = P(x<=a)(10)


The probability density f(x) is related to F(x) through calculus. For example, the probability density f(x) at a point is a differential defined mathematically as


Thus, f(x) is the differential of F, because


If x is a continuous variable, then:


with the limits of the integration occurring from - ∞ to a. If x is a discrete random variable that can take only certain values, then integration becomes a sum at the discrete values of x, and


It is useful to think of f(x) as a histogram. For example, the figure below shows the observed f(x) for Gleason score in a series of 100 prostate needle biopsies (data from Humphrey).



By contrast continuous random variables can take any value on the real line from -∞ to + ∞. An example of the probability density for values of serum PSA in men without cancer is illustrated below. The data of Babaian et al. were used and fitted with the exponential distribution to result in exp(-PSA/2.28)/2.28. (Strictly speaking because PSA can only take values greater than or equal to zero, the distribution for PSA was arbitrarily set at 0 for PSA values less than 0, and only positive values are shown in the plot.)



This distribution demonstrates that for most men with negative biopsies, PSA is less than 5 ng/ml.

For many types of random variables, x, the f(x) comprise familiar mathematical functions. For example, the binomial distribution often works for a binary x. For continuous x's there are the normal, chi-square, exponential, F, t and gamma distributions.

Statistics
Typical data for a clinical-pathologic correlation comprise a series of patients (the original sample space) and their associated numerical data (the random variables), and rather than collecting all possible patients we stop with a sample of size n. If there is just a single outcome random variable, x, then such data would consist of the series: x1, x2, …, xn of numerical values. If the underlying distribution for x is f(x), then the sample we have collected is considered a random sample if, and only if, the values x1, x2, … , xn can be considered statistically independent from one another, that is, if the probability distribution of collecting this series is simply the product of observing each individual xi value. In other words,

f(entire series) = f(x1) * f(x2) * … * f(xn)(15)


This product is called a likelihood function, and it plays an important role in estimation of statistics as well as in test of hypotheses. Although we have assumed the patients and their random variable values, x1, x2, x3, etc. are statistically independent, it is easy to imagine how this might not be the case. For example, consider a multi-institutional clinical trial which records tumor grade as a random variable. Patients from one institution might have their tumors graded systematically different from another, simply because of grading preferences by institutional pathologists. Such patients would not then comprise a random sample with respect to the random variable grade of tumor.

One way to summarize the series {xi} is to calculate its mean. The mean is an example of a statistic, which is a real valued function on the series {xi}. The mean is also the expected value of the random variable, x, and its formula is:


The sample variance is the expected value of (x – mean)2 and is a measure of the dispersion of x around the mean. Its formula is:


The standard deviation is the square root of the variance. Another useful descriptive statistic is the median, which gives the midpoint of the range of x.

What mathematical statisticians have done over the past 100 or so years is to develop many practical results based on random variables and statistics. More recently, applied statisticians and software engineers have developed strong software packages which make many of these results and tools available to any of us with a desktop computer and data. As long as the fundamental rules regarding probability, random variables, and sampling are satisfied, we can then reach conclusions based on the analysis of the data.

Reference Limits for a Clinical Random Variable, x
Historically, we have referred to "normal" limits, and by this we meant the limits of x expected in normal persons. Now, however, it is less certain what comprises a normal population, or even if such a grouped result is an appropriate reference for clinical decisions. Nevertheless, the expected values of x for some reference group comprise a popular result, which clinicians often demand. Potential reference groups used include students, blood donors, laboratory technologists, hospitalized patients without the disease for which the test is used, etc. The use of any or all of these groups is problematic but dealing with such issues lies outside of statistics.

The orthodox reference question is: what is the range of values of x expected for 95% of the reference population? We seek a lower value xl and an upper value xu such that:

F(xl) = 0.025(18)


and

1 – F(xu) = 0.025(19)


In other words the probability of observing a value of x between xl and xu should be 0.95, or

F(xu) – F(xl) = 0.95(20)


The parametric approach to determining xl and xu is to first examine the frequency distribution of x in the reference population. If this distribution matches f(x) for a normal probability density, then

xl = mean – 2* standard deviation(21)


and

xu = mean + 2* standard deviation(22)


If the frequency distribution of log(x) in the reference population matches f(x) for a normal probability density, then

xl = exp{ mean(log(x)) – 2* standard deviation(log(x)) }(23)


and

xu = exp{ mean(log(x)) + 2* standard deviation(log(x)) }(24)


If the frequency distribution of x in the reference population matches f(x) for an exponential probability density, then

xl = 0(25)


and

xu = 3 * mean(26)


For example, consider Morgan et al.'s 1996 PSA data collected on 3475 men without prostate cancer and 1783 men with prostate cancer. PSA was given in unit values. Here is the frequency distribution of those without cancer.



Clearly, this distribution of PSA does not appear to be normal. Perhaps it is log-normal? We can test this visually by examining the distribution of log(PSA), and the following is what we get



Once again, it does not appear to be normal. The following shows how the frequency distribution of PSA in this data matches an exponential distribution (line on the plot).



This looks like a good match, so we conclude that PSA in men without prostate cancer appears to follow an exponential probability density. The 95% reference limits we get then are xl =0 and xu = 6.1 ng/ml (i.e. three times the mean of this data). Whenever we are dealing with a clinical test that always gives values > 0.0, we should consider the exponential probability density. It may be more appropriate than either the normal or log-normal density functions.

The non-parametric approach to reference limits is simple. Suppose we have 100 patients in the reference sample. We put their values of x in order from smallest to the largest. Then we count off the first 2.5% the x value at this point is xl, which in our case would be midway between the 2nd and 3rd persons values. Then we count from the bottom to the 97.5% (or equally 2.5% from the highest), and the value here would be xu, which in our case would be midway between the 97th and 98th persons' values.

The problem with reference limits is that they ignore important issues. For example, they deal just with the reference population, not the distribution of x in the diseased population. In other words, they deal with the conditional probability P(x outside range | reference pt.), but they do not address the conditional probability P(x | diseased pt.). For example, in the Morgan et al. data reference limits deal with just the upper distribution in the following plots, but not the lower.



Clearly, the lower distribution is important to the interpretation of PSA, especially in that overlap zone between 2 and 10 ng/ml, and in this regard reference limits deal with just half the problem.

Although reference limits can be adjusted for other variables that may be important, such as age and race for PSA, they do this awkwardly. Consider, for example, how the 95th upper limits can vary in the Morgan et al. PSA data because of age category and race.
Age Race 95th Percentile (ng/ml)
40-49 Black 2.4
40-49 White 2.1
50-59 Black 6.5
50-59 White 3.6
60-69 Black 11.3
60-69 Black 4.3

Clearly, the 95% confidence limits depend on age and race.



Clinical Conditional Probabilities (Dealing with the Diseased Population)


Whereas reference limits ignore the diseased population (except to exclude them), population conditional probabilities like the sensitivity, the specificity, the positive predictive and the negative predictive values of a test deal directly with the disease. If the diseased group is symbolized as D+, the non-diseased as D0 and if those with a positive test are T+ and those with a negative test as T0, then these values can be defined in terms of conditional probabilities as follows:

Sensitivity = P(T+/D+)(27)
Specificity = P(T0/D0)(28)
Positive Predictive Value (PPV) = P(D+/T+)(29)
Negative Predictive Value (NPV) = P(D0/T0)(30)


Both sensitivity and specificity are of special interest to pathologists, because they emphasize the test result as an outcome. They also suggest a natural experiment, namely the collection of diseased, D+, and non-diseased, D0, groups and the subsequent measurement of the test T on both populations. By contrast, PPV and NPV emphasize the presence or absence of disease as the outcomes, so that these two measures are of greater interest to clinicians and their patients. PPV and NPV also suggest an alternative experiment, namely, the collection of those with and without a positive test with subsequent assessment of both groups for the disease. A third, and perhaps best way to do the experiment is to prospectively collect a group of patients and assess them simultaneously for the test and the disease.

Regardless of which experimental design is used, the results can be tabulated as a 2x2 table as follows:

T+ T0
D+ True Positive False Negative
D0 False Positive True Negative

Henceforth we will symbolize these entries as TP, FN, FP and TN, and the sum of the entries (i.e. total number of patients) will be n. An example comes from Babaian et al's 1992 work on serum PSA in prostate cancer. Serum PSA was assayed by monoclonal antibody in 404 prospectively collected men who also underwent needle biopsy of the prostate. Babaian used the cutpoint of > 4 ng/ml as a positive PSA (i.e. T+) and presence of cancer in the biopsy as D+. His 2x2 table results are as follows:

PSA>4 PSA<=4
CA 48 21
Benign 58 277

By using the relative frequency of events as estimates for probabilities, we find that the sensitivity for this test is 48/(48+21) or 0.70. The specificity is 277/(58+277) or 0.83. The PPV is 48/(48+58) or 0.45, and the NPV is 277/(21+277) or 0.93.

The first issue to consider is whether or not the disease D+ and test T+ are statistically independent of one another. If they are independent, then the test is worthless for diagnosing the disease. From the rules of pathology, we know that two events, D+ and T+, are independent if and only if:

P(D+ & T+) = P(D+) * P(T+) (31)


Let begin with this assumption—it is called the Null hypothesis—and then test to see if it is reasonable. We estimate P(D+) as (TP + FN)/n and estimate P(T+) as (TP + FP)/n. Then we compare the TP with the number expected with the Null hypothesis, which is n*P(D+)*P(T+). For the TP entry of Babaian's table, the expected number from the Null is 18.1, and the observed is 48. These appear quite different. If we do the same comparison for the remaining 3 entries in the table and then sum the square of the differences between observed number and the expected number with the Null hypothesis, the result is a statistic that follows the chi-square distribution as long as the entries in the table exceed 5. For Babain's data, the chi-square was 78, and this implied a p value of 0.0. Consequently, we can safely reject the Null that prostate cancer and PSA are statistically independent from one another.

Test of the Null Hypothesis
The above was an example of statistical testing using the Null hypothesis (or for short just the Null). Whereas what we often seek is to discover whether a key variable is significantly related to an outcome, most statistical tests begin by assuming that there is no such relationship. The Null can take a variety of forms. For example, the Null might be that the means of a random variable from two populations are the same. In survival analysis the Null might be that the median survival time for the two populations is the same. In a regression analysis the Null would be that the coefficient relating the dependent variable to the independent variable is zero. Mathematical statisticians have solved the details for large numbers of tests of the Null, and programmers have incorporated the results into software, but the details of the Null can be hidden from general users of the programs. In general, the logic proceeds as follows. First, some specific assumptions are made about the random variables and their statistical distributions including the likelihood function, and it is also assumed that the observations collected on different individuals are statistically independent from one another. From this beginning, we additionally assume the Null hypothesis and then often use the likelihood function to estimate the value of one or more statistics, s, from the data. The software then provides the predicted probability of observing key values for s, given the Null hypothesis. That predicted probability is called the p value. Specifically, if x is the value we have obtained for s, then for a two-sided test:

p value = P( |s| >= a / Null) (32)


In other words, the p value is the conditional probability that the absolute value of s will exceed or equal a given that the Null hypothesis holds. If the p value is small, we reject the Null.

The classical approach formulates the ideas of the Null hypothesis into a truth table where the Null may or may not be true and where we make a binary decision of accepting or rejecting the Null after observing a value of s.

Null is True Null is False
Accept Null No Error Type II Error
Reject Null Type I Error No Error

In this table are two errors, Type I and Type II. If s has taken a value in some critical region like s >= a or s <= -a, then we reject the Null hypothesis. A Type I error occurs when the Null is true but the value of s has fallen into the critical region, so that the probability of making this error has the same form as the p value, i.e. P( |s| >= a / Null). Because no one likes making errors, in general the value a is selected so that the probability of a Type I error is small, typically <0.05 or <0.01, and this chosen level is symbolized as α. With this attitude neither the exact value of the statistic s nor the magnitude of its calculated p value count, that is, the continuous nature of the information provided by the calculated p value is ignored.

By contrast a Type II error occurs when the Null hypothesis is false, and the value of the statistic we measure leads us to accept the Null hypothesis. The probability of making a Type II error is symbolized as β, and 1-β is called the power of the test. To estimate either β or the power requires some knowledge about alternatives to the Null. If the alternative to the Null is a simple one like s takes a particular value outside the critical region, then usually β and the power can be determined. However, more often the alternatives to the Null are not simple. For example, the Null might be that a coefficient takes the value of zero, and the alternative would be that the coefficient has any one of a number of non-zero values. In general for a fixed sample size the probability of making Type I and Type II errors are inversely related to one another, that is, as α is chosen to be smaller, β will rise and vice versa. The only way to have both α and β small is with an adequate sample size. This leads to and supports the common sense notion that with large sample sizes, one can expect lower p values and lower probability of making Type II errors.

Determining an adequate sample size for simple statistical tests involving, for example, two binary random variables is relatively straightforward with published formulas and tables readily available. However, finding the appropriate sample size for more complex statistical models with multiple variables is not straightforward, so that one must rely on a variety of assumptions and sometimes sophisticated software. A few guidelines, nevertheless, can be given. 50 patients comprise a small study that will allow for analysis of several variables, and this size study may provide important preliminary details that allow further estimates of sample size and power to be calculated. If all 50 have hard outcomes, then such a study might be more powerful than one with 100 patients and missing data or censored patients. Studies with 100 to 500 patients will allow significant multivariable analyses as long as there are at least 10 to 15 outcomes for each independent variable in the analysis (Peduzzi, 1996). Whereas these rough rules should work for most studies in pathology, grant applications, clinical trials or more formal study proposals expect better estimates of sample size and power. To achieve such estimates often requires the help of a professional statistician and specialized software, and sometimes the estimates of sample size are greater than what is needed.

Finally, one additional generalization about tests of hypotheses and power seems to apply. Parametric tests that rely on a specific distribution for the data, such as the normal or chi-square distributions, are generally more powerful for any given level of α than their non-parametric counterparts.

P < 0.05?
A p value less than 0.05 has become nearly synonymous with statistical significance. In other words, with a p<0.05 we commonly reject the Null, because the chance that this result is due to random fluctuation within the Null is 1 in 20. Others may prefer an α of 0.01. But what about p<0.06, which implies that there was a 1 in 17 chance of the result occurring due to randomness? In fact, no choice of α is justified by either mathematical or statistical theory. Rather, the choice is arbitrary, as is any cutpoint in the continuous p. Here we face the antithesis between the discrete and the continuous. The binary decision of accepting or rejecting the Null is related to the continuous p value through a relatively arbitrarily chosen α level. I suggest that instead of being trapped in this dilemma we should step away from the classical paradigm of accepting vs. rejecting the Null and consider all the information provided by the magnitude and setting of the p value.

In general p values result from a statistic, like a chi-square, t value or an F, and each has its own distribution. Large values of chi-square, t or F correspond to low p values, and the magnitude of the statistic relates directly to the strength of the association between the variables of interest. Thus the p values and their corresponding statistics provide more information than simply whether or not to reject the Null. They tell us the strength of association with outcome and the order of importance of explanatory variables. For example, in a regression analysis, where we examine how a key outcome depends on one or more explanatory variables, the variables most strongly related to outcome will be the ones with the largest statistic and the lowest p values. These will also be the ones which explain more of the variance in the data.

The values of the statistics and their p values should also be considered in the context of the study. P values of 0.01 to 0.04 for variables in a study of 400 patients with many hard outcomes may not be very important, especially if they are expensive to observe and are poorly reproduced from one study to the next. When such variables have been selected from a larger pool by elimination based on the data, their statistics may be inflated due to what Harrell describes as phantom degrees of freedom, and it is not unusual for such variables not to be validated in repeat studies. By contrast variables in the same study with p values of < 0.0001 are likely to be validated in new studies, although often with larger p values. At the same time other variables with p values larger than 0.05 may be important. For example, a commonly known and cheaply observed variable like gender, which may have a p value of 0.06 – 0.10 could be important, especially if one wants to control for gender while testing the importance of other explanatory variables.

ROC Curves
In addition to the chi-square test for independence between a disease D and a Test T, the sensitivity and specificity derived from the 2 x 2 table lead to a graphical test for the importance of the test. This is the ROC curve, which is a plot of sensitivity on the vertical axis versus 1-specificity on the horizontal axis. To repeat, sensitivity is the conditional probability P(T+/D+), i.e. the probability of a positive test in a patient with the disease. Specificity is P(T0/D0), i.e. the probability of a negative test in a patient without the disease. Because specificity is a conditional probability, it follows the rules of probability. Thus, 1 – specificity is the probability of a false positive test or P(T+/D0). Specifically,

1 – P(T0/D0) = P(T+/D0) (33)


Because the sensitivity is the true positive probability or P(T+/D+), what the ROC curve does is to compare the true positive probability (i.e. sensitivity) against the false positive probability. An example follows.



On the plot are two studies regarding the use of serum PSA and the presence of prostate cancer on needle biopsy. The curved line shows the results obtained by Catalona et al., and the points are from the study by Brawer et al. Each bend in the line for Catalona's data and each point for Brawer's data represent different cutpoints used for defining a positive value of PSA. The straight line shows where P(T+/D+) would equal P(T+/D0). Any points falling along this line indicate when the true positive rate equals the false positive rate, a situation for which the test would be worthless, because a positive result would not change the odds for disease. For example, using the rules of conditional probabilities and straightforward algebra demonstrates that for points on the straight line:

Sensitivity = 1 – Specificity (34)


or         P(T+/D+) = P(T+/D0) (35)






Dividing both sides by P(T+) and algebraically rearranging the terms leads to:


This result implies that

Odds of Disease Given T+ = Odds of Disease Regardless of T (39)


In other words, when the sensitivity equals 1 – specificity, the test provides no useful information about the presence of D. All this flows from the definition of conditional probabilities and simple algebra.

By contrast an ideal test should have a sensitivity of 1 and a false positive rate of 0. In other words for an ideal diagnostic test, P(T+/D+) should equal to 1, and P(T+/D0) should equal to 0. Thus, the point for the ideal test should be in the upper left corner of the ROC curve. Furthermore, the farther the observed points appear from the straight line and the closer they appear to the upper left corner, the better the test is. Another way of saying this is that an ideal test should have a large area under the ROC curve.

The way the D0 population is identified is important and affects the result of P(T+/D0) and consequently the appearance of the ROC. For example, in both Catalona's and Brawer's studies, the D0 groups were defined as those with negative prostate biopsies or who did not have biopsies because of low values of PSA or normal digital rectal exam.

Morgan et al. defined the D0 group to include men who did not have biopsies but who were thought unlikely to have prostate cancer because of clinical and follow-up information. An ROC plot comparing the results of Catalona et al's study for which all underwent biopsy of the prostate with those of Morgan et al. follows. The line represents Catalona et al.'s ROC, and the points represent Morgan et al.'s study, and these two ROC plots demonstrate that PSA is a better test for a large population without indication for biopsy than it is for a population with indications for biopsy. The plots also demonstrate the importance of careful choice of D0 populations.



In addition to demonstrating how close, or far, a test's cutpoints may lie from the non-ideal line, the ROC curve can be used to compare the results of different tests or different observers. In the above figure the results for Catalona's group are reasonably close to those for Brawer's group suggesting that their results are comparable. The next figure shows the comparison between the ROC curve for PSA (line) versus that for PSA density (points) which is PSA divided by the prostate volume as measured by ultrasound.



The fact that the ROC curves appear close suggests that PSA and PSA density provide about the same diagnostic information.

Finally, the next ROC compares the diagnostic ability of several cytopathologists to detect atypical cells in urine cytology specimens. The data come from VanderPol et al.



In this ROC plot the points represent four different cytopathologists using different thresholds for an abnormal urine cytology specimen, and the gold standard for the study was obtained from cystoscopic exam, but not biopsy. Because some points lie close to the non-ideal line, urine cytology is not a helpful diagnostic test, at least for some thresholds and some cytopathologists. In summary, ROC curves provide a useful visual tool for comparing tests, for comparing different studies of a particular test, or for comparing the diagnostic ability of different pathologists.

Bayes' Rule
Whereas those evaluating a new diagnostic test often dwell on the sensitivity and specificity of the test, of more practical importance for patients and clinicians is the PPV of the test, i.e. P(D+/T+), or the NPV, i.e. P(D0/T0). These are the conditional probabilities that can tell them the chance of having, or not having, the disease, given a particular test result. Bayes' rule allows us to relate several of these conditional probabilities to one another and can be written for the specific sets D and T as:


P(D+) is the prevalence of the disease, so that the positive predictive value, P(D+/T+) can be seen as:


In practice, the easiest part of this formula to estimate is the sensitivity of the test. All one need do is collect a number of patients with disease and see how many have a positive test. More difficult to know are the specificity and prevalence of the disease. To estimate specificity and prevalence, one must collect larger numbers of patients who do not have the disease, or at least collect a sufficient number without the disease and for whom the test might be applied. Estimates of both specificity and prevalence reflect the make-up of the D0 sample. Consequently, the PPV for a test may have different values for studies with different D0 groups, and failure to collect sufficient numbers and types of D0 patients can also yield misleading results. What probably is the trickiest situation is when the published positive predictive value came from a study with one D0 group and is now applied to some broader, or different D0 population. For example, consider the sensitivity and specificity of serum PSA for detecting prostate cancer. Early studies published the sensitivity and specificity of various cutpoints of PSA on patients, all of whom underwent needle biopsies of their prostate. Consequently, the D0 sample comprised men who had some clinical indication for the biopsy, such as an enlarged prostate, obstructive symptoms or palpable abnormality. Now, PSA is used as a screening test on men, who are mostly asymptomatic and will not undergo biopsy.

An Application of Bayes Rule to the Diagnosis of Prostate Cancer
Consider the positive predictive value of prostate cancer, given that serum PSA exceeds a certain threshold value of x ng/ml. Here, P(D+ | T+) is written as P(Ca | PSA > x), i.e. the probability of prostate cancer (Ca) given that PSA exceeds x. Bayes rule tells us that this can be written as:


where B9 symbolizes the benign population. Recognizing P(PSA > x | Ca) as the sensitivity and P(PSA > x | B9) as the false positive test probability (FP) and dividing both the numerator and denominator of equation 42 by the product in the numerator simplifies the result to:


I consolidated published results from the literature for age and race specific sensitivities and specificities of serum PSA (over 2700 men with prostate cancer and over 99,000 without prostate cancer), fitted these values with distribution functions (exponential for FP and a combination of gamma and exponential for sensitivity), and obtained an age and race specific value of P(Ca) from the SEER data. Then I used these results in equation 43 to estimate the positive predictive value of prostate cancer. Here, are the plotted results for black men of several age groups.



The plots demonstrate that the positive predictive values rises with cutpoint in serum PSA as expected, that the lines for several age groups are overlapping, but that the positive predictive value is highest for oldest men in the low values of PSA and highest for young men for the highest values of PSA. These differences at each end of the plot reflect the importance of P(Ca).

Odds, Odds Ratios, and Other Ratios of Probabilities
In epidemiology ratios of probabilities, especially ratios of conditional probabilities, are popular. The odds for having the disease D is calculated as:

Odds = P(D+)/ (1-P(D+) ) (44)


The odds ratio of a disease with or without the positive test T is:


Another odds ratio comes from the comparison made in equation 41. A good clinical test occurs when the odds of disease given the test result is much higher (or lower) than the odds of disease without use of the test. This odds ratio is given as:


Notice that the probability P(D+) in the denominator does not utilize the test result, and this is sometimes called the a priori probability of the disease. Using Bayes' theorem and algebra, this odds ratio can be rewritten as:


The relative risk of having the disease for those with and without a positive test is defined as:


Whereas the above emphasize the disease as an outcome, the likelihood ratio emphasizes the test and is given by:


(This likelihood ratio differs from another one used in general testing of hypothesis in statistics, so that it may be better to symbolize it as likelihood ratio (lab)—that is, as in for a laboratory test, T.)

In Babaian's 2x2 table on PSA, the odds ratio A was ~(.453/.547)/(.07/.93) or 10.9, the relative risk was ~.453/(1-.93) or 6.43 and the likelihood ratio (lab) was 0.696/(1-.827) or 4.02. (The ~ marks above indicate approximations because of truncation of the fractions given.)

Overview of Common Statistical Tests
The approach that follows would probably make professional statisticians cringe, because it prescribes a simplified approach to data analysis which can be complex. Nevertheless, I have found that this approach helpful, because it provides a way to categorize data analyses according to the types of random variables present. I recommend the reference texts for further important considerations. To begin, we are interested in one outcome, or dependent, random variable, which we symbolize as y, and our questions in analysis concern how y depends on one or more explanatory random variables, which we symbolize as x. The analysis we choose depends on the key characteristics of y and x as summarized below.

Key Characteristics of Random Variables
Continuous
Infinite
Always positive, like time.
Discrete
Binary like 0 vs. 1
Categorical like 1, 2, 3, etc.
Outcomes Paired ?
Single vs. Multiple x variables

The approach we take depends upon these key characteristics of the dependent and explanatory variables, and finally one needs to consider tests based on a parametric model versus a non-parametric or semi-parametric model.

y x Paired Multiple x Parametric Test or Model
binary binary no no no Chi-square
binary binary yes no no McNemar Chi-square
binary binary no no no Fisher Exact
binary binary no 3 no Fleiss Test of equal proportions
binary binary no 3 no Mantel-Haenszel
categorical categorical no no no Chi-square
continuous binary yes no yes Paired t test
continuous binary no no yes t test
continuous categorical no no yes One Way ANOVA
continuous binary no no no Wilcoxon
continuous categorical no no no Kruskal-Wallis
continuous continuous yes no no Correlation
continuous any yes any yes Linear Regression
variable variable yes any yes GLM
binary any yes any no Logistic
Failure time categorical yes no no Log-rank
Failure time any yes any semi Cox Model
Failure time any yes any yes Exponential, Weibull

Chi-Square Tests
The Chi-square test of statistical independence between two binary or categorical y and x variables has been the workhorse of medical statistics for decades. The Null for this test assumes that y and x are statistically independent, that is, that P(y/x) = P(y) and P(x/y) = P(x). Then the software estimates the probabilities P(y) and P(x) from the pooled data and without regard to each other. Next, the test compares the number of observed results for each category of y and x with that expected from the pooled estimates of P(y) and P(x). Specifically, it forms ratios comprising the squared difference between observed and expected numbers divided by the expected number as follows.

(no. observed – no. expected)2/no. expected

If there are r categories or possible values of y and c possible categories for x, then the total number of possible categories using both variables is r * c. If the numbers of observations in each combined category exceeds 5, then the sum of these ratios over all categories of y and x has an approximate chi-square distribution with (r-1) * (c-1) degrees of freedom, so that the statistic for this test follows a chi-square distribution. When the estimated chi-square is sufficiently large, then the deviations of observed from expected numbers are high, and the Null is rejected. For example, the 2x2 tabular data from Babaian et al. on PSA and a positive needle biopsy of the prostate yielded a chi-square value of 78 and a p value close to zero. Thus the Null of statistical independence between a PSA>4 ng/ml and presence of cancer in the biopsy could be rejected.

When the numbers in each combined category are smaller than 5, then the statistic does not follow a chi-square distribution, and one must resort to an alternative test such as the Fisher exact test, which relies on the geometric distribution.

Sometimes the categorical observations of y and x variables are paired. For example, if one were to evaluate two immunohistochemical stains on a set of tumors, one tumor from each patient. In this study the staining results would be paired, and in this circumstance one should not use the routine Chi-square test of independence. Instead, the McNemar variant of the chi-square test is more appropriate, and its chi-square statistic relies on just the discordant results for each pair of staining results.

Another variant of the chi-square test is required to test when one wants to test if the observed proportion with a positive result is the same across several studies. For example, the following table shows the observed probability that a patient had cancer of the prostate given a PSA value >= 4 ng/ml across four different studies (Babaian et al, Catalona et al, Brawer et al, and Morgan et al). The probability is listed as ppv, and the total number of patients is listed as n.

Study n ppv
Babaian 404 0.45
Catalona 750 0.34
Brawer 227 0.34
Morgan 5258 0.76

Whereas the ppv for the first 3 studies appear reasonable close to one another, that for the last study is approximately twice as high. Are these results significantly different from one another? The test of equality of proportions can provide an answer. First, a weigted estimate of the overall proportion positive is calculated. Then this estimate with its derived variance is used to once again calculate the difference between observed and expected values as above, and the result is another chi-square statistic. In S-PLUS this test is done with the call to prop.test, and for these data it yielded a chi-square statistic of 358 (p~0) for the Null that the observed proportions were the same. Thus, we can conclude that there were significant differences between these 4 studies, and in fact the design for the first 3 differed from that of the 4th. Whereas, the first 3 assayed serum PSA in men who underwent biopsy of the prostate, the 4th collected PSA data from 2 populations, one selected because they had a positive biopsy for prostate cancer and the second or control group because they had no evidence for prostate cancer in follow-up. In the 4th study all of the patients in the first group had biopsies of the prostate, but few in the second group were biopsied. In other words the non-cancer groups in the first 3 studies included many men with BPH; whereas, the non-cancer group in the 4th study included many who had no abnormality of their prostate. The differences between the 2nd and 3rd studies and the fourth can also be viewed in the prior ROC curves, which demonstrate a higher sensitivity for the 4th study at any given point of 1 – specificity.

Finally, the Mantel-Haenszel chi-square test is done to test for statistical independence between y and x when there is a third confounding variable present. The third variable could be the presence of another disease, a drug or that the observations came from different institutions or overall categories. For example, Morgan et al published the frequencies of patients with prostate cancer (y variable) versus PSA levels >= 4 ng/ml (x variable), stratified by 8 combined categories of age and race. A Mantel-Haenszel test for the relationship between presence of cancer and PSA controlled for these 8 different categories produced a chi-square value of 2519 (p~0), which of course shows that there was a significant relationship between PSA and presence of cancer.

t Tests
When one wants to test whether the means of a particular random variable, y, are the same in two separate populations, the t test is appropriate, so long as y is normally distributed and its variance is the same in the two populations. Consider the follow example. Serum PSA was measured on men prior to prostatectomy, and at prostatectomy the men were sorted into 2 groups: the first without extra-capsular tumor and the second with extra-capsular tumor. The question to consider is whether these two groups had the same mean of PSA. Because PSA was not normally distributed, we used the natural logarithm transformation, and the following plot shows that log(PSA) in these 2 groups was approximately normally distributed (the top is for those without extra-capsular tumor and the bottom is for those with extra-capsular tumor).



Although these graphs do not demonstrate a large difference in PSA, the t test showed that the means for the 2 groups were significantly different (t = -2.2, p = 0.02). The t test can also be used for paired data.

Once commonly used, the t test has now become less popular, and this is largely because modern Null hypotheses often address outcomes other than the difference between 2 means and involve multiple explanatory variables. Furthermore, in the past it was common to see medical reports include multiple pair-wise comparisons of random variables using a t test for each pair and without regard to how this changes the alpha for Type I errors, but now this practice is less common. Still the t test can provide a useful way to relate a continuous variable to a binary one (i.e. 2 groups), provided that the variable is approximately normally distributed.

One Way Analysis of Variance (AOV)
AOV provides a way to test the Null that means of a normally distributed random variable y are the same across several levels of a categorical variable x. For example, the following boxplot shows the distribution of Log(PSA) for several Gleason Scores. Visually, it looks as if there is little difference in PSA in the different Gleason Scores, and the one way aov confirmed that there was no reason to reject the Null (F = 0.4, p = 0.5). (Note, however, that the several have documented that when PSA is divided by tumor volume, this ratio decreases with Gleason grade.)



Wilcoxon Test
The Wilcoxon test is a non-parametric one, but otherwise it deals with the same data situation as the t test. In other words, it is appropriate for the Null that a continuous variable is the same for 2 groups of patients, and it can deal with paired or non-paired data, although the software requires that this information be provided. If we symbolize the variable from the 1st group as y1 and the variable from the 2nd group as y2, then what the Wilcoxon test does first is to rank the values of y1 and y2 collectively in one virtual row. Then it forms the sum, Ry2, of the ranks of y2. When Ry2 is either too large or too small, the Null is rejected. This test is also equivalent to the Mann-Whitley test based on the calculation of a Uy2 statistic, which provides the number of times the y2 variable is larger than the y1 variable. In the PSA data used above for the t test, the Wilcoxon test for the Null that the PSA values were the same for those with and without extra-capsular tumor produced a z statistic of -1.9 and a p value of 0.057. Note that this p value is larger than what the t test produced and that no logarithm transformation was required.

Kruskal-Wallis Test
Just as the Wilcoxon test is the non-parametric counterpart to the t test, the Kruskal-Wallis test is the non-parametric counterpart to the aov. This test is for the Null that the values of a continuous y variable in several categories of an x variable are the same. Like the Wilcoxon test, the Kruskal-Wallis test orders the values of y along a single virtual row and then sums the ranks for each category of x. It then computes an H statistic based on the sum of squared values of these ranks divided by the number of patients in each x category. If k is the number of categories of x, then H has an approximate chi-square distribution with k-1 degrees of freedom, so that the final test statistic is a chi-square. In the PSA data used above for the aov above, the Kruskal-Wallis test for the Null that the PSA values were the same for the different Gleason scores produced a chi-square statistic of 2.7 and a p value of 0.75, so that once again the Null cannot be rejected.

Correlation
Correlation deals with how closely one continuous variable, y, is related to a second continuous variable, x. The most commonly used measure of correlation is the Pearson correlation coefficient, r, which can vary from -1 (negative, perfect correlation) to +1 (positive, perfect correlation). When r = 0, there is no correlation. The Pearson correlation coefficient is closely related to linear regression and is calculated by relating values of y and x to their respective means. The Spearman correlation coefficient uses the same algorithm but is performed on the ranks of y and x rather than on the raw data, and it also has the same range and interpretation. Consider the following plot relating pre-operative serum PSA to prostate tumor volume as measured by morphometric technique in the prostatectomy specimen (Humphrey et al data).



Clearly, this plot demonstrates that the relationship between PSA and tumor volume is noisy. The Pearson correlation coefficient was 0.55, and the Spearman coefficient was 0.23. The square of the Pearson correlation coefficient estimates the fraction of the deviation explained by the regression relationship, and for this example it was just 30%.

Linear Regression
Linear regression analysis seeks to model the value of a continuous variable, y, in terms of one or more x variables, so that y can be expressed by the equation:

y = α + β1*x1 + β 2*x2 + β 3*x3 + … (50)


Here, α is an intercept coefficient and the β are weighting coefficients to be multiplied times the values of the respective x variables. The y variable can be a transformed value of y such as log(y), and the x variables can also be transformed. If we symbolize the right hand side of the above equation as f(α,β1, β2, β3, …), then the linear regression algorithm estimates the values of α,β1, β2, β3, … by using calculus and linear algebra to minimize the square of the deviations between y and f(α,β1, β2, β3, …) over all the patient observations. In other words, the algorithm estimates coefficients so that

Sum of Squared Deviations = Σ [yi – fi(α,β1, β2, β3, …)]2 (51)


is minimized across all the patients (indexed here by i). If the residual errors are normally distributed, then t statistics can be used to test the Nulls that the individual α,β1, β2, β3, … etc. are equal to 0. If the p value for any of these coefficients is low, then one rejects the Null and concludes that that particular x is significantly related to y.

For example, consider once again Humphrey et al.'s data relating tumor volume in the prostatectomy specimen to pre-operatively measured variables, including serum PSA, and which was illustrated in the last plot. Linear regression analysis demonstrated that factors in addition to PSA were important. PSA density and Gleason score did not relate to tumor volume (p>0.10), but the 4 pre-operatively measured variables in the following table combined to significantly relate to post-operatively measured tumor volume in the whole prostate.

Variable Coefficient t p value
PSA 0.202 7.0 0.000
Tumor Length 0.124 2.7 0.0076
Exp(no. Pos. Cores) 0.048 6.0 0.000
Age -0.0082 -2.0 0.049

The number of positive cores is abbreviated "no. Pos. Cores", and the reason the exponential transformation was used was because a preliminary plot of tumor volume versus the number of positive cores suggested an exponential relationship. The intercept parameter α was not significantly different from zero and so was dropped. The following histogram demonstrates that its residuals approximated a normal distribution, so that it appeared safe to rely on normal theory for the results.



This time, the combination of these variables explained 76% of the variation in the data, but this still left too much noise to encourage its use in a clinical setting, as demonstrated in the following plot. Here, tumor volume predicted by the model appears on the horizontal axis, and observed tumor volume appears on the vertical axis.



Because there is so much residual noise in the plot, it demonstrates that models and variables with low p values and high levels of significance may not be sufficient to produce a clinically useful model.

The General Linear Model (GLM)
The GLM is analogous to the linear model except that GLM does not assume that the residuals are normally distributed. Nevertheless, GLM does assume that the variance of the dependent random variable, y, is constant. The GLM has the same linear form as in linear regression:

Y = α + β1*x1 + β 2*x2 + β 3*x3 + … (52)


However, Y now represents a transformation of the originally observed y variable, and the y variables may be either continuous or categorical.

Instead of obtaining a least squares fit to the collected data of {y, x1, x2, … }, the GLM obtains estimates of α and β coefficients that maximize a likelihood function, L, or its natural logarithm ln(L). L is the sample distribution function and provides the probability of observing a particular sample of patients, their outcomes y, and their x1, x2, x3, … values, given particular values of the α and β coefficients (see equation 33). What GLM does then is to choose values of α and β so that this sample probability is maximized. The logic follows the argument that if we observed the collected sample, then its probability, L, must be relatively high. Consequently, we select values of α and β to maximize L. The exact form of L depends on the nature of y. GLM obtains its solutions for α and β coefficients through an iterative least squares fitting procedure.

The Null for the GLM is one without using any of the x variables (or equally all their β are zero), and the likelihood function for this Null is symbolized as Lo. To test the importance of one or more x variables we compare the final maximized L using the variables to the initial Lo without the variables. It turns out that -2*ln(Lo/L) has a chi-square distribution with the number of degrees of freedom equal to the number of x variables used in the model, and the result is termed the "likelihood ratio test" for testing the significance of one or more of the x variables, because it depends on the ratio Lo/L. The next model discussed, the logistic model, provides an example. Others can be found in the McCullach and Nelder text.

The Logistic Model
The logistic model deals with a dependent random variable, y, which is binary, that is, either 0 or 1, and the transformation of y which is considered linear is the natural logarithm of the odds as follows:

Log{odds (y=1) } = α + β1*x1 + β 2*x2 + β 3*x3 + … (53)


Because the odds (y=1) is defined in terms of probability P(y=1) as:

Odds (y=1) = P/(1 – P), (54)


the probability P can be written as:

P(y=1) = 1 / {1+exp(-E) } (55)


with E given as:

E = α + β1*x1 + β 2*x2 + β 3*x3 + … (56)


The logistic model can be used to test Null hypotheses that the x variables are not related to P(y=1) as well as to model the relationship between P(y=1) and the x variables; however, Peduzzi et al. suggest that for the logistic analysis to produce reliable results, the data should include at least 10 events for each x variable with event being defined as the smaller number of those with either y=1 or y=0.

Using the Logistic Model to Compare Two Diagnostic Tests
The following example uses the logistic model to test whether serum levels of Troponin I add diagnostic information to serum levels of CK-MB for the diagnosis of acute myocardial infarction (AMI). The data come from a report by Chang et al. (1998) and comprise 110 patients. They considered both serum tests as binary results, that is, as either negative or positive, and they reported the following table of results that included sensitivity, specificity, PPV and NPV as percentages.

Marker Sensitivity Specificity PPV NPV
CK-MB 92.7 89.9 84.4 95
Troponin 90.2 95.7 92.5 94.3
Both 100 88.4 83.7 100

Although these results suggest that using both markers increases sensitivity and NPV, the entries of this table do not provide us a statistical test of the importance of each marker for the diagnosis of AMI. Nor do they tell how much weight to give to each of these 2 tests. A logistic regression analysis on their reported data, however, provided greater information and produced the following table of results.

Marker Coefficient p value
CK-MB 3.47 ~0.000
Troponin 4.18 2.2x10-7

The very low p values tell us that both CK-MB and Troponin I provide significant information about the diagnosis of AMI and that Troponin adds information to what CK-MB provides. The magnitude of the coefficients suggests that both markers are weighted approximately equally for their impact on the diagnosis of AMI. Furthermore, the coefficients could lead to a diagnostic algorithm for applying to new patients. For example, these results suggest that the probability of AMI (i.e. P(AMI) ) should be given by:

P(y=1) = 1 / {1+exp(-E) } (57)


with E given as:

E = α + 3.41*CK-MB + 4.18*Troponin (58)


α would need to be adjusted to reflect the local overall prevalence of AMI in the local population of patients for which cardiac markers are found to be negative. For example, if that prevalence for those with negative CK-MB and Troponin I in the local emergency room is 0.02, then α would be calculated as:

α = ln (.02/.98) = - 3.89 (59)


Of course, before using such an approach, one would first need to validate it with a new set of study patients.



Survival Analysis
Whereas logistic regression analysis deals with a single binary outcome, survival analysis deals with two outcomes: a binary failure event like death and the time to the occurrence of that event. The data then consists of the following categories:

Failure Event is 1 if it occurred at the last observed time, otherwise it is 0

Time of last observation, T

Explanatory variables, x1, x2, …

The most commonly studied failure event in medicine is death, but any other binary failure event connected with time can also be studied, and alternatives to death include tumor recurrence, metastasis, and diagnosis of malignancy. In industrial settings the event might be time to failure of critical equipment or a strike. Furthermore, the events need not be restricted to what we ordinarily consider failures. For example, one could examine the event of cure and analyze the time to cure. Similarly, the usual T variable is time, but it could be another positive, continuously increasing variable.

We assume, for the moment, that each patient is observed continuously for times less than the last observed time. If the patient failed at the last observed time, then the value of the event is 1, and the patient is said to be uncensored. If the patient has not failed at the last time, then the value of event is 0, and the patient is said to be censored at the last time, or more specifically, right-censored. One of the great strengths of survival analysis is its ability to deal with censored patients, but there is also a cost. In general, the key results depend on and weight the uncensored patients, so that data with many censored observations provide limited results. For example, Concato and Peduzzi et al. warn that there should be at least 10 uncensored patients per explanatory x variable.

An example of survival data is the following table, which comes from recently published study by Vollmer et al. regarding the serum marker CYFRA 21-1 in patients with advanced stage non-small cell lung cancer. The purpose of this pilot study was to test whether a drop in serum CYFRA after the first round of chemotherapy would be associated with subsequent survival—if so, then the change in CYFRA might be an early measure of response to treatment. Three lines of the data are given as follows:

Stage Dead T (months)
IIIB 1 18.9
IV 1 3.1
IV 0 12.2

Here, the variable Dead was the binary event of observed death at the last time of observation, so that entries of "1" here mean that the patient was followed to death. An entry of "0" means that the patient was alive at last follow-up, and T gives the time of follow-up in months. 38 of their reported patients had been observed until death and so were uncensored. Potential explanatory variables included stage, clinical response to chemotherapy, initial serum level of CYFRA and the change in CYFRA after the first round of chemotherapy.

Survival analysis addresses the survival probability, S(t), which is defined as the probability that the time of failure, T, will exceed t. Specifically,

S(t) = P(T > t). (60)


The Kaplan-Meier method derives estimates of S(t) for each t, and the resulting plot gives the estimated S(t) versus t. For example, the following Kaplan-Meier plot provides S(t) vs. t for the CYFRA pilot study.



The estimate of S(t) is the dark, stairstep solid line and the lighter dashed lines provide upper and lower 95% confidence limits for the estimate. Characteristically, S(t) is 1.0 at t=0, and for increasing values of t, S(t) becomes progressively smaller. The vertical steps down are time points when one or more deaths occurred; and the vertical bars along the graph show time points when patients were censored, that is, when they were not followed any further. In general such censored patients are used for the denominator of the probability but never for the numerator. In this particular plot we see that after 20 months of follow-up the expected survival for these patients with advanced stage non-small cell carcinoma was zero, although a few were still alive. Also characteristic is the way the confidence limits increase as time increases, because for later times the number of observed events (i.e. uncensored patients) often becomes small.

What much of survival analysis does is to test whether certain sub-groups of patients, or their explanatory variables, account for different survival probabilities. For example, in the study of serum CYFRA in advanced non-small cell carcinoma, we wanted to see if changes in this serum marker after the first round of chemotherapy related to subsequent survival. If this happened, then such changes could constitute an early measure of response to treatment, that is, one which is known before all courses of the treatment are completed. The following four Kaplan-Meier plots demonstrate respectively how clinical stage (upper left), clinical response (upper right), initial level of CYFRA (lower left), and drop in CYFRA after the first treatment (lower right) affect survival.



These 4 curves demonstrate that whereas clinical stage and clinical response tend to produce overlapping Kaplan-Meier curves, the initial level of serum CYFRA and a drop in CYFRA of at least 27% after the first round of treatment tend to produce survival curves that separate from one another.

These 4 Kaplan-Meier curves represent a graphical univariate analysis of survival time. The most popular univariate statistical test that matches this graphical analysis is the log-rank test. For example, in the above 4 Kaplan-Meier plots, the patients were broken into 2 or 3 groups according to, respectively, clinical stage, clinical response, initial serum level of CYFRA, and magnitude of drop in CYFRA after first treatment. The log-rank test compares the observed survival for each of these groups against that predicted from the entire data without the groups (e.g. from the previous Kaplan-Meier plot for this study), and as before this comparison, summed over groups, results in a chi-square statistic. For the above 4 univariate comparisons the log-rank tests revealed p values of: > 0.2, 0.27, 0.17, and 0.21. Thus, by univariate analysis none of these 4 variables related to survival time.

The Calculus of Survival Probability and Hazard Function
The shapes of Kaplan-Meier survivial curves suggest an exponential type of function for S(t) like:

S(t) = exp(- h * t) (61)


where h is called the hazard function. For example, the following graph shows the plots of 3 S(t) curves with the 3 h values of 0.04, 0.085 and 0.20.



The middle plot shows a survival curve very close to the observed Kaplan-Meier curve for the CYFRA study, and its value of h is 0.085. The top curve has a lower value of h of 0.04, and the bottom curve has a higher value of h of 0.20. These 3 curves demonstrate a general phenomenon for survival curves. Namely, lower hazard values imply higher survival curves and longer survival times, and higher hazard values imply lower survival curves and shorter survival times.

Regardless of whether S(t) follows an exponential distribution like equation 61, the hazard function can be defined as a conditional probability, namely, the instantaneous probability of failure in a small time interval about t, given no failure up to time t. Specifically,


Thus, h is a differential. The numerator is the probability density of 1 – S(t) given survival to time t, so that using the definitions of S(t), a differential in calculus and conditional probabilities, h(t) can be rewritten as:


Integration of both sides implies that the survival probability in general relates to the hazard function as follows:


with u being a variable of integration and with the integration limits occurring from 0 to t.

The hazard function and the above details relating h to S(t) are important, when we want to perform an analysis of survival time that utilizes or tests multiple explanatory variables, although the software packages take care of the details of calculus for us. Probably the most popular method of multivariate survival analysis is the Cox model.

The Cox Model
Cox introduced his method in 1972, and now over 30 years later it is used uncountable times each year. The Cox model is designed to analyze survival time and to relate it to multiple explanatory variables, symbolized once again as x1, x2, …, etc. The model assumes that the hazard function for any patient is proportional to an unspecified baseline hazard, h0(t). Thus, the hazard, h(t), can be given by the following equation:

h(t) = h0(t) * exp( b1*x1 + b2*x2 + b3*x3 + …) (65)


The x1, x2, x3, etc. provide the values of the explanatory variables for the patient, and the b1, b2, b3, etc. are fixed coefficients, whose values are estimated by the Cox model software. Because the baseline hazard function, h0, is left unspecified, the Cox model is considered semi-parametric.

An important assumption of the routine Cox model is that neither the coefficients, b1, b2, b3, etc., nor the values of the explanatory variables, x1, x2, x3, etc., change with time. Thus, the exponential term in equation 65 can be removed from inside the integral and S(t) can be rewritten as:


Notice also that equation 63 can also be written using the natural logarithm as:

log(h/h0) = b1*x1 + b2*x2 + b3*x3 + … (67)


Although this looks like a linear regression equation, the exponential functions here and in the expression for survival distributions imply that this equation cannot be solved using straightforward least squares techniques. Instead, the Cox model obtains solutions for b1, b2, b3, etc. through an iterative process that maximizes a partial likelihood function. In practice this process has been found to be robust. Furthermore, with modern desk-top computers, the analysis executes in seconds. In addition to estimating the values of b1, b2, b3, etc., the Cox algorithm also estimates their standard errors, obtains p values for the Null hypotheses that their values are 0, and estimates the overall model likelihood ratio.

To illustrate, consider once again the CYFRA pilot study. Whereas, the four previously examined variables yielded relatively high p values by the log-rank test, the univariate approach examined just one variable at a time while leaving the other three uncontrolled. The following table demonstrates how these 4 variables related to hazard (and thus via equation 62 to S(t) ).

Variable Coefficient(b) se(b) p value
Stage IIIa ------ ----- 0.14
Stage IIIb ------ ----- 0.16
Clinical Response ------ ----- 0.18
CYFRA Level 0.61 0.15 0.00003
>27% Drop in CYFRA -1.14 0.41 0.005

(The coefficients and standard errors are omitted for the non-significant variables.) In other words after controlling for the varying mass of tumor with the surrogate variable of initial serum level of CYFRA, the study found that those whose CYFRA level dropped > 27% after the first round of chemotherapy experienced a significant improvement in survival. These results implied that chemotherapy was helping if CYFRA dropped by > 27%, but the results also implied that those who did not experience such a drop were at a disadvantage.

One way to utilize the results of a Cox analysis with more than one significant explanatory variable is to form a composite hazard score (HS). For example, the results of the CYFRA study suggest the following HS.

HS = 0.61*log(CYFRA) – 1.14* R (68)


where CYFRA is the serum level before the first treatment and where R is a dummy variable that is 1 if there was > 27% drop in CYFRA after the first treatment and otherwise 0, and where the coefficients of the Cox model are multiplied times these variables. Thus, HS comprises information about the mass of tumor as well as about how the tumor initially responds to chemotherapy. Using the cutpoint of HS=0.284, the following Kaplan-Meier plot shows the expected survival for those with higher HS (lower curve) vs. those with lower HS (upper curve).



For the patients in the group with lower HS, continuing the chemotherapy to additional courses seems prudent, because they appear to benefit. However, for the patients in the higher HS group, one should consider either discontinuing or changing the chemotherapy.

Testing the Assumptions of the Cox Model
One of the assumptions made in the routine Cox model is that the proportionality in the hazard does not change with time, that is, neither the coefficients nor the explanatory variables change with time in equation 65. Thus, it is prudent to see if this assumption holds. In S-PLUS this can be done both visually as well as with a chi-square statistic using the cox.zph function. Nevertheless, Therneau and Grambsch suggest that the routine Cox model often yields adequate models even when its assumptions do not strictly hold. If the assumptions are violated more significantly, then one may have to use a Cox model with time-dependent covariates, an alternative time scale, or one of several parametric models such as the exponential or Weibull models.

Portability of Cox Model
If one has developed a Cox model using large numbers of patients and has validated it with new data, then it is reasonable to use the model to predict outcomes in new patients. The proportional hazard model implies that the survival probability for patient no. 1, i.e. S1, relates to the baseline survival function S0 as follows:

S1 = S0(h1/h0) (69)


where h1 is the hazard for patient no. 1, h0 is the baseline hazard and (h1/h0) is the exponent to which S0 is raised. However, the same relationship also holds for any other patient, arbitrarily designated here as patient no. 2, so that

S2 = S0(h2/h0) (70)


But equation 68 can be re-written as:

S2 = S0(h1/h0)*(h2/h1) (71)


implying that

S2 = S1(h2/h1) (72)


Equation 70 says that if one knows the survival for patient no. 1, then the survival for patient no. 2 can be calculated from the ratio of their hazards. This ratio can be evaluated using equation 65 as follows. If there are 3 prognostic variables with values of y1, y2, y3 in patient 2 and x1, x2, and x3 in patient 1, then

log(h2/h1) = b1*(y1 - x1) + b2*(y2 - x2) + b3*(y3 - x3) (73)


Thus, the Cox model allows one to estimate survival probability, S2, in a new patient so long as the coefficients b1, b2 and b3 are known, S1 is known and the 3 key variables are known in the original as well as in the new patient.

The following example illustrates this approach. In 1985 Soong published details of a Cox model for malignant melanoma (see Soong, 1985). For those of pathologic stage I and who had not had elective lymph node dissection, he reported that three variables were important to survival: tumor thickness, ulceration and tumor location (axial vs. other). He gave the coefficients for these 3 variables, and he also supplied 4 survival curves for 4 prototypic patients, whose values of thickness, ulceration and tumor location were also reported.

In 1986 Phillips et al. reported their observations on 21 cases of minimal deviation melanoma. Their patients tumor thickness ranged from 1.6 to 10.4 mm, and the follow-up ranged from 18 to 96 months. A Kaplan-Meier survival curve of their data is given below.



The question of interest is whether the observed long survival times for these 21 patients with minimal deviation melanoma also occur for patients with ordinary melanoma. To evaluate this question, we used Soong's model, because both sets of data included the tumor thickness, presence of ulceration and location of tumor. Soong's reported coefficients for thickness, ulceration and axial tumor location were respectively 0.4844, 0.8524 and 0.5004. We then matched each of Phillips' 21 patients to one of Soong's prototypic patients with survival curves so that gender and age were close. We calculated the ratio h2/h1 using equation 71 and the observed values of tumor thickness, ulceration and tumor location for both the new patient and the prototypic patient. Finally, we calculated the expected probability of surviving to the observed last time of follow-up for each of the minimal deviation melanoma patients using equation 70. All this analysis assumed that the 21 patients had ordinary melanoma.

The result was that the probability of surviving to the time of follow-up in most of these patients was high assuming their survival was like ordinary melanoma. The mean probability of surviving to the observed time was 0.65 (range from 0.2 to 0.88), and 16 of the survival times (76%) had better than even odds of being observed if the tumors were ordinary melanoma. Just 3 of survival times were less probable for ordinary melanoma, because their survival probabilities were 0.20, 0.27 and 0.32. Thus, the majority of reported patients with minimal deviation melanoma had a survival time much like ordinary melanoma, and this result does not support the hypothesis that minimal deviation melanoma has a better survival time than ordinary melanoma.

Laboratory Test Turnaround Time (TAT)
Clearly, one of the key variables we monitor in the laboratory is the turnaround time (TAT), that is, the time lapse between receiving a specimen and reporting its results. Like survival time, TAT is always greater or equal to zero. Unlike the situation in survival time, we view the reporting of a test as a positive event, not as an event of failure. Nevertheless, the approaches used to study survival times can be applied to TAT if we simply view the time of reporting the test as the end of its "life" in the laboratory.

Consider an example of TAT for surgical pathology cases during a two month period of time. The Kaplan-Meier plot for this data appears below:



One can then study and plot the TAT by method, or by pathologist. The following is a Kaplan-Meier plot of TAT in surgical pathology for two different pathologists:



One had a mean TAT of 1.25 days, and the other had a mean TAT of 1.5 days, a difference which was significant (p=0.00006 by log-rank test) but which was of little importance in magnitude.

Analysis of Lymph Nodes in Colon and Rectal Carcinoma
It has now been well established that patients with lymph node metastasis in colorectal carcinoma do worse and benefit significantly from chemotherapy. In this regard pathologists know that it is important to find as many nodes as possible in excised specimens of colon and rectum, because the more nodes one examines, the more likely some will be found to be positive. This is simple logic, and this notion is supported by recent observations that among those patients with N0 stage (i.e. all nodes were found to be negative) those with fewer nodes examined survived a shorter period of time. The result implies that when few nodes are found, there is a significant chance that metastases are missed. In other words, the stage assignment of N0 in such patients is misleading. Although the number of 13 nodes has been repeatedly recommended, we know that sometimes there are not 13 nodes present, no matter how carefully we dissect the specimen. In fact, the number of lymph nodes found not only depends upon the pathologist. It clearly also depends on how much tissue the surgeon removes. The question to consider is whether a probability analysis can help us recognize N0 patients who are at risk of having metastases that were not found.

The Poisson probability function provides a theoretical model that is appropriate for counting positive lymph nodes in colorectal specimens. Specifically, it gives us an equation for the probability P of finding x nodes positive when n total were examined. This equation is given as:


Here, x! stands for x factorial. α gives the underlying probability that any node is positive for tumor, and it can be estimated from the relative frequency of nodes that are positive. Undoubtedly, α reflects individualistic factors involving the patient and most likely their T stage of tumor. For example, using 157 patients with colorectal carcinoma and lymph nodal metastases, we found that the estimated value of α was significantly related to T stage (p=1.4x10-9). The values of α for T stages T1, T2, and > T2 were respectively 0.091, 0.125, and 0.304. The following plot illustrates how the Poisson function matches our intuitive understanding because it demonstrates that the probability of finding at least one node positive for tumor rises with the number of lymph nodes examined.



If α > 0, then the patient is presumed to have metastases, whether or not we find them. If, by contrast, α = 0, then the probability of finding metastases is 0, regardless of how many lymph nodes we examine. Thus, α provides the underlying, hidden state of nodal metastases for any given patient. Although , α is hidden, we can observe the number of positive nodes, x, and the total number of examined nodes, n, and we can also estimate α as the ratio of x to n.

Suppose that for a given patient, α > 0, that is, there are some lymph node metastases present. The Poisson model predicts that the probability of finding no metastases in this situation, that is, the probability of finding a false N0 state decreases as the total number of examined nodes increases. This result fits our intuitive understanding and is illustrated in the following plot.



In these ways the Poisson probability function seems appropriate for understanding some aspects of lymph nodal in colorectal carcinoma; however, the next issue to consider is how to use the Poisson model when we have discovered no nodes positive after examining n nodes. Here, the Bayes' theorem can help.

In this situation, the Bayes theorem will be used to estimate the conditional probability that the patient has nodal metastases when, in fact, none were found. Specifically, we use the Bayes theorem here to calculate the conditional probability that α>0 (remembering that when α>0 metastases are present), when x = 0. The Bayes theorem is then written as:


P(Mets) is the a priori probability of lymph nodal metastases, which can be estimated from prior studies of colorectal carcinoma, or from a current set of data. The Poisson formula (equation 74) provides P(x=0|α>0), and the Poisson equation also indicates that P(x=0|α=0) is 1.0. In this way the Bayes formula can be simplified to:


We used a study set of 213 patients with colorectal carcinoma and negative lymph nodes to test the importance of the Bayes formula for P(Mets|x=0). We estimated the odds of no metastases from our data according to T stage and found that these odds were respectively 5, 3.24 and 0.916 for T stages T1, T2 and > T2. We used the T stage specific values for α as before, and then calculated values of P(Mets|x=0) for each patient. These values ranged from 0.0 to 0.45 with a median of 0.10. A Cox survival model analysis demonstrated that the calculated P(Mets|x=0) was significantly related to overall survival (p=0.0006) and that T stage contributed no further prognostic information (p>0.8). The following plot demonstrates that those with P(Mets|x=0) > 0.16 had a shorter survival time (median 5 years vs. 8 years for the others).



In this fashion the combination of the Poisson model and the Bayes model provide a tool that can segregate those with apparent N0 status into two groups: one with lower and one with higher risk for earlier death.

References
    Probabilities
  1. Billingsley P. Probability and measure. New York. John Wiley & Sons. 1995.

  2. Friestedt B, Gray L. A modern approach to probability theory. 3rd Ed. Boston. Birkauser. 1997.

  3. Vecchio Tj. Predictive value of a single diagnostic test in unselected populations. NEJM 1966;274:1171-1173.

  4. Galen RS, Gambino SR. Beyond normality. The predictive value and efficiency of medical diagnosis. New York. John Wiley & Sons. 1975.

  5. Beck JR, Shultz EK. The use of the relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 1986;110:13-20.
    General Statistics
  1. Casella G, Berger RL. Statistical inference, 2nd Ed. Pacific Grove, CA. Duxbury. 2002.

  2. McCullagh P, Nelder JA. Generalized linear models. 2nd Ed. London. Chapman and Hall. 1989.

  3. Venables WN, Ripley BD. Modern applied statistics with S-PLUS. 3rd Ed. New York. Springer. 1999.
    Logistic Model
  1. Hosmer DW, Lemeshow S. Applied logistic regression. 2nd Ed. New York. John Wiley & Sons. 2000.

  2. Harrell FE, Jr. Regression modeling strategies. With applications to linear models, logistic regression, and survival analysis. New York. Springer. 2001.

  3. Peduzzi P, Concato J, Kemper E, et al. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373-1379.
    Survival Analysis
  1. Cox DR, Oakes D. Analysis of survival data. London. Chapman and Hall. 1984.

  2. Therneau TM, Grambsch PM. Modeling survival data. Extending the Cox model. New York. Springer. 2000.

  3. Hosmer DW, Lemeshow S. Applied survival analysis. New York. John Wiley & Sons. 1999.

  4. Harrell FE, Jr. Regression modeling strategies. With applications to linear models, logistic regression, and survival analysis. New York. Springer. 2001.

  5. Concato J, Peduzzi P, Holford TR, Feinstein AR. Importance of events per independent variable in proportional hazards analysis. I. Background, goals, and general strategy. J Clin Epidemiol 1995;48:1495-1501.

  6. Peduzzi P, Concato J, Holford TR, Feinstein AR. Importance of events per independent variable in proportional hazards analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol 1995;48:1503-1510.
    Data Used in Analyses
  1. Babaian RJ, Mettlin C, Kane R, et al. The relationship of prostate-specific antigen to digital rectal examination and transrectal ultrasonography. Cancer 1992;69:1195-1200.

  2. Catalona WJ, Hudson MA, Scardino PT, et al. Selection of optimal prostate specific antigen cutoffs for early detection of prostate cancer: receiver operating characteristic curves. J Urol 1994;152:2037-2042.

  3. Brawer MK, Aramburu EAG, Chen GL, et al. The inability of prostate specific antigen to enhance the predictive value of prostate specific antigen in the diagnosis of prostatic carcinoma. J Urol 1993;150:369-373.

  4. Morgan TO, Jacobsen SJ, McCarthy WF, et al. Age-specific reference ranges for serum prostate-specific antigen in Black men. NEJM 1996;335:304-310.

  5. Van der Poel HG, Boon ME, van Statum P, et al. Conventional bladder wash cytology performed by four experts versus quantitative image analysis. Mod Pathol 1997;10:976-982.

  6. Humphrey PA, Baty J, Keetch D. Relationship between serum prostate specific antigen, needle biopsy findings, and histopathologic features of prostatic carcinoma in radical prostatectomy tissues. Cancer 1995;75:1842-1849.

  7. Chang C-C, Ip MPC, Hsu RM, Vrobel T. Evaluation of a proposed panel of cardiac markers for the diagnosis of acute myocardial infarction in patients with atraumatic chest pain. Arch Pathol Lab Med. 1998;122:320-324.

  8. Vollmer RT, Govindan R, Graziano SL, et al. Serum CYFRA 21-1 in advanced stage non-small cell lung cancer: an early measure of response. Clin. Cancer Research 2003;9:1728-1733.

  9. Soong S-J. A computerized mathematical model and scoring system for predicting outcome in melanoma patients. in Balch CM, Milton GW. Cutaneous melanoma. Philadelphia. JB Lippincott Co. 1985. chapt. 20. pp 353-371.

  10. Phillips ME, Margolis RJ, Merot Y, et al. The spectrum of minimal deviation melanoma: a clinicopathologic study of 21 patients. Hum Pathol 1986;17:796-806.