Normality and the Parametric Paradigm

Larry H. Bernstein, MD

plbern@yahoo.com

This article is about measures of central tendency and the dispersion of values around the center (mean or median), which underpin the parametric methods used to compare 2 or more sets of data. More importantly, it is the beginning of a statistical journey. The clinical laboratory deals with large volumes of patient data. The parametric approach has limited use there and is prone to problems introduced in the clinical domain. Consequently, Galen and Gambino introduced the concept of predictive value and the effect of prevalence in a Bayesian context in Beyond Normality. These calculations work from tables, the same tables that are used for sensitivity and specificity and for calculating a chi squared probability. The subsequent influence of epidemiology went further in introducing odds and odds ratios. The third and last article will address the more recent advances that go beyond Beyond Normality. These improvements have all come about through the development of a powerful statistical methodology that is not constrained by the parametric paradigm and is well developed for hypothesis generation and validation, not just the testing of simple hypotheses.

We have grown up with the normal curve and have incorporated it in our thinking, not just our work. Even the term Six Sigma, used for the reduction of errors, refers to the classical “normal curve” introduced by Johann Carl Friedrich Gauss (1777–1855). The normal or “bell shaped” curve is a plot of numerical values along the x-axis and the frequency of their occurrence on the y-axis. If the measurements occur as random and independent events, we refer to this as parametric, and the distribution of the values is a bell shaped curve with the mean or arithmetic average at the center, with about 68% of the sample contained within 1 standard deviation of the mean, and with all but 2.5% of the values at each end (5% in total) contained within about 2 standard deviations of the mean. The reference to normality has been used with respect to student test scores, coin flipping and games of chance, investment, and, in our experience, errors of quality controlled measurements. The expected value we refer to as the mean (closest to the true value), and the distance from the mean (or scatter) we refer to as dispersion, measured as the standard deviation. Viewed in this light, we can convert the curve from one with an actual mean to a standard normal curve with a mean of “0” at the center, and with distances from “0” expressed in standard deviations. An example that fits this model poorly is the distribution of serum AST measurements of a large unselected population enrolled in a clinical trial. The AST distribution has many high values, which we call skewness to the right, so the behavior we are looking for is better described after a log transformation of the values, which reduces the skewness. This is illustrated by the comparison of AST and log(AST) in Figure 1.
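
The properties described above can be checked numerically. The following is a minimal sketch using only the Python standard library. The "AST-like" values are simulated from a lognormal distribution chosen purely for illustration (the center of 25 and spread of 0.6 are assumptions, not values from the trial), and the helper names coverage, skewness, and z_scores are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def coverage(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    z = NormalDist()  # standard normal curve: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)


def skewness(values):
    """Moment-based sample skewness (near 0 for a symmetric distribution)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def z_scores(values):
    """Re-express values as distances from the mean in SD units."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Simulated right-skewed values (an assumption for illustration, not patient data).
rng = random.Random(0)
simulated_ast = [rng.lognormvariate(math.log(25), 0.6) for _ in range(5000)]
log_ast = [math.log(v) for v in simulated_ast]

print(f"within 1 SD:    {coverage(1):.4f}")
print(f"within 1.96 SD: {coverage(1.96):.4f}")
print(f"skewness raw:   {skewness(simulated_ast):.2f}")
print(f"skewness log:   {skewness(log_ast):.2f}")
```

In this sketch the raw values show a clear positive skewness, while the log-transformed values are close to symmetric, which is the behavior the AST example describes.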

What has not been said is that we view a reference range in terms of a homogeneous population. This means that while the values are not all the same, they are scattered around the mean and become less frequent as the distance from the mean increases, so that we can describe a mean and an interval around the mean that contains 95% of the values. In mineralogy we can measure physical elements whose structure is defined by its relationship to spectral lines. Hence, the scatter about the mean is very small because of the precise measurements, even though the quantity may be very small. This is not necessarily the case with clinical laboratory measurement because of hidden variables such as age, diurnal variation, racial factors, and disease. One way to level the playing field is to compose uniform specimens for quality control that are representative of a population, so that laboratory measurements can be compared among many laboratories, which is established practice. What is assumed is that a “normal” population is the population that remains after we remove bias, or contamination of the population by the hidden variable effects mentioned above. Therefore, parametric statistics actually compares one or more populations with the hypothetical normal population. The test of significance is a comparison of A and B under the assumption that they are sampled from the same population. When they are found to have different means and confidence intervals by a t-test or an analysis of variance, we reject the “null hypothesis” and conclude that they are different, based on a p value (significance) of less than 5%. There are basic assumptions that are required when we use the parametric paradigm: the samples have the same distribution, that distribution is normal, the variances are the same, and the errors are independent.
Consequently, when comparing 2 samples, as for a placebo and a test drug, these assumptions must hold (which is inherent in the logistic regression). When we run quality control material, the confidence lines that we use are equivalent to a normal curve turned on its side. When doing the t-test, the parametric limitations have to be followed. One consequence is that a minimum of about 40 samples is required, because as N reaches 40 and beyond, the data are more likely to fit a normal distribution. This is a daily phenomenon in laboratories globally: it takes about 10 to 14 days to be confident about the reference range for a new lot of quality control material, whether the level is high, low, or normal. Nevertheless, we have to ask whether we can use a small sample size to validate the reference range of a population sample. The answer is not so simple. One can minimize sampling bias by taking a sample of blood donors who are prescreened for serious medical conditions. The use of laboratory staff as donors historically introduced selection bias when the staff was uniformly younger. On the other hand, the amount of computing power readily available to the average practitioner has substantially improved in the last 5 years, and middleware may offer a further opportunity for improvement. One can download a file with two weeks of results for any test, review it, and exclude outliers relative to the established values for the method. The substantial remaining sample has at least 1,000 patients to work with. Another method would use a nonparametric adjustment of the data by randomly removing one patient at a time and recalculating. Here we are not concerned with distributional assumptions and population parameters. We work only with the data, and we observe the effects of recalculation. That is an uncommon and unfamiliar approach.
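
The idea of removing one patient at a time and recalculating can be sketched as follows. This is one possible reading of that approach, a leave-one-out (jackknife-style) recalculation of a percentile-based reference interval, and it is not necessarily the procedure the author has in mind. For simplicity each result is removed in turn rather than at random. The data are simulated, and the function names are hypothetical.

```python
import random


def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in 0..100) of an already sorted list."""
    if not sorted_values:
        raise ValueError("no data")
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (rank - lo)


def reference_interval(values):
    """Central 95% of the data, with no assumption about its distribution."""
    s = sorted(values)
    return percentile(s, 2.5), percentile(s, 97.5)


def leave_one_out_intervals(values):
    """Recompute the interval with each observation removed in turn."""
    return [
        reference_interval(values[:i] + values[i + 1:])
        for i in range(len(values))
    ]


# Simulated results standing in for a downloaded file (an assumption, not real data).
rng = random.Random(1)
results = [rng.gauss(100, 10) for _ in range(500)]

low, high = reference_interval(results)
loo = leave_one_out_intervals(results)
lower_limits = [l for l, _ in loo]
upper_limits = [h for _, h in loo]

print(f"interval from all results: {low:.1f} to {high:.1f}")
print(f"lower limit varies from {min(lower_limits):.1f} to {max(lower_limits):.1f}")
print(f"upper limit varies from {min(upper_limits):.1f} to {max(upper_limits):.1f}")
```

How much the limits move when a single result is dropped gives a direct, assumption-free impression of how stable the interval is for the sample at hand.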

We proceed to the important problem of comparing 2 variables. Figure 1 is a bivariate plot of data with log(AST) and log(ALT) on the two axes. The result is a scattergram with 95 and 99 percent confidence limits for a reference range formed from two liver tests that meet the parametric constraints. A scattergram such as the one shown in Figure 2 may show correlation between two methods, A and B, that are distinctly different but have a linear association between them. The parametric assumption holds, and the confidence interval along the so called regression line is determined by ordinary least squares (OLS) regression. Regression is worthy of separate treatment.
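
As a small illustration of the regression line mentioned above, the sketch below fits an ordinary least squares line to paired results from two methods. The paired values are invented for the example, and ols_fit is a hypothetical helper, not a function from any statistics package.

```python
import math
from statistics import mean


def ols_fit(x, y):
    """Ordinary least squares fit of y on x: returns (slope, intercept, r)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Invented paired results for method A and method B (illustration only).
method_a = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
method_b = [12.1, 21.9, 32.2, 41.8, 52.0, 62.3]

slope, intercept, r = ols_fit(method_a, method_b)
print(f"B = {slope:.3f} * A + {intercept:.3f}, r = {r:.4f}")
```

A slope near 1 with a nonzero intercept would suggest a constant offset between the methods, while a slope away from 1 would suggest a proportional difference.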

The next topic is comparing two classes of subjects that we expect to be different because of effects on each group. This can be represented by the plot of means and standard deviations for patients with ovarian cancer who underwent chemotherapy and either had no or only a short remission, or had a remission of 20 months, which defined treatment success (Figure 2). The comparison of means is significant at p < 0.01 using the t-test (Figure 3). But what if we were to take the same data and compare the patients with no remission, short remission, and complete remission? One would do the one-way analysis of variance (ANOVA1), which uses the F test (Fisher’s variance ratio). When only two groups are compared, F is the same as t squared, or t is the square root of F. The result would again be significant at p < 0.01.
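
The relationship between the t-test and the F test can be verified directly. The sketch below computes the equal-variance (pooled) t statistic and the one-way ANOVA F statistic from first principles, and confirms that F equals t squared when there are exactly two groups. The group values are invented and do not come from the ovarian cancer data. The code returns test statistics only, since converting them to p values requires the t and F distributions, which are not in the Python standard library.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))


def one_way_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((v - mean(g)) ** 2 for v in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented marker values for three remission classes (illustration only).
no_remission = [12.0, 15.5, 11.2, 14.8, 13.1, 16.0]
short_remission = [15.2, 17.1, 14.9, 18.0, 16.4, 15.8]
long_remission = [18.4, 21.0, 19.7, 17.9, 22.3, 20.5]

t = pooled_t(no_remission, long_remission)
f_two = one_way_f(no_remission, long_remission)
f_three = one_way_f(no_remission, short_remission, long_remission)

print(f"t = {t:.3f}, t squared = {t * t:.3f}, F (two groups) = {f_two:.3f}")
print(f"F (three groups) = {f_three:.3f}")
```

With three groups there is no single t statistic to square, which is why the analysis of variance is the natural extension of the two-group comparison.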

This is a light review of very important methods used in both clinical and research laboratory studies. They have a history of widespread use going back at least 5 decades, and certainly in experimental physics before biology, although it is from biological observations that we have Fisher’s discriminant function, which gives a linear distance between classes based on measured variables, e.g., petal length and petal width. The discussion to follow will be concerned with tables and the chi squared distribution.