
Regression: A richly textured method for comparison and classification of predictor variables

 The multivariable Case

Larry H. Bernstein, MD

 e-mail: plbern@yahoo.com.

Keywords:  bias correction, chi square, linear regression, logistic regression, loglinear analysis, multivariable regression, normal distribution, odds ratio, ordinal regression, regression methods


Multivariate statistical analysis extends this analysis to two or more predictors. In this case a multiple linear regression or a linear discriminant function would be used to predict a dependent variable from two or more independent variables. If there is a linear association, dependency of the variables is assumed, and the test of hypotheses requires that the variances of the predictors are normally distributed. A log-linear model circumvents the problem of distributional dependency in a method called ordinal regression. There is also a relationship to analysis of variance, a method of examining differences between the means of two or more groups. Then there is linear discriminant analysis, a method by which we examine the linear separation between groups rather than the linear association between groups. Finally, the neural network is a nonlinear, nonparametric model for classifying data with several variables into distinct classes. In this case we might imagine a curved line drawn around the groups to divide the classes. The focus of this discussion will be the use of linear regression, followed by an exploration of other methods for classification purposes.


Multivariate statistical analysis extends regression analysis and introduces combinatorial analysis for two or more predictors. Multiple linear regression or a linear discriminant function would be used to predict a dependent variable from two or more independent variables. If there is a linear association, dependency of the variables is assumed, and the test of hypotheses requires that the variances of the predictors are normally distributed. Linear discriminant analysis examines the linear separation between groups rather than the linear association between groups, and it also requires adherence to distributional assumptions. Analysis of variance, a method of examining differences between the means of two or more groups, is related as a special case of linear regression. A log-linear model circumvents the problem of distributional dependency in a method called ordinal regression. Finally, the neural network is a nonlinear, nonparametric model for classifying data with several variables into distinct classes. In this case we might imagine a curved line drawn around the groups to divide the classes.

Regression analysis.

The use of linear regression, linear discriminant analysis and analysis of variance has to meet the following assumptions:

The variables compared are assumed to be independent measurements.

The correlation coefficient is a useful measure of the strength of the relationship between two variables only when the variables are linearly related.

The correlation coefficient is bounded in absolute value by 1.

All points on a straight line imply correlation of 1.

Correlation of 1 implies all points are on a straight line.

A high correlation coefficient between two variables does not imply a direct effect of one variable on another (both may be influenced by a hidden explanatory variable).

The correlation coefficient is invariant to linear transformations of either variable.

The correlation coefficient is also expressed as the covariance or product of the deviations of X and Y from their means standardized by dividing by the respective standard deviations.

These assumptions may be valid if the amount of data compared is very large, and if the data is parametric.  This is not necessarily the case.  There are also special applications in laboratory evaluations and crossover studies between methods and instruments that require correction for bias or for differences in the error variance term.
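The properties of the correlation coefficient listed above can be checked numerically. The following is a minimal sketch in Python, using made-up paired values (not data from any study cited here) and a hypothetical helper named pearson_r. It computes r as the sum of products of deviations standardized by the spread of each variable, and assumes that the invariance property refers to linear transformations with a positive slope (a negative slope flips the sign of r).

```python
import math

def pearson_r(x, y):
    """Pearson r: covariance of x and y standardized by their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    sum_products = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    return sum_products / math.sqrt(ss_x * ss_y)

# Hypothetical paired measurements (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
# Rescaling x linearly (positive slope) leaves r unchanged.
r_rescaled = pearson_r([10.0 * xi + 3.0 for xi in x], y)
# Points lying exactly on a rising straight line give r = 1.
r_line = pearson_r(x, [2.0 * xi + 1.0 for xi in x])
```

The three results illustrate the bound |r| ≤ 1, the invariance to rescaling, and the straight-line case.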

How do we measure the difference, if there is any? We use the t-test (19, 21). If t is small, then the null hypothesis is satisfied and no difference is detected in the means. The conclusion is that the null hypothesis is accepted and the means are essentially the same. However, the ability to accept or reject the null hypothesis is dependent on sample size, or power. If the null hypothesis is rejected, bias has to be suspected. This is useful when analyzing certain data, where the results of ordinary least squares (OLS) are unsatisfactory. This test is here applied to linearly increasing values of Y on X measured by methods A and B. Of course the measurements are plotted and a line is fitted to the scatterplot. OLS gives the fit of the line based on the least squares error, where the slope of the line is given by (20, 22):

$$B = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

It is assumed that there are n pairs of values of x and y, and xi and yi denote the ith pair of values. The slope defines the regression line of y on x. An intercept that differs from zero is the bias. It is worthwhile to mention that there is a difference here between the correlation measurement and the least squares fit of y on x. We are measuring X by methods A and B. We can then determine that there is a linear association with r valued between 0 and 1 (negative values excluded). In the case of the regression model, we are predicting B from A by plotting B on y from A on x. Of course, experimentally, we are expecting the prediction to hold over a range of measurements, and the agreement drops off at some value of the coordinates (xi, yi).
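As a minimal sketch of the slope formula above, the following Python function (ols_fit is a hypothetical name) computes the least squares slope and takes the intercept as mean(y) minus slope times mean(x), the standard companion formula. The method A and method B values are invented so that B reads a constant 1.5 units higher than A; the fitted intercept then recovers that bias.

```python
def ols_fit(x, y):
    """Least squares slope and intercept for the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    # Slope: sum of (xi - mean x) * yi divided by sum of (xi - mean x)^2.
    slope = sum((xi - mean_x) * yi for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical method comparison: B reads 1.5 units high across the range.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [a + 1.5 for a in method_a]

slope, intercept = ols_fit(method_a, method_b)
```

With these values the slope is 1 and the intercept is 1.5, that is, perfect proportional agreement with a constant bias.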

Multiple regression is an extension of linear regression where the dependent variable is predicted by several independent variables.

In this case, the extended equation is (23)

$$Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$

The model assumes a linear relationship between many predictor variables and the dependent variable.  The model usually assumes that the independent variables are not correlated with each other, which may not be the case.  The model can be tested by stepwise removal of predictor variables to assess their contribution to the model.     The model is  considered to be parametric, and so it requires that the inputs are normally distributed.  The bs (or betas) are also partial correlation coefficients.  The partial F test is the measure of the contribution of each variable after all the variables are in the equation.
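The coefficients of the extended equation are usually obtained from the normal equations, B = (X'X)-1X'Y, the form that also appears in the output for Table II. The sketch below solves those equations in plain Python with Gauss-Jordan elimination. The helper names and the five data rows are assumptions made for illustration; the response is generated without noise from known coefficients so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def multiple_regression(rows, y):
    """Return [b0, b1, ..., bn] from the normal equations (X'X) b = X'y."""
    x = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical predictors; the response is built from b0=2.0, b1=0.5, b2=-1.5.
predictors = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
response = [2.0 + 0.5 * x1 - 1.5 * x2 for x1, x2 in predictors]

coefficients = multiple_regression(predictors, response)
```

When predictors are strongly correlated with each other, X'X becomes nearly singular and this direct solution becomes unstable, which is one practical reason the assumption of uncorrelated predictors matters.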

Figures 1-3 are scatterplots of eGFR (glomerular filtration rate calculated by MDRD equation) and of hemoglobin vs Nt-proBNP, and a boxplot of Nt-proBNP by WHO criteria for anemia. Figure 2 is a 3D plot of NT-proBNP spliced by eGFR and hemoglobin.  The linear regression model is presented in Table I.  The correlation coefficient (R ) for the model is weak, but not insignificant. What do you think is the effect of the large variance in the dependent variable?   Figure 3 is a 3D plot of the eGFR and hemoglobin vs a transformed variable – age normalized 1000*Log(Nt-proBNP)/eGFR.  The variance is reduced on the transformed variable.  Table II is the regression model on the data.  The correlation coefficient R is improved.

Analysis of variance (ANOVA) and Analysis of covariance (ANCOVA)

ANOVA is used if the dependent variable is continuous and all of the independent variables are categorical. One-way ANOVA is used for a single independent variable, and multi-way ANOVA is used for multiple independent variables. The ANOVA is based on the general linear model. The F-test is used to compare the difference between the means of groups. The independent variable has discrete values and is not used as a measure. The t-test can be used between each pair in the groups. The goal of ANOVA is to explain the total variation found in the study. An example of this application is shown in Figure 4.

Figure 4.  BNP determined within ejection fraction above or within 40

Figure 5 shows the means and 95% confidence intervals for a comparison of D-dimer and positive or negative venous duplex scans. There are only two variables, so the corresponding ANOVA is one-way. The F-value is high and corresponds to a high t in the t-test. F is the same as t² and p = 0.0001 (Table III). Our interest here is in multiple variables, so we’ll hold the discussion of difference testing between two variables.
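The identity between F and the square of t for two groups can be verified directly. This is a minimal sketch with invented D-dimer-like values for two small groups (not the study data); pooled_t and one_way_f are hypothetical helper names, and the t statistic is the pooled-variance form, which is the one the identity holds for.

```python
import math

def pooled_t(g1, g2):
    """Two-sample t statistic with pooled variance."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((v - m1) ** 2 for v in g1)
    ss2 = sum((v - m2) ** 2 for v in g2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))

def one_way_f(groups):
    """One-way ANOVA F ratio: between-group over within-group mean square."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical values for negative and positive scans (illustrative only).
negative_scan = [0.4, 0.6, 0.5, 0.9, 0.7]
positive_scan = [1.8, 2.4, 1.1, 2.9, 2.0, 1.6]

t_value = pooled_t(negative_scan, positive_scan)
f_value = one_way_f([negative_scan, positive_scan])
```

The same relation can be seen in the regression output of Table II, where the squared t for each coefficient matches its F-ratio (for example, 13.039 squared is about 170.0).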

Figure 5.

Table III.

If some of the independent variables are categorical (nominal, ordinal or dichotomous) and some are continuous, ANCOVA is used. The ANCOVA procedure first adjusts the dependent variable on the basis of the continuous independent variable and then does ANOVA on the adjusted dependent variable.

Generalized linear and generalized additive models

Generalized linear models transform the response by assuming that a transformation of the expected response is a linear function of the predictor variables.   The variance of the response is a function of the mean response.   When the relationship between the parameters is not linear, a generalized linear model can’t be used.   A generalized additive model can be used to fit nonlinear data-dependent functions of the predictor.   Tree-based models are used for exploratory analysis and are related to clustering, which is a method for studying the structure of the data, creating clusters of data with similar characteristics.

Discriminant analysis

Discriminant analysis is a modification of the general linear regression model. The method is used to assign data to one of several distinct classes, which serve as the dependent variable. The linear regression model predicts based on a linear relationship between the dependent and the independent variables. They are codependent. In the discriminant function they are independent. The function determines a separation between the classes to which the data assigns patients. The goal is to assign a new incoming patient, based on the independent variables, to one of the different groups. The mathematical function can be linear, quadratic, or another function. Stepwise linear regression with removal or addition of variables is viewed in the same way. However, the discriminant function produces a separation between the classes rather than a line through them. The same qualifications pertaining to distributional assumptions that apply to multiple linear regression apply to the linear discriminant function, but the analysis of data on congestive heart failure, renal insufficiency and anemia partitioned with NT-proBNP, creatinine, age and hemoglobin concentration shown in Figure 6 and Table IV uses a quadratic equation. I re-classify the data using the transformed variable age-normalized 1000*Log(NT-proBNP)/eGFR presented in Figure 7 and Table V. The use of the logarithmic transform and removal of age and hemoglobin as predictors gives impressive results.

Figure  6.

Table IV.

Figure 7.

Table V.

Mahalanobis D²

The Euclidean distance between two points with coordinates $(x_1, y_1)$ and $(x_2, y_2)$ is given by $D = \left([x_1 - x_2]^2 + [y_1 - y_2]^2\right)^{1/2}$. This is generalized for N-dimensional space, and the square of the distance is $D^2$. The two points are the centroids in a cloud of points in space separated by D, the Euclidean distance between the points in an N-dimensional space. The multiplication of a vector and the inverse of the variance-covariance matrix, $T^{-1}$, yields the linear discriminant functions. The Mahalanobis distance can be used to evaluate the distances of centroids and also the distances of objects towards the centroid of their class.
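A minimal two-variable sketch of the squared Mahalanobis distance follows. It assumes the standard definition, in which the vector of deviations from the centroid is multiplied on both sides by the inverse of the variance-covariance matrix; the function name and the numbers are illustrative only. With an identity covariance matrix the result reduces to the squared Euclidean distance.

```python
def mahalanobis_sq_2d(point, centroid, cov):
    """Squared Mahalanobis distance of a 2-D point from a class centroid."""
    dx = point[0] - centroid[0]
    dy = point[1] - centroid[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of the 2x2 variance-covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

point = (3.0, 4.0)
centroid = (0.0, 0.0)

# Identity covariance: identical to the squared Euclidean distance (9 + 16).
d2_identity = mahalanobis_sq_2d(point, centroid, ((1.0, 0.0), (0.0, 1.0)))
# Larger variance along x shrinks the contribution of the x deviation.
d2_scaled = mahalanobis_sq_2d(point, centroid, ((4.0, 0.0), (0.0, 1.0)))
```

Assigning an object to the class whose centroid gives the smallest value is one simple way to use the distance for classification.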

Logistic regression

The linear probability model (logistic regression) is the standard regression model applied to data for which the dependent variable is dichotomous (0,1). It fits a logistic function to the dependent variables valued at 0 or 1 and estimates the probabilities associated with each observation (24).  The predicted values from the model are interpreted as a probability that the response is a 1.  The test of significance of the model is the Maximum Likelihood Estimator (MLE).  The significance is determined by adjusting the parameters to maximize the likelihood of the observed data arising from the linear sum of the variables.

There are problems in using the linear probability model (49).

The residuals don’t have a constant variance so that estimates from regression are not best linear unbiased, therefore, not minimum variance.

Standard errors of regression coefficients can be erroneous giving invalid confidence intervals.

The predicted values from regression can range outside the interval [0,1], whereas probabilities are bounded by that interval.

The linearity assumption inherently imposes constraints on the marginal effects of predictor variables that are not taken into account by the OLS estimation.

The linearity assumption implies that the marginal effect of a predictor is constant across its range.

The usual r squared measure is problematic.
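The range problem in the list above is easy to demonstrate. In this sketch a straight line is fitted by least squares to an invented 0/1 outcome, and its predictions fall below 0 and above 1 at the ends of the range. For contrast, the logistic function is evaluated with hypothetical coefficients that are chosen for illustration and are not fitted by maximum likelihood; its output always stays strictly between 0 and 1.

```python
import math

def ols_fit(x, y):
    """Least squares slope and intercept for the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * yi for xi, yi in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x

def logistic(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical dichotomous outcome observed at eight predictor values.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]

slope, intercept = ols_fit(x, y)
linear_pred = [intercept + slope * xi for xi in x]

# Assumed logit coefficients (-4.5 and 1.0), for illustration only.
logit_pred = [logistic(-4.5 + 1.0 * xi) for xi in x]
```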

Ordinal regression

I now turn to the application of a special nonparametric regression program developed by Jay Magidson (GOLDmineR; Statistical Innovations Inc., Belmont, MA), referred to as ordinal regression, or universal regression (25-28). Let’s look at the application of this tool, which makes outcomes analysis easy. This method brings a powerful tool to the analysis of laboratory data for clinical validation of diagnostic tests. It overcomes serious limitations of logistic analysis when there are more than two possible outcomes to consider. This has become more important as we introduce tests that have results that are affected by morbid conditions so that a range of probabilities might be associated with scaled “dummy values” of the test (possibly because of hidden or unspecified variables).

Ordinal dependent variables are multivalued and have an ordered relationship to the predictor variable(s). Magidson (25-28), inspired by the work of Leo Goodman (29,30), suggests the existence of a single regression model that can accommodate dependent variables of any metric: dichotomous, ordinal, or continuous. This supermodel holds true under the assumption of bivariate normality and under other distributional assumptions, and subsumes linear regression and logistic regression as special cases (25). It uses a log odds model fit, and the odds ratio is obtained from the log(odds ratio). In the linear probability model, the coefficients (bi) are partial correlation coefficients. In the logit model the coefficients are partial log(odds-ratio).

The monotonic regression of Y on X is described by:

$$E(Y \mid X = x) = \sum_{j=1}^{J} P_{j.x}\, y_j$$


Where Pj.x, the conditional probability of the occurrence of Y=yj (an ordinal dependent variable) given X=x (qualitative or quantitative predictor variables), is estimated from a sample of N observations using 2 steps.

1)      Conditional logits $Y_{j.x}$ are predicted using the generalized logit model:

$$Y_{j.x} = a_j + (b_1 x_1^* + b_2 x_2^* + \dots + b_M x_M^*)\, y_j^*, \qquad j = 1, 2, \dots, J.$$

The Y-scores, which determine the ordering and relative spacing of the J outcomes, may be specified or, if unspecified, they are treated as model parameters and estimated with the other parameters. $y_j^*$, the relative Y-score, is the difference between $y_j$ and some Y-reference score $y_0$ defined as a weighted average of the original Y-scores.

2)      The predicted logits are transformed to predicted probabilities using the identity:


$$P_{j.x} \equiv \frac{\exp(Y_{j.x})}{\sum_{k=1}^{J} \exp(Y_{k.x})}$$


For a given X=x, the generalized logit is defined as

$$Y_{j.x} \equiv \ln\left(\frac{P_{j.x}}{P_{0.x}}\right)$$

where Pj.x is the conditional probability of the jth outcome occurring when X=x


and $P_{0.x} = \prod_{j} (P_{j.x})^{e_j}$
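The two steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GOLDmineR implementation: the function names are hypothetical, the intercepts and the single slope are invented, the Y-scores are fixed at 0, 1, 2, and the reference score is taken as their unweighted mean. The code forms the conditional logits, converts them to probabilities with the exponential identity, and then computes the expected score E(Y|X = x).

```python
import math

def outcome_probabilities(alphas, betas, x_star, y_star):
    """Step 1: logits a_j + (sum of b_m * x_m*) * y_j*; step 2: probabilities."""
    linear_part = sum(b * x for b, x in zip(betas, x_star))
    logits = [a + linear_part * ys for a, ys in zip(alphas, y_star)]
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(probs, y_scores):
    """E(Y | X = x) as the probability-weighted sum of the Y-scores."""
    return sum(p * y for p, y in zip(probs, y_scores))

# Assumed values for illustration: three ordered outcomes, one predictor.
y_scores = [0.0, 1.0, 2.0]
y_ref = sum(y_scores) / len(y_scores)        # unweighted reference score
y_star = [y - y_ref for y in y_scores]       # relative Y-scores: -1, 0, 1
alphas = [0.0, 0.0, 0.0]
betas = [0.8]

probs = outcome_probabilities(alphas, betas, [1.0], y_star)
baseline = outcome_probabilities(alphas, betas, [0.0], y_star)
```

With equally spaced Y-scores, the log odds between adjacent outcomes equals the slope times the predictor value, which is the parallel adjacent-odds structure the model relies on.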


I performed a nonparametric regression using the universal regression program GOLDminer, developed by Jay Magidson (25-28) at Statistical Innovations, Belmont, MA.  The universal regression program is a logistic regression if the dependent variable is a binary outcome, and it is a polytomous regression if there are more than two dependent variables, but it can accommodate a paired comparison of covariates.  The measure of association is phi and R2.  The measure of fit is L2 (chi square).  The logarithmic form transforms into a probability model, which we aren’t concerned with here.

Graphical Ordinal Logit Display (GOLDminer)

I have mentioned the nonparametric universal regression of Magidson (25-28), based on work with log-linear modeling with Prof. Leo Goodman (29,30).  The logistic regression and linear regression models can be viewed as special cases of this more general model.  This regression model has greatest use for examining structure in data where there are more than two dependent variables, and the independent variables are scaled to intervals (25-28).  The model is more general than the logistic regression and is not constrained by the conditions encountered with logistic regression identified above.

I cite a number of publications of its use in clinical laboratory outcomes analysis.

Example  1.  The association between predictors of nutrition risk and malnutrition risk

I use here data obtained by Linda Brugler and coworkers at St. Francis Hospital in Wilmington, DE (31) that examine the association between malnutrition assessed before intervention and three predictors of malnutrition risk. Poor oral intake and malnutrition-related diagnosis are categorical, and the laboratory-derived serum albumin is scaled to form an ordinal predictor. The strength of the predictors is given by Table VI:

Table VI.  Ordinal regression model for combined 3 predictors of malnutrition risk.

The model is defined by the following: L² = 267.68, R² = 0.405, phi = 1.1134, df (3, 42), p = 9.7e-58.

Example 2:  Ordinal regression for thalassemia risk

Table VII shows the odds-ratios for the combinatorial scaled results of Mentzer score (ratio of MCV: red cell count), MCV, and Hgb A2(e)(by electrophoresis is higher than by HPLC).  The presence of only a single positive test gives an unlikely result for thalassemia, while two or more positive tests give a high likelihood of thalassemia.   This is summarized as follows: 0,0,0-0,0,1-0,1,0-1,0,0 = 0; 1,1,1-1,0,1-0,1,1-1,1,0 = 1.
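The summary rule stated above (a single positive test is unlikely to indicate thalassemia, two or more make it likely) can be written out as a small lookup. This sketch assumes each test has already been scaled to 0 or 1 and that the order of the triple is Mentzer score, MCV, Hgb A2; the function name is hypothetical.

```python
from itertools import product

def thalassemia_class(mentzer, mcv, hgb_a2):
    """Return 1 when two or more of the three scaled tests are positive."""
    return 1 if (mentzer + mcv + hgb_a2) >= 2 else 0

# Enumerate all eight combinations of the three dichotomized tests.
rule_table = {combo: thalassemia_class(*combo)
              for combo in product((0, 1), repeat=3)}
```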

Table VIII.   Expected Odds Ratios – Diagnosis Thalassemia

Example 3. Ordinal regression for risk of newborn respiratory distress syndrome

A study by Kaplan, Chapman and coworkers (32) extending work by Bernstein and Rundell (33) looked at the relationship between gestational age and RDS of the newborn and used the ordinal regression model to predict expected outcomes (33).  Table IX gives probabilities for the prediction of risk.

Table IX.   Probabilities of RDS given by gestational age and S/A ratio.

Example 4.  Prediction of myocardial infarction risk by EKG and troponin T at 0.1 ng/ml

Bernstein, Zarich and Qamar (34) carried out a study in which the physicians were blinded to the troponin T results. A randomized prospective study of over 800 patients followed (35-37). The chest pain characteristics, EKG findings and troponin T results were reviewed for consecutive patients entered into the study (34). EKG results were scaled as: negative, nonspecific, 0; ST depression or T wave inversion, 1; ST elevation or new Q-wave, 2. Troponin T was scaled as follows: 0-0.075 ng/ml, 0; 0.076-0.099, 1; > 0.1. The diagnoses were as follows: noncardiac, cardiac and nonischemic, 1; unstable angina with MI ruled out, 2; non-ST or ST elevation MI, 3. Table X is the table of odds ratios and probabilities.

Table X. Ordinal regression of EKG and troponin T on diagnoses

Ovarian Cancer Survival

Rosman and Schwartz have reported a relationship between CA125 post-chemotherapy of ovarian carcinomatosis and serum half-life of CA125. We examined a published data set provided by Dr. Martin Rosman. Data were analyzed from 55 women who were treated at Yale University, had an evaluable CA125 half-life (t1/2), and were followed for disease recurrence for at least 3 years. We modeled survival or remission for ovarian cancer using operative findings, stage, and CA125 half-life (46). Figure 9 is a plot of the CA125 elimination half-life vs the Kullback-Leibler distance using the data provided by Dr. Martin Rosman. The K-L distance is the difference between the total entropy of the data in which association is removed and the observed entropy for each value of CA125. The t1/2 is 10 days. What Rudolph and Bernstein (43) have referred to as effective information is the K-L distance. This was done to determine the value of CA125 that best predicts survival.
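The entropy difference described above can be sketched as follows. The code assumes a 2x2 table formed by dichotomizing the half-life at a candidate cutoff against a binary outcome; the entropy with association removed is taken as the sum of the row and column marginal entropies, and the observed entropy is that of the joint table. The function names and the eight patients are invented for illustration and are not the study data.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring empty cells."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def kl_from_independence(table):
    """Entropy with association removed minus the observed joint entropy."""
    total = sum(sum(row) for row in table)
    joint = [[count / total for count in row] for row in table]
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    h_independent = entropy(row_marginals) + entropy(col_marginals)
    h_observed = entropy([p for row in joint for p in row])
    return h_independent - h_observed

def best_cutoff(values, outcomes, candidates):
    """Return (cutoff, information) for the cutoff with the largest distance."""
    best = None
    for cutoff in candidates:
        table = [[0, 0], [0, 0]]
        for value, outcome in zip(values, outcomes):
            table[1 if value > cutoff else 0][outcome] += 1
        info = kl_from_independence(table)
        if best is None or info > best[1]:
            best = (cutoff, info)
    return best

# Hypothetical half-lives (days) and outcomes (1 = relapse), chosen to split cleanly.
half_lives = [4, 6, 8, 9, 12, 15, 18, 22]
relapse = [0, 0, 0, 0, 1, 1, 1, 1]

best = best_cutoff(half_lives, relapse, [5, 10, 15, 20])
```

Scanning the candidate cutoffs and keeping the one with the largest value mirrors the idea of finding the CA125 half-life that carries the most information about outcome.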

Figure 9 CA125 halflife

The next step was to carry out a Kaplan Meier survival plot with Cox regression on the data vs the time to death or remission.  A survival of 30 months is considered a cure.  A survival less is considered a remission.  Some patients died only shortly into chemotherapy.   The study result is shown in Figure 11.
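For readers unfamiliar with the Kaplan Meier plot, the product-limit estimate behind it can be sketched briefly. The follow-up times and event flags below are invented, the function name is hypothetical, and the Cox regression step is not covered by this sketch.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct event time.

    events[i] is 1 if the event occurred at times[i] and 0 if the
    observation was censored at that time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored cases both leave the risk set
    return curve

# Hypothetical follow-up in months; 0 marks a censored observation.
months = [3, 6, 6, 10, 14, 20]
event = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(months, event)
```

Each step multiplies the running survival by the fraction of patients at risk who did not have the event at that time, so censored patients contribute to the risk set only up to the time they leave.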

Figure 10.  Kaplan Meier plot

We also examined the associations between OPERATIVE FINDINGS and CA125 to REMISSION and NONREMISSION or RELAPSE using a universal regression model under bivariate normality with estimation of generalized odds-ratios developed by Jay Magidson (Statistical Innovations, Inc., Belmont, MA).  It uses a parallel log-odds model based on adjacent odds to describe the data.  The universal regression is carried out after scaling the continuous variables with intervals we determined as follows: halflife- 0-5, 6-10, 11-15, 16-20, >20.   A crosstabulation is constructed using the scaled variables as treatment vs. the effect (full, short remission or none), to obtain the frequency tabulation of treatment level vs remission, relapse or nonremission.
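The scaling and crosstabulation steps described above might look like the following sketch. It assumes that the upper bound of each interval is inclusive, so a fractional half-life such as 5.5 days falls in the 6-10 level; the function names, the outcome labels and the seven patients are hypothetical.

```python
from collections import Counter

HALF_LIFE_LABELS = ["0-5", "6-10", "11-15", "16-20", ">20"]

def scale_half_life(days):
    """Map a half-life in days to an ordinal level 0..4."""
    for level, upper in enumerate((5, 10, 15, 20)):
        if days <= upper:
            return level
    return 4

def crosstab(levels, outcomes):
    """Frequency of each (scaled level, outcome) pair."""
    return Counter(zip(levels, outcomes))

# Hypothetical patients (illustrative only).
half_lives = [4, 7, 9, 12, 18, 25, 30]
outcomes = ["rem", "rem", "short", "short", "none", "none", "none"]

levels = [scale_half_life(d) for d in half_lives]
table = crosstab(levels, outcomes)
```

The resulting frequency table is the input the regression works from; the choice of interval boundaries is part of the model specification and changes the table.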

Table XI is a cross-tabulation of the observed and expected outcome frequencies in remission (rem), short remission (short,< 30 months) and non-remission (none) versus the scaled half-lives.   Relapse and failure to achieve remission were combined into one outcome class.  The means and standard error of the means (SEM) of half-life versus remission or non-remission/relapse are effectively separated (F=7.42, p < 0.01) as follows: Remission, 7.9, 2.8, [19];  Relapse/Non-remission, 17.4, 2.05, [36].

Table XII.  Observed and expected odds and odds-ratios of remission, relapse and no response by half-life

Perspective for the Future

Linear regression has been used extensively for methods comparison and for quality control, exclusively based on distributional assumptions and distance from the center of the population sample. This is essential to analytical chemistry principles, but it has reached a limit. The last 30 years have seen the development of very powerful regression tools that are not dependent on distributional assumptions and that move the method into classification and prediction. The development of the Akaike Information Criterion (38-40) brought together two major disciplines that had separate developments, information theory and statistics. The work by Bernstein et al. (41-42) in predicting myocardial infarction using bivariate density estimation, and with Kullback-Leibler distance (43, 44), an extension of work by Rypka (45), is closely related. The use of tables and the scaling of data has been the dominant approach to statistics that uses ordinal and categorical data in outcomes research. This has become a powerful method used in studies of placebo and drug effects. The approach is readily amenable to studies of laboratory tests and outcomes. Outcomes studies will be designed and carried out for laboratory tests that will ask questions appropriate for the clinical laboratory sciences, and that will not be subordinated to pharmaceutical evaluations, which currently have exclusion criteria that are inappropriate for laboratory investigations.


Regression has a long history in the development of modern science since the 18th century. Regression has had a role in the emergence of physics, anthropology, psychology, and chemistry. But its development was initially tied to linear association and the assumption of normal distribution. There are many associations that are tied to the frequency of discrete events. The use of chi-square as a measure of goodness of fit has such a tie to genetic analysis and to classification tables. The importance of outcomes management and the recognition of a multivariable data structure that needs to be explored lead us to a new domain of regression models, which includes an assumption that the dependent variable may not be known with certainty. This is the case with the emerging models known as mixture models, structural equation models and latent class models (LCM). This type of model is not traditionally a regression model and looks at defined variables and also unmeasured, hidden or latent variables (factors) in the model. However, there are factor analysis and regression forms of the LCM that are included in the LCM software releases of Statistical Innovations, Inc. (Latent Gold). This important subject is beyond the scope of this review, but Demidenko (47) has written an excellent text on the subject.


19. Hoel PG. Elementary Statistics, Testing Hypotheses: The difference between two means. Chapter 3.3. pp133-117. 1960. Wiley, New York.

20. Hoel PG. Ibid. Regression. Chapter 9. pp141-153.

21. Norman GR, Streiner DL. Biostatistics: The Bare Essentials. Two repeated observations: The paired t-test and alternatives. Chapter 10. pp89-93. 2000, BC Decker, Hamilton, Ont., Canada.

22. Norman GR, Streiner DL. Ibid. Simple regression and correlation. Chapter 13. pp118-126.

23. Norman GR, Streiner DL. Ibid. Multiple regression. Chapter 14. pp127-137.

24. Norman GR, Streiner DL.Ibid. Logistic regression. Chapter 15. pp139-144.

25.  Magidson J.  “Multivariate Statistical Models for Categorical Data,” Chapters 3 & 4   in Bagozzi R, Advanced Methods of Marketing Research, Blackwell, 1994.

26.  Magidson J. Introducing a new graphical method for the analysis of an ordered categorical response – Part I. Journal of Targeting, Measurement and Analysis for Marketing (UK). 1995; IV(2):133-148.

27.  Magidson J.  Introducing a new graphical model for the analysis of  an ordered categorical response – Part II. Ibid. 1996;IV(3):214-227.

28.  Magidson J.  Maximum likelihood assessment of clinical trials based on an ordered categorical response. Drug information Journal. 1996;30:143-170.

29. Goodman LA. Simple models for the analysis of associations in cross-classifications having ordered categories. Journal of the American Statistical Association. 1979;74:537-552. Reprinted in The Analysis of Cross-Classified Data Having Ordered Categories. 1984, Harvard University Press.

30. Goodman LA.  Association models and the bivariate normal for contingency tables with ordered categories. Biometrika 1981;68:347-355.

31.Brugler L, Stankovic AK, Schlefer M, Bernstein L. A simplified nutrition screen for hospitalized patients using readily available laboratory and patient information. Nutrition 2005;21:650-658.

32. Kaplan LA, Chapman JF, Bock JL, Santa Maria E, Clejan S, et al. Prediction of respiratory distress syndrome using the Abbott FLM-II amniotic fluid assay. Clin Chim Acta 2002;326[1-2]:61-68.

33.  Bernstein LH, Stiller R, Menzies C, McKenzie M, Rundell C. Amniotic fluid    polarization of fluorescence and lecithin/sphingomyelin ratio decision criteria assessed. Yale J Biol Med 1995; 68(2):101-117.

34.  Bernstein LH, Qamar A, McPherson C, Zarich S.   Evaluating a new graphical   ordinal logit method (GOLDminer) in the diagnosis of myocardial infarction utilizing clinical features and laboratory data.   Yale J Biol Med 1999; 72:259-268.

35. Bernstein L, Bradley K, Zarich S. GOLDmineR: Improving Models for Classifying Patients with Chest Pain. Yale J Biol Med 2002;75: 183-198.

36. Zarich S, Bradley K, Seymour J, Ghali W, Traboulsi A, et al. Impact of troponin T determinations on hospital resources and costs in the evaluation of patients with suspected myocardial ischemia. Amer J Cardiol 2001;88:732-6.

37. Zarich SW, Qamar AU, Werdmann MJ, Lizak LS, McPherson CA, Bernstein LH. Value of a single troponin T at the time of presentation as compared to serial CK-MB determinations in patients with suspected myocardial ischemia. Clin Chim Acta 2002;326:185-192.

38. Akaike H. Information theory and an extension of maximum likelihood principle. In B.N. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory. 1973, Akademiai Kiado, pp 267-281, Budapest.

39. Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 1974; 716-723.

40. Dayton CM. Information Criteria for the Paired-Comparisons Problem.  American Statistician. 1998;52: 144-151.

41. Bernstein LH, Good IJ, Holtzman GI, Deaton ML, Babb J:  Diagnosis of myocardial infarction from two enzyme measurements of creatine kinase isoenzyme MB with use of nonparametric probability estimation.  Clin Chem 1989;35:444-7.

42. Bernstein LH, Good IJ, Holtzman GI, Deaton ML, and Babb J. Diagnosis of heart attack from two enzyme measurements by means of bivariate probability density estimation: statistical details. J Statistical Computation and Simulation. 1989.

43. Rudolph RA, Bernstein LH, Babb J. Information-induction for the diagnosis of myocardial infarction. Clin Chem 1988;34:2031-8.

44. Kullback S, Leibler RA. On information and sufficiency. Ann Mathematical Statistics 1951;22:79-86.

45. Rypka EW. Methods to evaluate and develop the decision process in the selection of tests. Clinics in Laboratory Med 1992;12[2]: 351-385.

46. Bernstein LH. Outcomes-based Decision Support: How to Link Laboratory Utilization to Clinical Endpoints. Chapter 8. Pp91-128. In Bissell MG, ed. Laboratory-Related Measures of Patient Outcomes: An Introduction. 2000. AACC Press. Washington, DC.

47. Demidenko E.  Mixture models: Theory and applications.  2004.  Wiley-Interscience. Hoboken, NJ.

48. Martin RF. General Deming regression for estimating systematic bias and confidence interval in method-comparison studies. Clin Chem 2000;46:100-104.

49. Magidson J.  Opportunities grow on trees. A general alternative to linear regression. Monotonic regression of dichotomous, ordinal and grouped continuous dependent variables.  1998. Statistical Innovations, Inc. Belmont, MA.

 Figures and Tables Version 8 Multivariable

Table I.  Regression of eGFR and hemoglobin to predict Nt-proBNP

Step number : 0
R : 0.376
R-square : 0.141


In  Effect    Coefficient  Standard Error  Std. Coefficient  Tolerance  df  F-ratio  p-value
1   Constant
2   eGFR          -83.499          14.063            -0.297      0.951   1   35.256    0.000
3   Hgb          -910.224         260.436            -0.175      0.951   1   12.215    0.001

Information Criteria

AIC 7785.028
AIC (Corrected) 7785.139
Schwarz’s BIC 7800.628


Dependent Variable NTproBNP
N 365
Multiple R 0.376
Squared Multiple R 0.141
Adjusted Squared Multiple R 0.137
Standard Error of Estimate 10287.156

Analysis of Variance

Source              SS   df  Mean Squares  F-ratio  p-value
Regression  6.309E+009    2    3.155E+009   29.809    0.000
Residual    3.831E+010  362    1.058E+008
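The three information criteria reported in Tables I and II are related by simple arithmetic. The sketch below assumes the standard definitions (AIC = 2k - 2 ln L, corrected AIC = AIC + 2k(k+1)/(n-k-1), and BIC = k ln n - 2 ln L) and assumes k = 4 estimated parameters (the constant, two slopes, and the error variance). That value of k is an inference, not something stated in the output, but under it the reported corrected AIC and BIC are reproduced from the reported AIC and N to within rounding.

```python
import math

def aic_corrected(aic, n, k):
    """Small-sample corrected AIC from AIC, sample size n and parameter count k."""
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

def bic_from_aic(aic, n, k):
    """BIC from AIC: replace the 2k penalty with k * ln(n)."""
    return aic - 2.0 * k + k * math.log(n)

K = 4  # assumption: constant, two slopes, error variance

# AIC and N as reported in Table I and Table II.
table_1 = {"aic": 7785.028, "n": 365}
table_2 = {"aic": 4299.786, "n": 360}

aicc_1 = aic_corrected(table_1["aic"], table_1["n"], K)
bic_1 = bic_from_aic(table_1["aic"], table_1["n"], K)
aicc_2 = aic_corrected(table_2["aic"], table_2["n"], K)
bic_2 = bic_from_aic(table_2["aic"], table_2["n"], K)
```

Because the dependent variable differs between the two tables, the criteria can be compared across models only within a table, not between Table I and Table II.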

Table II. Linear regression of NKLog(Nt-proBNP)/eGFR by eGFR and hemoglobin

Log transform flattens the high Nt-proBNP scale and eGFR and age are normalized

In  Effect    Coefficient  Standard Error  Std. Coefficient  Tolerance  df  F-ratio  p-value
1   Constant
2   eGFR           -1.873           0.144            -0.573      0.933   1  170.011    0.000
3   Hgb            -4.259           2.436            -0.077      0.933   1    3.056    0.081

Information Criteria

AIC 4299.786
AIC (Corrected) 4299.899
Schwarz’s BIC 4315.331


Dependent Variable NKLogNTGFR
N 360
Multiple R 0.597
Squared Multiple R 0.357
Adjusted Squared Multiple R 0.353
Standard Error of Estimate 94.260

Regression Coefficients B = (X’X)-1X’Y

Effect    Coefficient  Standard Error  Std. Coefficient  Tolerance        t  p-value
CONSTANT      256.151          27.745             0.000          .    9.232    0.000
MDRD_GFR       -1.873           0.144            -0.573      0.933  -13.039    0.000
Hgb            -4.259           2.436            -0.077      0.933   -1.748    0.081

Table III. One-way ANOVA of D-dimer for positive and negative scans

Dependent Variable D_DIMER
N 817

Analysis of Variance

Source Type III SS df Mean Squares F-ratio p-value
VENDUP 43456570.851 1 43456570.851 68.278 0.000
Error 5.187E+008 815 636461.763

Table IV.   Discriminant function for CHF, renal insufficiency and anemia by age, NT-proBNP, creatinine and hemoglobin

Group Frequencies
0 1 2
135 335 235
Group Means
  0 1 2
NTproBNP (pg/ml) 1516.369 5964.054 12902.662
Creatinine 0.716 1.654 2.103
Hgb 11.972 11.533 11.305
Age 60.570 71.373 74.966
Between Groups F-matrix
df : 4 699
  0 1 2
0 0.000
1 23.445 0.000
2 45.108 11.788 0.000

Classification Functions
  0 1 2
CONSTANT -32.018 -35.196 -37.394


Variable F-to-remove Tolerance
5 NTproBNP 13.489 0.801
6 Creatinine 21.368 0.799
7 Hgb 0.190 0.928
3 Age 38.632 0.948
Test Statistic
Statistic Value Approx. F-ratio df (numerator, denominator) p-value
Wilks’s Lambda 0.778 23.337 8 1398 0.000
Pillai’s Trace 0.226 22.295 8 1400 0.000
Lawley-Hotelling Trace 0.279 24.382 8 1396 0.000

Table V.  The DFA calculations for Figure 9.

Group Frequencies
  0 1 2
221.000 631.000 571.000
Group Means
  0 1 2
NKLgNTproGFRe 15.589 55.971 81.159
MDRD 123.130 61.940 48.748
Group 0 Discriminant Function Coefficients
MDRDest Constant
NKLgNTproGFRe -0.015
MDRD -0.001 0.000
Constant 0.588 0.052 -15.590
Group 1 Discriminant Function Coefficients
MDRDest Constant
NKLgNTproGFRe 0.000
MDRD 0.000 -0.001
Constant 0.024 0.089 -12.106
Group 2 Discriminant Function Coefficients
MDRDest Constant
NKLgNTproGFRe 0.000
MDRD 0.000 -0.001
Constant 0.015 0.147 -13.077
Between Groups F-matrix
df : 2 1419
  0 1 2
0 0.000
1 236.650 0.000
2 335.228 21.342 0.000

Classification Matrix (Cases in row categories classified into columns)
  0 1 2 %correct
0 206 15 0 93
1 237 363 31 58
2 69 459 43 8
Total 512 837 74 43
Jackknifed Classification Matrix
  0 1 2 %correct
0 205 16 0 93
1 237 363 31 58
2 69 462 40 7
Total 511 841 71 43
Test Statistic
Statistic Value Approx. F-ratio df (numerator, denominator) p-value
Wilks’s Lambda 0.671 156.542 4 2838 0.000
Pillai’s Trace 0.330 140.347 4 2840 0.000
Lawley-Hotelling Trace 0.488 173.026 4 2836 0.000
Canonical Discriminant Functions
  1 2
Constant -1.912 -1.075
NKLgNTproGFRe 0.001 0.009
MDRD 0.028 0.008
Canonical Discriminant Functions : Standardized by Within Variances
  1 2
NKLgNTproGFRe 0.085 1.061
MDRD 1.026 0.284
Canonical Scores of Group Means
  1 2
0 1.576 0.034
1 -0.122 -0.069
2 -0.476 0.063

Table VI  Ordinal regression model for combined 3 predictors of malnutrition risk.

Predictor                                              L2                     p                      exp(beta)

Poor oral intake                                    60.29               8.2e-15              5.3

Malnutrition related condition    46.29               1.0e-11              3.06

Albumin                                                152.01             6.3e-35              3.16

Table VII.   Expected Odds Ratios – Diagnosis Thalassemia


Me,M,A2(e)                 Thalassemia

1,1,1                                 9713

1,1,0                                 1696

1,0,1                                   263

0,1,1                                   212          

1,0,0                                     46

0,1,0                                     37

0,0,1                                       6

0,0,0                                       1

Table VIII.   Probabilities of RDS given by gestational age and S/A ratio.

Dependent variable: Respiratory outcome (Resp_Sca)

Predictors: Surfactant to albumin (S/A) Ratio_45: 0, > 45; 1, 21-44; 2, < 21;

Gestational age at delivery: 0, > 36; 1, 34-36; 2, < 34.

S/A Ratio_45               p = 8.7e-22

Gestational Age at Delivery Scaled        p = 4.2e-9

Combined variables: ChiSq = 130.14,   p = 5.1e-28,   R² = 0.433,   phi = 0.8231,   exp(beta) = 2.16 (S/A),   1.88 (GA)

Definition (S/A, GA) Exp. Probabilities Exp. Odds-Ratios
0-20, < 34 0.84 4427
0-20, 34-36 0.64 668
21-44, < 34 0.57 441
0-20, > 36 0.31 101
21-44, 34-36 0.25 67
> 45, < 34 0.19 44
21-44, > 36 0.06 10
> 45, 34-36 0.04 7
> 45, > 36 0.01 1

Table IX. Ordinal regression of EKG and troponin T on diagnoses

Association Summary               L²                     df         p-value             R²        phi

Explained by Model                  206.52             2          1.4e-45            0.686   1.3856

Residual                                         48.64               14        1.0e-5

Total                                               255.16             16        4.5e-45

Odds Ratios and probabilities for diagnoses

average                        1                2                              0          1          2

score                            0.00        0.00

2,3       2.87                             466.82   10086.03         0.01     0.11     0.88

2,2       2.67                             105.78    1087.95           0.04     0.20     0.75

1,3       2.64                             95.35          931.05            0.05     0.21     0.74

2,1       1.95                             23.97          117.35            0.26     0.27     0.47

1,2       1.87                             21.61           100.43           0.29     0.26     0.45

0,3       1.79                             19.48             85.95           0.32     0.26     0.42

1,1       0.67                             4.90                10.83          0.73     0.15     0.12

0,2       0.61                             4.41                   9.27          0.75     0.14     0.11

0,1       0.12                             1.00                 1.00            0.95     0.04     0.01

Table X.  Observed and expected odds and odds-ratios of remission, relapse and no response by half-life

Half-life          exp. odds     exp. odds-ratios

(range, days)      Rem    short    none     Rem   short  none

> 20                           1      4.16    17.11      1    12.49  56.07

16-20                         1      2.21     4.84      1     6.64  44.16

11-15                          1      1.18     1.37      1     3.53  12.49

6-10                            1     0.63     0.39      1     1.88   3.53

< 6                               1     0.33     0.11       1       1         1

HL-ref                         1    0.33      0.11       1       1         1

Figure 1.  log_NT-proBNP vs eGFR

Figure 2.   Boxplots of NT-proBNP and WHO criteria

Figure 3.  NT-proBNP vs Hb

Figure 4.   3D plot of NT-proBNP, MDRD eGFR, Hb

Figure 5.   3D plot of Normalized K*Log_NTproBNP/eGFR, eGFR, Hb

Figure  6. D-dimer Confidence Intervals vs Imaging

Figures 7 & 8.   Canonical Scores Plots

Figure 9.  Entropy Plot of CA125 half-life (x) vs Effective Information (Kullback Entropy), showing a sharp drop in entropy at 10 days (equivalent to information added to resolve uncertainty).  As developed by Rosser R. Rudolph.

Figure 10.  Kaplan Meier Plot of CA125 half-life vs Survival in Ovarian Cancer


Normality and the Parametric Paradigm

Larry H. Bernstein, MD


This article is about the measures of central tendency and of dispersion of values around the center (mean or median), which underpin the parametric methods for comparing 2 or more sets of data.  More importantly, it is the beginning of a statistical journey.  The clinical laboratory deals with large volumes of patient data.  The parametric approach is of limited use there and is prone to problems introduced in the clinical domain.  Consequently, Galen and Gambino introduced the concept of predictive value and the effect of prevalence in a Bayesian context in Beyond Normality. These calculations work from tables, the same tables that are used for sensitivity and specificity and for calculating a chi squared probability.  The subsequent influence of epidemiology went further in introducing odds and odds ratios.  The third and last article will address the more recent advances that go beyond Beyond Normality.  These improvements have all come about through the development of a powerful statistical methodology that is not constrained by the parametric paradigm and is well developed for hypothesis generation and validation, not just the testing of simple hypotheses.

We have grown up with the normal curve and have incorporated it into our thinking, not just our work.  Even the term Six Sigma for the reduction of errors refers to the classical “normal curve” associated with Johann Carl Friedrich Gauss (1777–1855).  The normal or “bell shaped” curve is a plot of numerical values along the x-axis against the frequency of their occurrence on the y-axis.  If the measurements arise as random and independent events, we refer to the data as parametric, and the distribution of the values is a bell shaped curve with the mean or arithmetic average at the center, with about 68% of the sample contained within 1 standard deviation of the mean, and with all but 2.5% of the values at each end (95% in total) contained within about 2 standard deviations.   The reference to normality has been used with respect to student test scores, coin flipping and games of chance, investment, and, in our experience, errors of quality controlled measurements.  The expected value we refer to as the mean (closest to the true value), and the distance from the mean (or scatter) we refer to as dispersion, measured as the standard deviation.  Viewed in this light, we can convert the curve from one with an actual mean to a standard normal curve with a mean of “0” at the center and with distances from “0” measured in standard deviations.   An example that fits this model badly is the distribution of serum AST measurements in a large unselected population enrolled in a clinical trial.  The AST distribution has a long tail of high values, which we call skewness to the right, so the behavior we are looking for is better described after a log transformation of the values, which reduces the skewness.  This is illustrated by the comparison of AST and log(AST) in Figure 1.
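
The properties of the normal curve described here can be checked numerically. The sketch below is illustrative only: the "AST" values are simulated from a log-normal distribution with assumed parameters (they are not the clinical trial data), and the function names are hypothetical. It computes the fraction of a normal distribution within a given number of standard deviations, converts values to the standard normal scale, and compares skewness before and after a log transformation.

```python
import math
import random
import statistics


def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]


def skewness(values):
    """Moment-based skewness: near 0 if symmetric, positive for a long right tail."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


within_1_sd = normal_coverage(1.0)       # about 0.683
within_1_96_sd = normal_coverage(1.96)   # about 0.950

# Simulated right-skewed "AST" values; the parameters are assumptions.
rng = random.Random(0)
simulated_ast = [rng.lognormvariate(3.2, 0.5) for _ in range(5000)]
log_ast = [math.log(v) for v in simulated_ast]
standardized = z_scores(log_ast)

print(round(within_1_sd, 3), round(within_1_96_sd, 3))
print("skewness raw:", round(skewness(simulated_ast), 2))
print("skewness log:", round(skewness(log_ast), 2))
```

On simulated data of this kind the raw values show a clearly positive skewness, while the log-transformed values are close to symmetric, which is the behavior the log(AST) comparison is meant to show.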

What has not been said is that we view a reference range in terms of a homogeneous population.  This means that while the values might not all be the same, they are scattered around the mean and become less frequent as the distance from the mean grows, so that we can describe a mean and a 95% interval around the mean.  In mineralogy we can measure physical elements whose structure is defined by its relationship to spectral lines.  Hence, the scatter about the mean is very small because of the precise measurements, even though the quantity may be very small.  This is not necessarily the case with clinical laboratory measurement because of hidden variables, such as age, diurnal variation, racial factors, and disease.  One way to level the playing field is to compose uniform specimens for quality control that are representative of a population for comparison of laboratory measurements among many laboratories, which is established practice.  What is assumed is that a “normal” population is the population that is found after we remove bias, or contamination of the population by the hidden variable effects mentioned above.  Parametric statistics is therefore actually a comparison of one or more populations with the hypothetical normal population.   The test of significance is a comparison of A and B under the assumption that they are sampled from the same population; when they are found to have different means and confidence intervals by a t-test or an analysis of variance, we reject the “null hypothesis” and conclude that they are different based on a p (significance) of less than 5%.   There are basic assumptions that are required when we use the parametric paradigm: the samples have the same distribution, that distribution is normal, the variances are the same, and the errors are independent.
Consequently, when comparing 2 samples, as for a placebo and a test drug, these assumptions must hold (which is inherent in the logistic regression).   When we run quality control material, the confidence lines that we use are equivalent to a normal curve turned on its side.  When doing the t-test, the parametric limitations have to be respected.  A practical consequence is that a minimum of about 40 samples is required, because as N reaches 40 and beyond the data are more likely to fit a normal distribution.  This is a daily phenomenon in laboratories globally: it takes about 10 – 14 days to be confident about the reference range for a new lot of quality control material, whether the control level is high, low or normal.  Nevertheless, we have to ask whether we can use a small sample size to validate the reference range of a population sample.  The answer is not so simple.  One can minimize sampling bias by taking a sample of blood donors who are prescreened for serious medical conditions.  The use of laboratory staff donors historically introduced selection bias when the staff was uniformly younger. On the other hand, the amount of computing power readily available to the average practitioner has substantially improved in the last 5 years, and middleware may offer a further opportunity for improvement.  One can download a file with two weeks of results for any test, review it, and exclude outliers relative to the established values for the method.  The remaining sample is substantial, with at least 1,000 patients to work with.  Another method would use a nonparametric adjustment of the data by randomly removing one patient at a time and recalculating. Here we are not concerned with distributional assumptions and population parameters. We work only with the data, and we observe the effects of recalculation.   That is an uncommon and unfamiliar approach.
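
The idea of removing one patient at a time and recalculating can be sketched as follows. This is one possible reading of that approach, not a description of any laboratory's actual procedure: the results are simulated, the nonparametric reference limits are taken as the 2.5th and 97.5th percentiles, and the function names are hypothetical.

```python
import random
import statistics


def central_95_limits(values):
    """Nonparametric reference limits: the 2.5th and 97.5th percentiles."""
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


def random_leave_one_out(values, repeats, rng):
    """Drop one randomly chosen result at a time and recalculate the limits."""
    recalculated = []
    for _ in range(repeats):
        drop = rng.randrange(len(values))
        kept = values[:drop] + values[drop + 1:]
        recalculated.append(central_95_limits(kept))
    return recalculated


# Simulated two-week download of about 1,000 results (assumed distribution).
rng = random.Random(42)
patient_results = [rng.lognormvariate(3.2, 0.4) for _ in range(1000)]

full_low, full_high = central_95_limits(patient_results)
recalculated = random_leave_one_out(patient_results, repeats=200, rng=rng)

lows = [low for low, _ in recalculated]
highs = [high for _, high in recalculated]
low_spread = max(lows) - min(lows)
high_spread = max(highs) - min(highs)

print("limits from all results:", round(full_low, 2), round(full_high, 2))
print("spread of lower limit:", round(low_spread, 3))
print("spread of upper limit:", round(high_spread, 3))
```

No mean, standard deviation, or normality assumption enters the calculation. The spread of the recalculated limits is the observed effect of recalculation: with a sample of this size it should be small compared with the width of the range itself.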

We proceed to the important problem of comparing 2 variables.  Figure 1 is a bivariate plot of data with log(AST) on one axis and log(ALT) on the other.  The result is a scattergram with 95 and 99 percent confidence limits for a reference range formed from two liver tests that meet the parametric constraints.   The scattergram shown in Figure 2 may show correlation: method A and method B are distinctly different but have a linear association between them.  The parametric assumption holds, and the confidence interval along the so-called regression line is determined by ordinary least squares (OLS) regression.  Regression is worthy of a separate discussion.
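
A bivariate reference region of this kind can be sketched under the assumption that the 95 and 99 percent limits are ellipses from a bivariate normal model, which is a common construction but is not confirmed by the text. The log(AST) and log(ALT) values below are simulated with an assumed correlation, and the function names are hypothetical. A point lies inside the p-level ellipse when its squared Mahalanobis distance from the center is below the chi squared cutoff with 2 degrees of freedom, which has the closed form -2 ln(1 - p).

```python
import math
import random


def summarize(x, y):
    """Return the center (means) and the variances and covariance of two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance of a point from the center (2 variables)."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - center[0], point[1] - center[1]
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def chi2_2df_cutoff(p):
    """Chi squared quantile for 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)


# Simulated correlated log(AST) and log(ALT); all parameters are assumptions.
rng = random.Random(1)
rho = 0.7
log_ast, log_alt = [], []
for _ in range(2000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    log_ast.append(3.1 + 0.30 * z1)
    log_alt.append(3.0 + 0.35 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))

center, cov = summarize(log_ast, log_alt)
d2 = [mahalanobis_sq(pt, center, cov) for pt in zip(log_ast, log_alt)]
inside_95 = sum(d <= chi2_2df_cutoff(0.95) for d in d2) / len(d2)
inside_99 = sum(d <= chi2_2df_cutoff(0.99) for d in d2) / len(d2)

print(round(chi2_2df_cutoff(0.95), 3), round(chi2_2df_cutoff(0.99), 3))
print(round(inside_95, 3), round(inside_99, 3))
```

When the data meet the parametric constraints, the fractions of points falling inside the two ellipses should be close to 0.95 and 0.99. A marked shortfall would suggest skewness or a mixed population.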

The next topic is comparing two classes of subjects that we expect to be different because of effects on each group.  This can be represented by a plot of the means and standard deviations for patients with ovarian cancer who underwent chemotherapy and either had no or short remission, or had a remission of 20 months, defining treatment success (Figure 2).   The comparison of means is significant at p < 0.01 using the t-test (Figure 3).   But what if we were to take the same data and compare the patients with no remission, short remission, and complete remission?  One would do the one-way analysis of variance (ANOVA1), which uses the F test (Fisher’s variance ratio).  When only two groups are compared, F is the same as t squared, or t is the square root of F.  The result would again be significant at p < 0.01.
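
The relationship between t and F for two groups can be verified directly. The sketch below uses made-up values for two hypothetical groups (not the ovarian cancer data) and computes the pooled-variance t statistic and the one-way ANOVA F ratio from their definitions; the function names are illustrative.

```python
import math
import statistics


def pooled_t(a, b):
    """Two-sample t statistic with a pooled (equal) variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def one_way_f(groups):
    """One-way ANOVA F ratio: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up measurements for two hypothetical groups.
no_or_short_remission = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7]
long_remission = [6.2, 4.9, 7.5, 5.1, 6.8]

t_value = pooled_t(no_or_short_remission, long_remission)
f_value = one_way_f([no_or_short_remission, long_remission])

print("t squared:", round(t_value ** 2, 6))
print("F:        ", round(f_value, 6))
```

The two printed values agree. The identity holds only for two groups and the pooled-variance t-test; with three groups, such as no, short, and complete remission, there is a single F ratio but no single t to compare it with.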

This is a light review of very important methods used in both clinical and research laboratory studies.  They have a history of widespread use going back at least 5 decades, and they were certainly used in experimental physics before biology, although it is from biological observations that we have Fisher’s discriminant function, which gives a linear distance between classes based on measured variables, e.g., petal length and petal width.  The discussion to follow will be concerned with tables and the chi squared distribution.
