Regression: A richly textured method for comparison and classification of predictor variables: The multivariable Case
Author: Larry H. Bernstein, MD
e-mail: plbern@yahoo.com.
Keywords: bias correction, chi square, linear regression, logistic regression, loglinear analysis, multivariable regression, normal distribution, odds ratio, ordinal regression, regression methods
Abstract
Multivariate statistical analysis extends regression analysis to two or more predictors. In this case a multiple linear regression or a linear discriminant function is used to predict a dependent variable from two or more independent variables. If there is a linear association, dependency of the variables is assumed, and the test of hypotheses requires that the predictors are normally distributed. A method using a log-linear model, called ordinal regression, circumvents the problem of the distributional dependency. There is also a relationship to analysis of variance, a method of examining differences between the means of two or more groups. Then there is linear discriminant analysis, a method by which we examine the linear separation between groups rather than the linear association between groups. Finally, the neural network is a nonlinear, nonparametric model for classifying data with several variables into distinct classes. In this case we might imagine a curved line drawn around the groups to divide the classes. The focus of this discussion is the use of linear regression, together with an exploration of other methods for classification purposes.
Introduction
Multivariate statistical analysis extends regression analysis and introduces combinatorial analysis for two or more predictors. Multiple linear regression or a linear discriminant function is used to predict a dependent variable from two or more independent variables. If there is a linear association, dependency of the variables is assumed, and the test of hypotheses requires that the predictors are normally distributed. Linear discriminant analysis examines the linear separation between groups rather than the linear association between groups, and it also requires adherence to distributional assumptions. Analysis of variance, a method of examining differences between the means of two or more groups, is related as a special case of linear regression. A method using a log-linear model, called ordinal regression, circumvents the problem of the distributional dependency. Finally, the neural network is a nonlinear, nonparametric model for classifying data with several variables into distinct classes. In this case we might imagine a curved line drawn around the groups to divide the classes.
Regression analysis.
The use of linear regression, linear discriminant analysis and analysis of variance has to meet the following assumptions:
The variables compared are assumed to be independent measurements.
The correlation coefficient is a useful measure of the strength of the relationship between two variables only when the variables are linearly related.
The correlation coefficient is bounded in absolute value by 1.
All points on a straight line (of nonzero slope) imply a correlation of 1 in absolute value.
A correlation of 1 in absolute value implies all points are on a straight line.
A high correlation coefficient between two variables does not imply a direct effect of one variable on another (both may be influenced by a hidden explanatory variable).
The correlation coefficient is invariant to linear transformations of either variable (apart from a change of sign when a variable is multiplied by a negative constant).
The correlation coefficient is also expressed as the covariance or product of the deviations of X and Y from their means standardized by dividing by the respective standard deviations.
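The properties listed above can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function name `pearson_r` and all data values are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Sum of cross-products of deviations, standardized by both spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_line = [2.0 * v + 1.0 for v in x]      # points exactly on a straight line
y_noisy = [2.9, 5.2, 6.8, 9.1, 11.3]     # illustrative values, not study data

r_line = pearson_r(x, y_line)            # equals 1 up to rounding
r_noisy = pearson_r(x, y_noisy)          # bounded by 1 in absolute value
# A positive linear rescaling of x leaves the coefficient unchanged.
r_rescaled = pearson_r([10.0 * v + 3.0 for v in x], y_noisy)
```

The rescaling uses a positive multiplier; a negative multiplier would flip the sign of the coefficient while leaving its magnitude unchanged.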
These assumptions may be valid if the amount of data compared is very large, and if the data is parametric. This is not necessarily the case. There are also special applications in laboratory evaluations and crossover studies between methods and instruments that require correction for bias or for differences in the error variance term.
How do we measure the difference, if there is any? We use the t-test (19, 21). If t is small, then the null hypothesis is satisfied and no difference is detected in the means. The conclusion is that the null hypothesis is accepted and the means are essentially the same. However, the ability to accept or reject the null hypothesis is dependent on sample size, or power. If the null hypothesis is rejected, bias has to be suspected. This is useful when analyzing certain data, where the results of ordinary least squares (OLS) are unsatisfactory. This test is here applied to linearly increasing values of Y on X measured by methods A and B. Of course the measurements are plotted and a line is fitted to the scatterplot. OLS gives the fit of the line based on the least squares error, where the slope of the line is given by (20, 22):
$$B = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})\, y_{i}}{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}$$
It is assumed that there are n pairs of values of x and y, and x_{i} and y_{i} denote the ith pair of values. The slope defines the regression line of y on x. An intercept that differs from zero is the bias. It is worthwhile to mention that there is a difference here between the correlation measurement and the least squares fit of y on x. We are measuring X by methods A and B. We can then determine that there is a linear association with r valued between 0 and 1 (negative values excluded). In the case of the regression model, we are predicting B from A by plotting B on y against A on x. Of course, experimentally, we are expecting the prediction to hold over a range of measurements, and the agreement drops off at some value of the coordinates (x_{i}, y_{i}).
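As a hedged illustration of the least squares fit in a method comparison, the sketch below computes the slope as the sum of (x_{i} - mean x)y_{i} divided by the sum of (x_{i} - mean x)^{2}, and then the intercept. The function name `ols_line` and the method A and B values are assumptions made for the example; method B is constructed with a constant bias of 0.5.

```python
def ols_line(x, y):
    """Least squares slope and intercept for the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((xi - mean_x) * yi for xi, yi in zip(x, y))
    denominator = sum((xi - mean_x) ** 2 for xi in x)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x   # a nonzero intercept indicates bias
    return slope, intercept

method_a = [1.0, 2.0, 3.0, 4.0, 5.0]           # hypothetical results by method A
method_b = [0.5 + 1.0 * v for v in method_a]   # method B reads 0.5 higher
slope, intercept = ols_line(method_a, method_b)
```

Here the fitted slope is 1 and the intercept recovers the constant offset between the two methods.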
Multiple regression is an extension of linear regression where the dependent variable is predicted by several independent variables.
In this case, the extended equation is (23)
$$Y = b_{0} + b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{3} + \dots + b_{n}x_{n}$$
The model assumes a linear relationship between many predictor variables and the dependent variable. The model usually assumes that the independent variables are not correlated with each other, which may not be the case. The model can be tested by stepwise removal of predictor variables to assess their contribution to the model. The model is considered to be parametric, and so it requires that the inputs are normally distributed. The bs (or betas) are partial regression coefficients: each gives the change in Y for a unit change in its predictor with the other predictors held constant. The partial F test is the measure of the contribution of each variable after all the variables are in the equation.
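A minimal sketch of how the coefficients of a multiple regression can be estimated from the normal equations, B = (X'X)^{-1}X'Y, follows. It is an illustration under stated assumptions: the helper names `solve` and `fit_multiple_regression` are hypothetical, the data are synthetic, and a statistical package would normally be used instead of hand-written elimination.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_multiple_regression(rows, y):
    """Return [b0, b1, ..., bn] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]   # leading 1 carries the constant
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data generated from Y = 3 + 2*x1 - 1*x2, so the fit is exact.
predictors = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
response = [3.0 + 2.0 * x1 - 1.0 * x2 for x1, x2 in predictors]
b0, b1, b2 = fit_multiple_regression(predictors, response)
```

When predictors are strongly correlated with each other, X'X becomes nearly singular and the estimates become unstable, which is the practical reason for the tolerance values reported in the tables.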
Figures 1-3 are scatterplots of eGFR (glomerular filtration rate calculated by the MDRD equation) and of hemoglobin vs Nt-proBNP, and a boxplot of Nt-proBNP by WHO criteria for anemia. Figure 2 is a 3D plot of NT-proBNP spliced by eGFR and hemoglobin. The linear regression model is presented in Table I. The correlation coefficient for the model (R = 0.376) is weak, but not insignificant. What do you think is the effect of the large variance in the dependent variable? Figure 3 is a 3D plot of the eGFR and hemoglobin vs a transformed variable, age-normalized 1000*Log(Nt-proBNP)/eGFR. The variance is reduced on the transformed variable. Table II is the regression model on the data. The correlation coefficient is improved (R = 0.597).
Analysis of variance (ANOVA) and Analysis of covariance (ANCOVA)
ANOVA is used if the dependent variable is continuous and all of the independent variables are categorical. One-way ANOVA is used for a single independent variable, and multi-way ANOVA is used for multiple independent variables. The ANOVA is based on the general linear model. The F-test is used to compare the difference between the means of groups. The independent variable takes discrete values and is used to define the groups, not as a measure. The t-test can be used between each pair in the groups. The goal of ANOVA is to explain the total variation found in the study. An example of this application is shown in Figure 4.
Figure 4. BNP determined within ejection fraction above or within 40
Figure 5 shows the means and 95% confidence intervals for a comparison of D-dimer between positive and negative venous duplex scans. There are only two variables, so the corresponding ANOVA is one-way. The F-value is high and corresponds to a high t in the t-test. F is the same as t^{2} and p = 0.0001 (Table III). Our interest here is in multiple variables, so we will not pursue the discussion of difference testing between two variables.
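The statement that F equals t^{2} when only two groups are compared can be verified directly. The sketch below is illustrative: the function names `pooled_t` and `one_way_f` are assumptions, and the values are made up rather than taken from the D-dimer study.

```python
import math

def pooled_t(g1, g2):
    """Two-sample t statistic using the pooled variance estimate."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((v - m1) ** 2 for v in g1)
    ss2 = sum((v - m2) ** 2 for v in g2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))

def one_way_f(groups):
    """One-way ANOVA F ratio: between-group over within-group mean square."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

negative_scan = [210.0, 340.0, 290.0, 400.0, 310.0]            # hypothetical
positive_scan = [900.0, 1250.0, 1100.0, 980.0, 1400.0, 1020.0]  # hypothetical
t = pooled_t(negative_scan, positive_scan)
f = one_way_f([negative_scan, positive_scan])
```

The identity holds for the pooled-variance t-test; it does not hold for the unequal-variance (Welch) form of the test.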
Figure 5.
Table III.
If some of the independent variables are categorical (nominal, ordinal or dichotomous) and some are continuous, ANCOVA is used. The ANCOVA procedure first adjusts the dependent variable on the basis of the continuous independent variable and then does ANOVA on the adjusted dependent variable.
Generalized linear and generalized additive models
Generalized linear models transform the response by assuming that a transformation of the expected response is a linear function of the predictor variables. The variance of the response is a function of the mean response. When the relationship between the parameters is not linear, a generalized linear model can’t be used. A generalized additive model can be used to fit nonlinear data-dependent functions of the predictor. Tree-based models are used for exploratory analysis and are related to clustering, which is a method for studying the structure of the data, creating clusters of data with similar characteristics.
Discriminant analysis
Discriminant analysis is a modification of the general linear regression model. The method is used to assign data to one of several distinct classes, which form the dependent variable. The linear regression model predicts based on a linear relationship between the dependent and the independent variables. They are codependent. In the discriminant function they are independent. The function determines a separation between the classes to which patients are assigned on the basis of the data. The goal is to assign a new incoming patient, based on the independent variables, to one of the different groups. The mathematical function can be linear, quadratic, or another function. The stepwise linear regression with removal or addition of variables is viewed in the same way. However, the discriminant function produces a line of separation between the classes rather than a line through them. The same qualifications pertaining to distributional assumptions that apply to multiple linear regression apply to the linear discriminant function, but the analysis of data on congestive heart failure, renal insufficiency and anemia partitioned with NT-proBNP, creatinine, age and hemoglobin concentration shown in Figure 6 and Table IV uses a quadratic equation. I re-classify the data using the transformed variable age-normalized 1000*Log(NT-proBNP)/eGFR, presented in Figure 7 and Table V. The use of the logarithmic transform and removal of age and hemoglobin as predictors give impressive results.
Figure 6.
Table IV.
Figure 7.
Table V.
Mahalanobis D^{2}
The Euclidean distance between two points with coordinates (x_{1}, y_{1}) and (x_{2}, y_{2}) is given by D = ([x_{1} - x_{2}]^{2} + [y_{1} - y_{2}]^{2})^{1/2}. This is generalized for N-dimensional space, and the square of the distance is D^{2}. The two points are the centroids of clouds of points in space separated by D, the Euclidean distance between the points in an N-dimensional space. The multiplication of a vector and the inverse of a variance-covariance matrix, T^{-1}, yields the linear discriminant functions. The Mahalanobis distance can be used to evaluate the distances between centroids and also the distances of objects from the centroid of their class.
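The relation between the Euclidean and Mahalanobis distances can be sketched for two variables. This is an illustrative example under assumptions: the function names and the covariance values are hypothetical, and the 2 x 2 matrix is inverted in closed form.

```python
def euclidean_d2(p, q):
    """Squared Euclidean distance between two points in the plane."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mahalanobis_d2(point, centroid, cov):
    """Squared Mahalanobis distance d' T^{-1} d for a 2 x 2 covariance matrix T."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = point[0] - centroid[0]
    dy = point[1] - centroid[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

origin = (0.0, 0.0)
identity = ((1.0, 0.0), (0.0, 1.0))
# With an identity covariance matrix the two squared distances coincide.
d2_identity = mahalanobis_d2((3.0, 4.0), origin, identity)
# A variance of 4 on the first axis shrinks distances measured along it.
d2_scaled = mahalanobis_d2((2.0, 0.0), origin, ((4.0, 0.0), (0.0, 1.0)))
```

The second case shows why the Mahalanobis distance is preferred for classification: a displacement of 2 units along a variable with standard deviation 2 counts as one unit of distance.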
Logistic regression
Logistic regression is the standard regression model applied to data for which the dependent variable is dichotomous (0,1). It is the usual alternative to the linear probability model, in which ordinary least squares is applied directly to the 0/1 outcome. It fits a logistic function to the dependent variable valued at 0 or 1 and estimates the probabilities associated with each observation (24). The predicted values from the model are interpreted as the probability that the response is a 1. The model is fitted by maximum likelihood estimation (MLE): the parameters are adjusted to maximize the likelihood of the observed data arising from the linear sum of the variables.
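To make the maximum likelihood idea concrete, here is a minimal sketch that fits a one-predictor logistic model by gradient ascent on the log-likelihood. It is an assumption-laden illustration: statistical packages use Newton-type iterations rather than a fixed-step ascent, and the function names, step size, and data are hypothetical.

```python
import math

def logistic(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, x, y):
    """Log-likelihood of 0/1 outcomes under the logistic model."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = logistic(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def fit_logistic(x, y, rate=0.01, steps=20000):
    """Adjust b0 and b1 along the gradient of the log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(steps):
        residuals = [yi - logistic(b0 + b1 * xi) for xi, yi in zip(x, y)]
        b0 += rate * sum(residuals)
        b1 += rate * sum(r * xi for r, xi in zip(residuals, x))
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # hypothetical predictor values
y = [0, 0, 1, 0, 1, 0, 1, 1]                   # hypothetical 0/1 outcomes
b0, b1 = fit_logistic(x, y)
fitted = [logistic(b0 + b1 * xi) for xi in x]
```

The outcomes are deliberately not perfectly separated by x; with perfect separation the likelihood has no finite maximum and the coefficients grow without bound.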
There are problems in using the linear probability model (49).
The residuals don’t have a constant variance, so the estimates from regression are not best linear unbiased and therefore not minimum variance.
Standard errors of regression coefficients can be erroneous, giving invalid confidence intervals.
The predicted values from regression can range outside the interval [0,1], whereas probabilities are bounded by that interval.
The linearity assumption inherently imposes constraints on the marginal effects of predictor variables that are not taken into account by the OLS estimation.
The linearity assumption implies that the marginal effect of a predictor is constant across its range.
The usual r squared measure is problematic.
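One of the problems listed above, predicted values falling outside [0,1], is easy to reproduce. The following sketch fits an ordinary least squares line to a 0/1 outcome; the function name `ols_line` and the data are illustrative assumptions.

```python
def ols_line(x, y):
    """Least squares slope and intercept for the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * yi for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return slope, mean_y - slope * mean_x

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # hypothetical predictor values
y = [0, 0, 0, 0, 1, 1, 1, 1]                   # dichotomous outcome
slope, intercept = ols_line(x, y)
predictions = [intercept + slope * xi for xi in x]
# The fitted line gives a negative "probability" at x = 0 and a value
# greater than 1 at x = 7, which no probability can take.
```

The constant slope also illustrates the point about marginal effects: the fitted change in predicted value per unit of x is the same everywhere along the range.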
Ordinal regression
I now turn to the application of a special nonparametric regression program developed by Jay Magidson (GOLDmineR; Statistical Innovations Inc., Belmont, MA), referred to as ordinal regression, or universal regression (25-28). Let’s look at the application of this tool, which makes outcomes analysis easy. This method brings a powerful tool to the analysis of laboratory data for clinical validation of diagnostic tests. It overcomes serious limitations of logistic analysis when there are more than two possible outcomes to consider. This has become more important as we introduce tests that have results that are affected by morbid conditions so that a range of probabilities might be associated with scaled “dummy values” of the test (possibly because of hidden or unspecified variables).
Ordinal dependent variables are multivalued and have an ordered relationship to the predictor variable(s). Magidson (25-28), inspired by the work of Leo Goodman (29,30), suggests the existence of a single regression model that can accommodate dependent variables of any metric: dichotomous, ordinal, or continuous. This supermodel holds true under the assumption of bivariate normality and under other distributional assumptions and subsumes linear regression and logistic regression as special cases (25). It uses a log odds model fit, and the odds ratio is obtained from the log(odds ratio). In the linear probability model, the coefficients (b_{i}) are partial regression coefficients. In the logit model the coefficients are partial log(odds-ratios).
The monotonic regression of Y on X is described by:
$$E(Y \mid X = x) = \sum_{j=1}^{J} P_{j.x}\, y_{j}$$
Where P_{j.x}, the conditional probability of the occurrence of Y=y_{j} (an ordinal dependent variable) given X=x (qualitative or quantitative predictor variables), is estimated from a sample of N observations using 2 steps.
1) Conditional logits Y_{j.x} are predicted using the generalized logit model, where Y_{j.x} is:
$$Y_{j.x} = a_{j} + (b_{1}x_{1*} + b_{2}x_{2*} + \dots + b_{M}x_{M*})\, y_{j*}, \quad j = 1, 2, \dots, J$$
The Y-scores, which determine the ordering and relative spacing of the J outcomes, may be specified or, if unspecified, they are treated as model parameters and estimated with the other parameters. The relative Y-score, y_{j*}, is the difference between y_{j} and some Y-reference score y_{0} defined as a weighted average of the original Y-scores.
2) The predicted logits are transformed to predicted probabilities using the identity:
$$P_{j.x} \equiv \frac{\exp(Y_{j.x})}{\sum_{k=1}^{J} \exp(Y_{k.x})}$$
For a given X=x, the generalized logit is defined as
$$Y_{j.x} \equiv \ln(P_{j.x}/P_{0.x})$$
where P_{j.x} is the conditional probability of the jth outcome occurring when X=x and
$$P_{0.x} = \prod_{j=1}^{J} (P_{j.x})^{e_{j}}$$
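The two steps can be sketched numerically: conditional logits are formed from the intercepts, the predictor coefficients and the relative Y-scores, then converted to probabilities, from which the expected Y-score follows. This is a minimal sketch and not the GOLDmineR implementation; the function names, the coefficient values, and the use of an unweighted mean as the Y-reference score are assumptions for illustration.

```python
import math

def conditional_probabilities(intercepts, coefficients, x, relative_scores):
    """Step 1: logits a_j + (sum of b_m x_m) * y_j*; step 2: normalize exp(logit)."""
    linear = sum(b * v for b, v in zip(coefficients, x))
    logits = [a + linear * ys for a, ys in zip(intercepts, relative_scores)]
    peak = max(logits)                      # subtracting the peak avoids overflow
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(probabilities, y_scores):
    """E(Y | X = x) as the probability-weighted sum of the Y-scores."""
    return sum(p * y for p, y in zip(probabilities, y_scores))

y_scores = [0.0, 1.0, 2.0]                        # three ordered outcomes
y_reference = sum(y_scores) / len(y_scores)       # assumed: equal weights
relative = [y - y_reference for y in y_scores]
intercepts = [0.0, 0.0, 0.0]                      # hypothetical a_j
coefficients = [0.8]                              # hypothetical b_1

p_low = conditional_probabilities(intercepts, coefficients, [0.0], relative)
p_high = conditional_probabilities(intercepts, coefficients, [2.0], relative)
e_low = expected_score(p_low, y_scores)
e_high = expected_score(p_high, y_scores)
```

With a positive coefficient, raising the predictor shifts probability toward the higher-scored outcomes, so the expected score increases monotonically with x.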
I performed a nonparametric regression using the universal regression program GOLDmineR, developed by Jay Magidson (25-28) at Statistical Innovations, Belmont, MA. The universal regression program is a logistic regression if the dependent variable is a binary outcome, and it is a polytomous regression if the dependent variable has more than two outcome categories, but it can also accommodate a paired comparison of covariates. The measures of association are phi and R^{2}. The measure of fit is L^{2} (chi square). The logarithmic form transforms into a probability model, which we aren’t concerned with here.
Graphical Ordinal Logit Display (GOLDminer)
I have mentioned the nonparametric universal regression of Magidson (25-28), based on work with log-linear modeling with Prof. Leo Goodman (29,30). The logistic regression and linear regression models can be viewed as special cases of this more general model. This regression model has greatest use for examining structure in data where the dependent variable has more than two outcome categories and the independent variables are scaled to intervals (25-28). The model is more general than logistic regression and is not constrained by the conditions encountered with logistic regression identified above.
I cite a number of publications of its use in clinical laboratory outcomes analysis.
Example 1. The association between predictors of nutrition risk and malnutrition risk
I use here data obtained by Linda Brugler and coworkers at St. Francis Hospital in Wilmington, DE (31) that examine the association between malnutrition assessed before intervention and three predictors of malnutrition risk. Poor oral intake and malnutrition-related diagnosis are categorical, and the laboratory-derived serum albumin is scaled to form an ordinal predictor. The strength of the predictors is given by Table VI:
Table VI. Ordinal regression model for combined 3 predictors of malnutrition risk.
The model is defined by the following: L^{2} = 267.68, R^{2} = 0.405, phi = 1.1134, df (3, 42), p = 9.7 × 10^{-58}.
Example 2: Ordinal regression for thalassemia risk
Table VII shows the odds-ratios for the combinatorial scaled results of Mentzer score (ratio of MCV: red cell count), MCV, and Hgb A_{2}(e)(by electrophoresis is higher than by HPLC). The presence of only a single positive test gives an unlikely result for thalassemia, while two or more positive tests give a high likelihood of thalassemia. This is summarized as follows: 0,0,0-0,0,1-0,1,0-1,0,0 = 0; 1,1,1-1,0,1-0,1,1-1,1,0 = 1.
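The summary rule, that two or more positive tests indicate a high likelihood of thalassemia, can be written out as a short sketch. The function name `thalassemia_class` and the 0/1 coding of each test are assumptions for illustration; the cutoffs that make each test positive are not given here and are not encoded.

```python
def thalassemia_class(mentzer, mcv, hgb_a2):
    """Each argument is 1 for a positive scaled test result and 0 otherwise."""
    return 1 if (mentzer + mcv + hgb_a2) >= 2 else 0

# Enumerate all eight combinations of the three scaled results.
patterns = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
classes = {p: thalassemia_class(*p) for p in patterns}
```

This reproduces the grouping in the text: the four patterns with at most one positive test map to 0 and the four with two or three positive tests map to 1.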
Table VIII. Expected Odds Ratios – Diagnosis Thalassemia
Example 3. Ordinal regression for risk of newborn respiratory distress syndrome
A study by Kaplan, Chapman and coworkers (32) extending work by Bernstein and Rundell (33) looked at the relationship between gestational age and RDS of the newborn and used the ordinal regression model to predict expected outcomes (33). Table IX gives probabilities for the prediction of risk.
Table IX. Probabilities of RDS given by gestational age and S/A ratio.
Example 4. Prediction of myocardial infarction risk by EKG and troponin T at 0.1 ng/ml
Bernstein, Zarich and Qamar (34) carried out a study in which the physicians were blinded to the troponin T results. A randomized prospective study of over 800 patients followed (35-37). The chest pain characteristics, EKG findings and troponin T results were reviewed for consecutive patients entered into the study (34). EKG results were scaled as: negative, nonspecific, 0; ST depression or T wave inversion, 1; ST elevation or new Q-wave, 2. Troponin T was scaled as follows: 0-0.075 ng/ml, 0; 0.076-0.099, 1; > 0.1. The diagnoses were as follows: noncardiac, cardiac and nonischemic, 1; unstable angina with MI ruled out, 2; non-ST or ST elevation MI, 3. Table X is the table of odds ratios and probabilities.
Table X. Ordinal regression of EKG and troponin T on diagnoses
Ovarian Cancer Survival
Rosman and Schwartz have reported a relationship between CA125 post-chemotherapy of ovarian carcinomatosis and serum half-life of CA125. We examined a published data set provided by Dr. Martin Rosman. Data were analyzed from 55 women who were treated at Yale University, had an evaluable CA_{125} half-life (t_{1/2}), and were followed for disease recurrence for at least 3 years. We modeled survival or remission for ovarian cancer using operative findings, stage, and CA_{125} half-life (46). Figure 9 is a plot of the CA125 elimination half-life vs the Kullback-Leibler distance using the data provided by Dr. Martin Rosman. The K-L distance is the difference between the total entropy of the data in which association is removed and the observed entropy for each value of CA125. The t_{1/2} is 10 days. What Rudolph and Bernstein (43) have referred to as effective information is the K-L distance. This was done to determine the value of CA125 that best predicts survival.
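The description of the K-L distance as an entropy difference can be illustrated on a small cross-classification. This sketch is my reading of that description, not the published calculation: it treats the data with association removed as the product of the row and column marginals, and the 2 x 2 table of proportions is made up.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits, skipping empty cells."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def kl_distance(joint):
    """Kullback-Leibler distance of the joint table from independence."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    total = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0:
                total += p * math.log2(p / (rows[i] * cols[j]))
    return total

# Hypothetical proportions: test result (rows) by outcome (columns).
joint = [[0.4, 0.1],
         [0.1, 0.4]]
rows = [sum(r) for r in joint]
cols = [sum(c) for c in zip(*joint)]

h_no_association = entropy(rows) + entropy(cols)     # association removed
h_observed = entropy([p for r in joint for p in r])  # observed joint entropy
kl = kl_distance(joint)
```

The two routes agree: the entropy with association removed minus the observed entropy equals the K-L distance, and the value is zero when the test result carries no information about the outcome.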
Figure 9 CA125 halflife
The next step was to carry out a Kaplan-Meier survival plot with Cox regression on the data vs the time to death or remission. A survival of 30 months is considered a cure. A shorter survival is considered a remission. Some patients died only shortly into chemotherapy. The study result is shown in Figure 10.
Figure 10. Kaplan Meier plot
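For readers unfamiliar with the Kaplan-Meier estimate, a minimal product-limit sketch follows. It covers only the survival curve, not the Cox regression, and the function name `kaplan_meier`, the follow-up times, and the event indicators are illustrative assumptions rather than the study data.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up in months; events: 1 = event observed, 0 = censored.
    Returns a list of (time, survival) pairs at each time with an event.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)   # events plus censored
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

months = [3, 6, 6, 12, 20, 30, 36]   # hypothetical follow-up times
event = [1, 1, 0, 1, 0, 1, 0]        # hypothetical event indicators
curve = kaplan_meier(months, event)
```

Censored patients contribute to the number at risk up to their last follow-up and then drop out without lowering the survival estimate, which is what distinguishes this estimate from a simple proportion surviving.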
We also examined the associations between OPERATIVE FINDINGS and CA_{125} to REMISSION and NONREMISSION or RELAPSE using a universal regression model under bivariate normality with estimation of generalized odds-ratios developed by Jay Magidson (Statistical Innovations, Inc., Belmont, MA). It uses a parallel log-odds model based on adjacent odds to describe the data. The universal regression is carried out after scaling the continuous variables with intervals we determined as follows: half-life 0-5, 6-10, 11-15, 16-20, >20. A cross-tabulation is constructed using the scaled variables as treatment vs. the effect (full remission, short remission or none), to obtain the frequency tabulation of treatment level vs remission, relapse or nonremission.
Table XI is a cross-tabulation of the observed and expected outcome frequencies in remission (rem), short remission (short,< 30 months) and non-remission (none) versus the scaled half-lives. Relapse and failure to achieve remission were combined into one outcome class. The means and standard error of the means (SEM) of half-life versus remission or non-remission/relapse are effectively separated (F=7.42, p < 0.01) as follows: Remission, 7.9, 2.8, [19]; Relapse/Non-remission, 17.4, 2.05, [36].
Table XII. Observed and expected odds and odds-ratios of remission, relapse and no response by half-life
Perspective for the Future
Linear regression has been used extensively for methods comparison and for quality control, based exclusively on distributional assumptions and distance from the center of the population sample. This is essential to analytical chemistry principles, but it has reached a limit. The last 30 years have seen the development of very powerful regression tools that are not dependent on distributional assumptions and that move the method into classification and prediction. The development of the Akaike Information Criterion (38-40) brought together two major disciplines that had separate developments, information theory and statistics. The work by Bernstein et al. (41-42) in predicting myocardial infarction using bivariate density estimation, and with the Kullback-Leibler distance (43, 44), an extension of work by Rypka (45), is closely related. The use of tables and the scaling of data has been the dominant approach to statistics that uses ordinal and categorical data in outcomes research. This has become a powerful method used in studies of placebo and drug effects. The approach is readily amenable to studies of laboratory tests and outcomes. Outcomes studies will be designed and carried out for laboratory tests that will ask questions appropriate for the clinical laboratory sciences, and that will not be subordinated to pharmaceutical evaluations, which currently have exclusion criteria that are inappropriate for laboratory investigations.
Summary
Regression has a long history in the development of modern science since the 18^{th} century. Regression has had a role in the emergence of physics, anthropology, psychology, and chemistry. But its development was initially tied to linear association and the assumption of a normal distribution. There are many associations that are tied to the frequency of discrete events. The use of chi-square as a measure of goodness of fit has such a tie to genetic analysis and to classification tables. The importance of outcomes management and the recognition of a multivariable data structure that needs to be explored lead us to a new domain of regression models, which includes an assumption that the dependent variable may not be known with certainty. This is the case with the emerging models known as mixture models, structural equation models and latent class models (LCM). This type of model is not traditionally a regression model and looks at defined variables and also unmeasured, hidden or latent variables (factors) in the model. However, there are factor analysis and regression forms of the LCM that are included in the LCM software releases of Statistical Innovations, Inc. (Latent Gold). This important subject is beyond the scope of this review, but Demidenko (47) has written an excellent text on the subject.
References:
19. Hoel PG. Elementary Statistics, Testing Hypotheses: The difference between two means. Chapter 3.3. pp133-117. 1960. Wiley, New York.
20. Hoel PG. Ibid. Regression. Chapter 9. pp141-153.
21. Norman GR, Streiner DL. Biostatistics: The Bare Essentials. Two repeated observations: The paired t-test and alternatives. Chapter 10. pp89-93. 2000, BC Decker, Hamilton, Ont., Canada.
22. Norman GR, Streiner DL. Ibid. Simple regression and correlation. Chapter 13. pp118-126.
23. Norman GR, Streiner DL. Ibid. Multiple regression. Chapter 14. pp127-137.
24. Norman GR, Streiner DL. Ibid. Logistic regression. Chapter 15. pp139-144.
25. Magidson J. “Multivariate Statistical Models for Categorical Data,” Chapters 3 & 4 in Bagozzi R, Advanced Methods of Marketing Research, Blackwell, 1994.
26. Magidson J. Introducing a new graphical method for the analysis of an ordered categorical response – Part I. Journal of Targeting, Measurement and Analysis for Marketing (UK). 1995; IV(2):133-148.
27. Magidson J. Introducing a new graphical model for the analysis of an ordered categorical response – Part II. Ibid. 1996;IV(3):214-227.
28. Magidson J. Maximum likelihood assessment of clinical trials based on an ordered categorical response. Drug information Journal. 1996;30:143-170.
29. Goodman LA. Simple models for the analysis of associations in cross-classifications having ordered categories. Journal of the American Statistical Association. 1979;74:537-552. Reprinted in The Analysis of Cross-Classified Data Having Ordered Categories. 1984, Harvard University Press.
30. Goodman LA. Association models and the bivariate normal for contingency tables with ordered categories. Biometrika 1981;68:347-355.
31. Brugler L, Stankovic AK, Schlefer M, Bernstein L. A simplified nutrition screen for hospitalized patients using readily available laboratory and patient information. Nutrition 2005;21:650-658.
32. Kaplan LA, Chapman JF, Bock JL, Santa Maria E, Clejan S, et al. Prediction of respiratory distress syndrome using the Abbott FLM-II amniotic fluid assay. Clin Chim Acta 2002;326[1-2]:61-68.
33. Bernstein LH, Stiller R, Menzies C, McKenzie M, Rundell C. Amniotic fluid polarization of fluorescence and lecithin/sphingomyelin ratio decision criteria assessed. Yale J Biol Med 1995; 68(2):101-117.
34. Bernstein LH, Qamar A, McPherson C, Zarich S. Evaluating a new graphical ordinal logit method (GOLDminer) in the diagnosis of myocardial infarction utilizing clinical features and laboratory data. Yale J Biol Med 1999; 72:259-268.
35. Bernstein L, Bradley K, Zarich S. GOLDmineR: Improving Models for Classifying Patients with Chest Pain. Yale J Biol Med 2002;75: 183-198.
36. Zarich S, Bradley K, Seymour J, Ghali W, Traboulsi A, et al. Impact of troponin T determinations on hospital resources and costs in the evaluation of patients with suspected myocardial ischemia. Amer J Cardiol 2001;88:732-6.
37. Zarich SW, Qamar AU, Werdmann MJ, Lizak LS, McPherson CA, Bernstein LH. Value of a single troponin T at the time of presentation as compared to serial CK-MB determinations in patients with suspected myocardial ischemia. Clin Chim Acta 2002;326:185-192.
38. Akaike H. Information theory and an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory. 1973, Akademiai Kiado, pp 267-281, Budapest.
39. Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 1974; 716-723.
40. Dayton CM. Information Criteria for the Paired-Comparisons Problem. American Statistician. 1998;52: 144-151.
41. Bernstein LH, Good IJ, Holtzman GI, Deaton ML, Babb J: Diagnosis of myocardial infarction from two enzyme measurements of creatine kinase isoenzyme MB with use of nonparametric probability estimation. Clin Chem 1989;35:444-7.
42. Bernstein LH, Good IJ, Holtzman GI, Deaton ML, and Babb J. Diagnosis of heart attack from two enzyme measurements by means of bivariate probability density estimation: statistical details. J Statistical Computation and Simulation. 1989.
43. Rudolph RA, Bernstein LH, Babb J. Information-induction for the diagnosis of myocardial infarction. Clin Chem 1988;34:2031-8.
44. Kullback S, Leibler RA. On information and sufficiency. Ann Mathematical Statistics 1951;22:79-86.
45. Rypka EW. Methods to evaluate and develop the decision process in the selection of tests. Clinics in Laboratory Med 1992;12[2]: 351-385.
46. Bernstein LH. Outcomes-based Decision Support: How to Link Laboratory Utilization to Clinical Endpoints. Chapter 8. Pp91-128. In Bissell MG, ed. Laboratory-Related Measures of Patient Outcomes: An Introduction. 2000. AACC Press. Washington, DC.
47. Demidenko E. Mixture models: Theory and applications. 2004. Wiley-Interscience. Hoboken, NJ.
48. Martin RF. General Deming regression for estimating systematic bias and confidence interval in method-comparison studies. Clin Chem 2000;46:100-104.
49. Magidson J. Opportunities grow on trees. A general alternative to linear regression. Monotonic regression of dichotomous, ordinal and grouped continuous dependent variables. 1998. Statistical Innovations, Inc. Belmont, MA.
Figures and Tables Version 8 Multivariable
Table I. Regression of eGFR and hemoglobin to predict Nt-proBNP
Step number | : | 0 |
R | : | 0.376 |
R-square | : | 0.141 |
In | Effect | Coefficient | Standard Error | Std. Coefficient | Tolerance | df | F-ratio | p-value |
1 | Constant | |||||||
2 | eGFR | -83.499 | 14.063 | -0.297 | 0.951 | 1 | 35.256 | 0.000 |
3 | Hgb | -910.224 | 260.436 | -0.175 | 0.951 | 1 | 12.215 | 0.001 |
Information Criteria
AIC | 7785.028 |
AIC (Corrected) | 7785.139 |
Schwarz’s BIC | 7800.628 |
Dependent Variable | NTproBNP (pg/ml) |
N | 365 |
Multiple R | 0.376 |
Squared Multiple R | 0.141 |
Adjusted Squared Multiple R | 0.137 |
Standard Error of Estimate | 10287.156 |
Analysis of Variance
Source | SS | df | Mean Squares | F-ratio | p-value |
Regression | 6.309E+009 | 2 | 3.155E+009 | 29.809 | 0.000 |
Residual | 3.831E+010 | 362 | 1.058E+008 |
Table II. Linear regression of NKLog(Nt-proBNP)/eGFR by eGFR and hemoglobin
Log transform flattens the high Nt-proBNP scale and eGFR and age are normalized
R | : | 0.597 |
R-square | : | 0.357 |
In | Effect | Coefficient | Standard Error | Std. Coefficient | Tolerance | df | F-ratio | p-value |
1 | Constant | |||||||
2 | eGFR | -1.873 | 0.144 | -0.573 | 0.933 | 1 | 170.011 | 0.000 |
3 | Hgb | -4.259 | 2.436 | -0.077 | 0.933 | 1 | 3.056 | 0.081 |
Information Criteria
AIC | 4299.786 |
AIC (Corrected) | 4299.899 |
Schwarz’s BIC | 4315.331 |
Dependent Variable | NKLogNTGFR |
N | 360 |
Multiple R | 0.597 |
Squared Multiple R | 0.357 |
Adjusted Squared Multiple R | 0.353 |
Standard Error of Estimate | 94.260 |
Regression Coefficients B = (X’X)^{-1}X’Y
Effect | Coefficient | Standard Error | Std. Coefficient | Tolerance | t | p-value |
CONSTANT | 256.151 | 27.745 | 0.000 | . | 9.232 | 0.000 |
MDRD_GFR | -1.873 | 0.144 | -0.573 | 0.933 | -13.039 | 0.000 |
Hgb | -4.259 | 2.436 | -0.077 | 0.933 | -1.748 | 0.081 |
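The coefficient formula B = (X'X)^{-1}X'Y given with Table II can be illustrated with a minimal sketch in plain Python. The data below are hypothetical and noise-free (they are not the study data), and the helper names `transpose`, `matmul`, `solve` and `ols` are introduced here for illustration only. Rather than forming the inverse explicitly, the sketch solves the normal equations (X'X)B = X'Y, which gives the same B.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def ols(x_rows, y):
    """Return B from (X'X) B = X'Y; a column of ones is added for the constant."""
    design = [[1.0] + list(row) for row in x_rows]
    xt = transpose(design)
    xtx = matmul(xt, design)
    xty = [sum(a * b for a, b in zip(col, y)) for col in xt]
    return solve(xtx, xty)

# Hypothetical, noise-free data generated from y = 250 - 2*x1 - 4*x2.
x_rows = [(30, 9), (45, 11), (60, 12), (75, 10), (90, 13), (105, 14)]
y = [250 - 2 * x1 - 4 * x2 for x1, x2 in x_rows]
coefficients = ols(x_rows, y)
print([round(b, 6) for b in coefficients])
```

In the table itself, each t value is the coefficient divided by its standard error (for Hgb, -4.259 / 2.436 is about -1.748), and the F-ratio reported for a single effect is the square of that t.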
Table III. One-way ANOVA of D-dimer for positive and negative scans
Dependent Variable | D_DIMER |
N | 817 |
Analysis of Variance
Source | Type III SS | df | Mean Squares | F-ratio | p-value |
VENDUP | 43456570.851 | 1 | 43456570.851 | 68.278 | 0.000 |
Error | 5.187E+008 | 815 | 636461.763 |
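The F-ratio in Table III is the mean square between scan groups divided by the error mean square. A minimal sketch of how such a table is built from raw values follows; the two small groups are hypothetical numbers chosen for easy arithmetic, not the D-dimer data, and `one_way_anova` is a name introduced here.

```python
def one_way_anova(groups):
    """Return (ss_between, ss_within, df_between, df_within, f_ratio)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_ratio

# Hypothetical values for negative and positive scans.
negative = [200.0, 300.0, 400.0]
positive = [600.0, 800.0, 1000.0]
ss_b, ss_w, df_b, df_w, f_example = one_way_anova([negative, positive])

# The same ratio applied to the mean squares printed in Table III.
f_table_iii = 43456570.851 / 636461.763
print(f_example, round(f_table_iii, 3))
```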
Table IV. Discriminant function for CHF, renal insufficiency and anemia by age, NT-proBNP, creatinine and hemoglobin
Group Frequencies | ||
0 | 1 | 2 |
135 | 335 | 235 |
Group Means | |||
0 | 1 | 2 | |
NTproBNP (pg/ml) | 1516.369 | 5964.054 | 12902.662 |
Creatinine | 0.716 | 1.654 | 2.103 |
Hgb | 11.972 | 11.533 | 11.305 |
Age | 60.570 | 71.373 | 74.966 |
Between Groups F-matrix df : 4 699 |
|||
0 | 1 | 2 | |
0 | 0.000 | ||
1 | 23.445 | 0.000 | |
2 | 45.108 | 11.788 | 0.000 |
Wilks’s Lambda
Lambda | : | 0.778 |
df | : | (4,2,702) |
Approx. F-ratio | : | 23.337 |
df | : | (8,1398) |
p-value | : | 0.000 |
Classification Functions | |||
0 | 1 | 2 | |
CONSTANT | -32.018 | -35.196 | -37.394 |
Variable | F-to-remove | Tolerance |
5 | NTproBNP (pg/ml) | 13.489 | 0.801 |
6 | Creatinine | 21.368 | 0.799 |
7 | Hgb | 0.190 | 0.928 |
3 | Age | 38.632 | 0.948 |
Test Statistic | |||||
Statistic | Value | Approx. F-ratio | df | p-value |
Wilks’s Lambda | 0.778 | 23.337 | 8 | 1398 | 0.000 |
Pillai’s Trace | 0.226 | 22.295 | 8 | 1400 | 0.000 |
Lawley-Hotelling Trace | 0.279 | 24.382 | 8 | 1396 | 0.000 |
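The approximate F-ratio reported for Wilks's Lambda can be recovered from Lambda and the degrees of freedom. With three groups the hypothesis has 2 degrees of freedom, and the standard transformation for that case is F = [(1 - sqrt(Lambda)) / sqrt(Lambda)] * (v_e - p + 1) / p with (2p, 2(v_e - p + 1)) degrees of freedom, where p is the number of variables and v_e the error degrees of freedom. The sketch below assumes this is the transformation the software used; the function name is introduced here for illustration.

```python
import math

def wilks_to_f_three_groups(wilks_lambda, n_variables, n_cases):
    """F transformation of Wilks's Lambda for exactly three groups."""
    error_df = n_cases - 3
    root = math.sqrt(wilks_lambda)
    scale = (error_df - n_variables + 1) / n_variables
    f_ratio = (1 - root) / root * scale
    df1 = 2 * n_variables
    df2 = 2 * (error_df - n_variables + 1)
    return f_ratio, df1, df2

# Table IV: Lambda = 0.778, 4 variables, 135 + 335 + 235 = 705 cases.
f_iv, df1_iv, df2_iv = wilks_to_f_three_groups(0.778, 4, 705)

# Table V: Lambda = 0.671, 2 variables, 221 + 631 + 571 = 1423 cases.
f_v, df1_v, df2_v = wilks_to_f_three_groups(0.671, 2, 1423)

print(round(f_iv, 2), df1_iv, df2_iv)
print(round(f_v, 2), df1_v, df2_v)
```

The degrees of freedom match the tables exactly; the F values agree to within the rounding of Lambda to three decimals.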
Table V. The DFA calculations for Figure 9.
Group Frequencies | |||
0 | 1 | 2 | |
221.000 | 631.000 | 571.000 |
Means | |||
NKLgNTproGFRe | 15.589 | 55.971 | 81.159 |
MDRD | 123.130 | 61.940 | 48.748 |
Group 0 Discriminant Function Coefficients | |||
NormKLgNTproGFRe | MDRDest | Constant |
NKLgNTproGFRe | -0.015 | ||
MDRD | -0.001 | 0.000 | |
Constant | 0.588 | 0.052 | -15.590 |
Group 1 Discriminant Function Coefficients | |||
NormKLgNTproGFRe | MDRDest | Constant |
NKLgNTproGFRe | 0.000 | ||
MDRD | 0.000 | -0.001 | |
Constant | 0.024 | 0.089 | -12.106 |
Group 2 Discriminant Function Coefficients | |||
NormKLgNTproGFRe | MDRDest | Constant |
NKLgNTproGFRe | 0.000 | ||
MDRD | 0.000 | -0.001 | |
Constant | 0.015 | 0.147 | -13.077 |
Between Groups F-matrix df : 2 1419 |
|||
0 | 1 | 2 | |
0 | 0.000 | ||
1 | 236.650 | 0.000 | |
2 | 335.228 | 21.342 | 0.000 |
Wilks’s Lambda for the Hypothesis
Lambda | : | 0.671 |
df | : | (2,2,1420) |
Approx. F-ratio | : | 156.542 |
df | : | (4,2838) |
p-value | : | 0.000 |
Classification Matrix (Cases in row categories classified into columns) | ||||
0 | 1 | 2 | %correct | |
0 | 206 | 15 | 0 | 93 |
1 | 237 | 363 | 31 | 58 |
2 | 69 | 459 | 43 | 8 |
Total | 512 | 837 | 74 | 43 |
Jackknifed Classification Matrix | ||||
0 | 1 | 2 | %correct | |
0 | 205 | 16 | 0 | 93 |
1 | 237 | 363 | 31 | 58 |
2 | 69 | 462 | 40 | 7 |
Total | 511 | 841 | 71 | 43 |
Test Statistic | |||||
Statistic | Value | Approx. F-ratio | df | p-value |
Wilks’s Lambda | 0.671 | 156.542 | 4 | 2838 | 0.000 |
Pillai’s Trace | 0.330 | 140.347 | 4 | 2840 | 0.000 |
Lawley-Hotelling Trace | 0.488 | 173.026 | 4 | 2836 | 0.000 |
Canonical Discriminant Functions | ||
1 | 2 | |
Constant | -1.912 | -1.075 |
NKLgNTproGFRe | 0.001 | 0.009 |
MDRD | 0.028 | 0.008 |
Canonical Discriminant Functions : Standardized by Within Variances | ||
1 | 2 | |
NKLgNTproGFRe | 0.085 | 1.061 |
MDRD | 1.026 | 0.284 |
Canonical Scores of Group Means | ||
1 | 2 | |
0 | 1.576 | 0.034 |
1 | -0.122 | -0.069 |
2 | -0.476 | 0.063 |
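The canonical scores of the group means in Table V follow from applying the unstandardized canonical discriminant functions to each group's mean values. A minimal sketch is given below; the dictionary layout and the function name `canonical_score` are introduced here for illustration. Because the printed coefficients are rounded to three decimals, the recomputed scores agree with the table only to about two decimals.

```python
# Unstandardized canonical discriminant function coefficients from Table V.
functions = {
    1: {"constant": -1.912, "NKLgNTproGFRe": 0.001, "MDRD": 0.028},
    2: {"constant": -1.075, "NKLgNTproGFRe": 0.009, "MDRD": 0.008},
}

# Group means from Table V.
group_means = {
    0: {"NKLgNTproGFRe": 15.589, "MDRD": 123.130},
    1: {"NKLgNTproGFRe": 55.971, "MDRD": 61.940},
    2: {"NKLgNTproGFRe": 81.159, "MDRD": 48.748},
}

def canonical_score(function, case):
    """Constant plus the coefficient-weighted sum of the case's variables."""
    return function["constant"] + sum(function[name] * value for name, value in case.items())

scores = {
    group: tuple(canonical_score(functions[f], means) for f in (1, 2))
    for group, means in group_means.items()
}

for group, (first, second) in scores.items():
    print(group, round(first, 3), round(second, 3))
```

The same function scores an individual case: pass that case's two measurements in place of the group means.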
Table VI. Ordinal regression model for combined 3 predictors of malnutrition risk.
Predictor | L^{2} | p | exp(beta) |
Poor oral intake | 60.29 | 8.2e-15 | 5.3 |
Malnutrition related condition | 46.29 | 1.0e-11 | 3.06 |
Albumin | 152.01 | 6.3e-35 | 3.16 |
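The p-values in Table VI appear consistent with referring each likelihood-ratio statistic L^{2} to a chi-square distribution with 1 degree of freedom. That degree of freedom is an assumption, since the table does not list it. The sketch below uses the closed-form upper tails for 1 and 2 degrees of freedom, so it needs only the standard library; the function name is introduced here.

```python
import math

def chi_square_upper_tail(statistic, df):
    """Upper-tail probability of a chi-square statistic for df = 1 or df = 2."""
    if df == 1:
        # With 1 df the statistic is a squared standard normal deviate.
        return math.erfc(math.sqrt(statistic / 2))
    if df == 2:
        # With 2 df the chi-square is exponential with mean 2.
        return math.exp(-statistic / 2)
    raise ValueError("this sketch only covers df = 1 and df = 2")

# L^{2} values from Table VI, each assumed to carry 1 degree of freedom.
p_intake = chi_square_upper_tail(60.29, 1)
p_condition = chi_square_upper_tail(46.29, 1)
p_albumin = chi_square_upper_tail(152.01, 1)

# The model row of Table IX lists L2 = 206.52 with df = 2.
p_model_ix = chi_square_upper_tail(206.52, 2)

print(f"{p_intake:.1e} {p_condition:.1e} {p_albumin:.1e} {p_model_ix:.1e}")
```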
Table VII. Expected Odds Ratios – Diagnosis Thalassemia
Odds-Ratios
Me,M,A_{2}(e) Thalassemia
1,1,1 9713
1,1,0 1696
1,0,1 263
0,1,1 212
1,0,0 46
0,1,0 37
0,0,1 6
0,0,0 1
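The expected odds ratios in Table VII behave approximately multiplicatively: the value for a combination of positive indicators is close to the product of the values for each indicator alone, as expected when the effects are additive on the log-odds scale. The sketch below checks this against the tabulated values. It is an interpretation of the table rather than the fitting procedure itself, and the single-indicator values are rounded, so agreement is only to within a few percent.

```python
# Expected odds ratios copied from Table VII, keyed by the three binary indicators.
table_vii = {
    (1, 1, 1): 9713, (1, 1, 0): 1696, (1, 0, 1): 263, (0, 1, 1): 212,
    (1, 0, 0): 46, (0, 1, 0): 37, (0, 0, 1): 6, (0, 0, 0): 1,
}

# Odds ratios for each indicator on its own.
single_effects = [table_vii[(1, 0, 0)], table_vii[(0, 1, 0)], table_vii[(0, 0, 1)]]

def product_of_single_effects(pattern):
    """Multiply the single-indicator odds ratios for every indicator that is set."""
    result = 1.0
    for flag, odds_ratio in zip(pattern, single_effects):
        if flag:
            result *= odds_ratio
    return result

# Relative gap between the product and the tabulated value for each pattern.
relative_gaps = {
    pattern: abs(product_of_single_effects(pattern) - value) / value
    for pattern, value in table_vii.items()
}

for pattern, gap in sorted(relative_gaps.items(), reverse=True):
    print(pattern, round(product_of_single_effects(pattern)), f"{gap:.1%}")
```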
Table VIII. Probabilities of RDS given by gestational age and S/A ratio.
Dependent variable: Respiratory outcome (Resp_Sca)
Predictors: Surfactant to albumin (S/A) Ratio_45: 0, > 45; 1, 21-44; 2, < 21;
Gestational age at delivery: 0, > 36; 1, 34-36; 2, < 34.
S/A Ratio_45 p = 8.7*10^{-22}
Gestational Age at Delivery Scaled p = 4.2*10^{-9}
Combined variables: ChiSq = 130.14, p = 5.1*10^{-28}, R^{2} = 0.433, phi = 0.8231, exp(beta) = 2.16 (S/A), 1.88 (GA)
Definition (S/A, GA) | Exp. Probabilities | Exp. Odds-Ratios |
0-20, < 34 | 0.84 | 4427 |
0-20, 34-36 | 0.64 | 668 |
21-44, < 34 | 0.57 | 441 |
0-20, > 36 | 0.31 | 101 |
21-44, 34-36 | 0.25 | 67 |
> 45, < 34 | 0.19 | 44 |
21-44, > 36 | 0.06 | 10 |
> 45, 34-36 | 0.04 | 7 |
> 45, > 36 | 0.01 | 1 |
Table IX. Ordinal regression of EKG and troponin T on diagnoses
Association Summary L² df p-value R² phi
Explained by Model 206.52 2 1.4e-45 0.686 1.3856
Residual 48.64 14 1.0e-5
Total 255.16 16 4.5e-45
Odds Ratios and probabilities for diagnoses
average 1 2 0 1 2
score 0.00 0.00
2,3 2.87 466.82 10086.03 0.01 0.11 0.88
2,2 2.67 105.78 1087.95 0.04 0.20 0.75
1,3 2.64 95.35 931.05 0.05 0.21 0.74
2,1 1.95 23.97 117.35 0.26 0.27 0.47
1,2 1.87 21.61 100.43 0.29 0.26 0.45
0,3 1.79 19.48 85.95 0.32 0.26 0.42
1,1 0.67 4.90 10.83 0.73 0.15 0.12
0,2 0.61 4.41 9.27 0.75 0.14 0.11
0,1 0.12 1.00 1.00 0.95 0.04 0.01
Table X. Observed and expected odds and odds-ratios of remission, relapse and no response by half-life
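Half-life (range, days) | Exp. odds: Rem | Exp. odds: short | Exp. odds: none | Exp. odds-ratio: Rem | Exp. odds-ratio: short | Exp. odds-ratio: none |
> 20 | 1 | 4.16 | 17.11 | 1 | 12.49 | 56.07 |
16-20 | 1 | 2.21 | 4.84 | 1 | 6.64 | 44.16 |
11-15 | 1 | 1.18 | 1.37 | 1 | 3.53 | 12.49 |
6-10 | 1 | 0.63 | 0.39 | 1 | 1.88 | 3.53 |
< 6 | 1 | 0.33 | 0.11 | 1 | 1 | 1 |
HL-ref | 1 | 0.33 | 0.11 | 1 | 1 | 1 |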
Figure 1. log_NT-proBNP vs eGFR
Figure 2. Boxplots of NT-proBNP and WHO criteria
Figure 3. NT-proBNP vs Hb
Figure 4. 3D plot of NT-proBNP, MDRD eGFR, Hb
Figure 5. 3D plot of Normalized K*Log_NTproBNP/eGFR, eGFR, Hb
Figure 6. D-dimer Confidence Intervals vs Imaging
Figures 7 & 8. Canonical Scores Plots
Figure 9. Entropy plot of CA125 half-life (x) vs effective information (Kullback entropy), showing a sharp drop in entropy at 10 days (equivalent to information added to resolve uncertainty). As developed by Rosser R Rudolph.
Figure 10. Kaplan-Meier plot of CA125 half-life vs survival in ovarian cancer