# Multiple (General) Linear Regression

Menu location: **Analysis_Regression and Correlation_Multiple Linear**.

This is a generalised regression function that fits a linear model of an outcome to one or more predictor variables.

The term multiple regression applies to linear prediction of one outcome from several predictors. The general form of a linear regression is:

Y' = b_{0} + b_{1}x_{1} + b_{2}x_{2} + ... + b_{k}x_{k}

- where Y' is the predicted outcome value for the linear model with regression coefficients b_{1 to k} and Y intercept b_{0} when the values for the predictor variables are x_{1 to k}. The regression coefficients are analogous to the slope of a simple linear regression.
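
As a small illustration of the general form above, the following sketch evaluates Y' for one set of predictor values. The function name `predict` is hypothetical and is not part of StatsDirect; the coefficients are those reported for the worked example later on this page, and the predictor values (2.0 and 66) are arbitrary.

```python
def predict(intercept, coefficients, x_values):
    """Return Y' = b0 + b1*x1 + ... + bk*xk for one observation."""
    if len(coefficients) != len(x_values):
        raise ValueError("need one x value per regression coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, x_values))

# Coefficients reported for the worked example on this page (b0, then b1 and b2).
b0 = 23.010668
b = [23.638558, -0.714675]

# Arbitrary illustrative predictor values: x1 = 2.0, x2 = 66.
y_hat = predict(b0, b, [2.0, 66.0])
print(round(y_hat, 3))
```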

Regression assumptions:

- Y is linearly related to the combination of x or transformations of x
- deviations from the regression line (residuals) follow a normal distribution
- deviations from the regression line (residuals) have uniform variance

__Classifier predictors__

If one of the predictors in a regression model classifies observations into more than two classes (e.g. blood group) then you should consider splitting it into separate dichotomous variables as described under dummy variables.
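
One common way to do this split is reference coding: choose one class as the reference and create a 0/1 indicator for each remaining class. The sketch below assumes that convention; the function name `dummy_variables` and the sample blood groups are made up for illustration.

```python
def dummy_variables(values, reference):
    """Return one 0/1 indicator list per class other than the reference class."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical observations of a four-class predictor.
blood_group = ["A", "B", "O", "AB", "O", "A"]

# Group O is the reference class, so it is coded as all zeros.
dummies = dummy_variables(blood_group, reference="O")
for level, column in dummies.items():
    print(level, column)
```

A predictor with k classes yields k-1 indicator columns under this convention, which avoids an exact linear dependence with the intercept.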

__Influential data and residuals__

A residual for a Y point is the difference between the observed and fitted value for that point, i.e. it is the distance of the point from the fitted regression line. If the pattern of residuals changes along the regression line then consider using rank methods or linear regression after an appropriate transformation of your data.

The influential data option in StatsDirect gives an analysis of residuals and allows you to save the residuals and their associated statistics to a workbook. It is good practice to examine a scatter plot of the residuals against fitted Y values. You might also wish to inspect a normal plot of the residuals and perform a Shapiro-Wilk test to look for evidence of non-normality.

Standard error for the predicted Y, leverage h_{i} (the ith diagonal element of the hat matrix, X(X'X)⁻¹X'), Studentized residuals, jackknife residuals, Cook's distance and DFIT are also given with the residuals. For further information on analysis of residuals please see Belsley et al. (1980), Kleinbaum et al. (1998) or Draper and Smith (1998).
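
The formulas for these statistics are not reproduced on this page. The standard textbook definitions, written in the notation of the list that follows, are sketched here; treat them as a reconstruction rather than a statement of StatsDirect's exact implementation.

```latex
\begin{aligned}
r_i &= \frac{e_i}{s\sqrt{1-h_i}} \\
s_{-i}^2 &= \frac{(n-p)\,s^2 - e_i^2/(1-h_i)}{n-p-1} \\
r_{-i} &= \frac{e_i}{s_{-i}\sqrt{1-h_i}} \\
d_i &= \frac{r_i^2}{p}\cdot\frac{h_i}{1-h_i} \\
\mathrm{DFIT}_i &= r_{-i}\sqrt{\frac{h_i}{1-h_i}}
\end{aligned}
```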

- where p is the number of parameters in the model, n is the number of observations, e_{i} is a residual, r_{i} is a Studentized residual, r_{-i} is a jackknife residual, s² is the residual mean square, s_{-i}² is an estimate of s² after deletion of the ith residual, h_{i} is the leverage (ith diagonal element of the hat matrix), d_{i} is Cook's distance and DFIT_{i} is DFFITS.

- Studentized residuals have a t distribution with n-p degrees of freedom.
- Jackknife residuals have a t distribution with n-p-1 degrees of freedom. Note that SAS refers to these as "rstudent".
- If leverage (h_{i}) is larger than the minimum of 3p/n and 0.99 then the ith observation has unusual predictor values.
- Unusual predicted as opposed to predictor values are indicated by large residuals.
- Cook's distance and DFIT each combine predicted and predictor factors to measure overall influence.
- Cook's distance is unusually large if it exceeds the critical value F(α, p, n-p) from the F distribution with p and n-p degrees of freedom.
- DFIT is unusually large if its absolute value exceeds 2√(p/n).
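
The diagnostics listed above can be sketched with NumPy as follows. This is a minimal illustration using the standard textbook definitions, not StatsDirect's code; the function name `influence_statistics` and the simulated data are assumptions made for the example.

```python
import numpy as np

def influence_statistics(X, y):
    """Residual diagnostics for an ordinary least squares fit with an intercept.

    X is an (n, k) array of predictors; the intercept column is added here.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1]                      # parameters, including the intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    e = y - design @ coef                    # residuals e_i
    xtx_inv = np.linalg.inv(design.T @ design)
    # Leverage h_i: diagonal of the hat matrix X (X'X)^-1 X'
    h = np.einsum("ij,jk,ik->i", design, xtx_inv, design)
    s2 = e @ e / (n - p)                     # residual mean square
    s2_deleted = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    studentized = e / np.sqrt(s2 * (1 - h))
    jackknife = e / np.sqrt(s2_deleted * (1 - h))
    cook = studentized**2 / p * h / (1 - h)
    dfit = jackknife * np.sqrt(h / (1 - h))
    return {"n": n, "p": p, "leverage": h, "studentized": studentized,
            "jackknife": jackknife, "cook": cook, "dfit": dfit}

# Simulated data with one planted outlying outcome value.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=30)
y[0] += 10.0

stats = influence_statistics(X, y)
n, p = stats["n"], stats["p"]
# Apply the rules of thumb from the list above.
high_leverage = np.flatnonzero(stats["leverage"] > min(3 * p / n, 0.99))
large_dfit = np.flatnonzero(np.abs(stats["dfit"]) > 2 * np.sqrt(p / n))
print("unusual predictor values at rows:", high_leverage)
print("large DFIT at rows:", large_dfit)
```

A useful sanity check on any implementation is that the leverages sum to p, the number of parameters.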

__Collinearity__

The degree to which the x variables are correlated, and thus predict one another, is collinearity. If collinearity is so high that some of the x variables almost totally predict other x variables then this is known as multicollinearity. In such cases, the analysis of variance for the overall model may show a highly significant fit when, paradoxically, the tests for individual predictors are non-significant.

Multicollinearity causes problems in using regression models to draw conclusions about the relationships between predictors and outcome. An individual predictor's P value may test non-significant even though it is important. Confidence intervals for regression coefficients in a multicollinear model may be very wide, and tiny changes in individual observations can have a large effect on the coefficients, sometimes reversing their signs.

StatsDirect gives variance inflation factor (and the reciprocal, tolerance) as a measure of collinearity.

VIF_{i} = 1 / (1 - R_{i}²)

- where VIF_{i} is the variance inflation factor for the ith predictor and R_{i} is the multiple correlation coefficient when the ith predictor is taken as the outcome predicted by the remaining x variables. Tolerance is 1 / VIF_{i} = 1 - R_{i}².
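
The calculation described above can be sketched by regressing each predictor on the others. The function name `variance_inflation_factors` and the simulated predictors are hypothetical; x2 is deliberately built as a near copy of x1 so that both show a large VIF.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (n observations by k predictors)."""
    n, k = X.shape
    vifs = []
    for i in range(k):
        target = X[:, i]
        # Regress predictor i on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))        # VIF_i = 1 / (1 - R_i^2)
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)       # almost collinear with x1
x3 = rng.normal(size=100)                   # unrelated predictor
X = np.column_stack([x1, x2, x3])

vif = variance_inflation_factors(X)
tolerance = 1.0 / vif
print(np.round(vif, 2))
```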

If you detect multicollinearity you should aim to clarify the cause and remove it. For example, an unnecessary collinear predictor might be dropped from your model, or predictors might meaningfully be combined, e.g. combining height and weight as body mass index. Increasing your sample size reduces the impact of multicollinearity. More complex statistical techniques exist for dealing with multicollinearity, but these should be used under the expert guidance of a Statistician.

Chatterjee et al. (2000) suggest that multicollinearity is present if the mean VIF is considerably larger than 1 and the largest VIF is greater than 10. Others choose larger values; StatsDirect uses 20 as the threshold for marking the result with a red asterisk.

__Prediction and adjusted means__

The prediction option allows you to calculate values of the outcome (Y) using your fitted linear model coefficients with a specified set of values for the predictors (X1…p). A confidence interval and a prediction interval (Altman, 1991) are given for each prediction.

The default X values shown are those required to calculate the least squares mean for the model, which is the mean of Y adjusted for all X. For continuous predictors the mean of X is used. For categorical predictors you should use X as 1/k, where k is the number of categories. StatsDirect attempts to identify categorical variables but you should check the values against these rules if you are using categorical predictors in this way.

For example, if a model of Y = systolic blood pressure, X1 = sex, X2 = age was fitted, and you wanted to know the age and sex adjusted mean systolic blood pressure for the population that you sampled, you could use the prediction function to give the least squares mean as the answer, i.e. with X1 as 0.5 and X2 as mean age. If you wanted to know the mean systolic blood pressure for males, adjusted for age then you would set X1 to 1 (if male sex is coded as 1 in your data).
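
The blood pressure example above can be sketched with simulated data. Everything here is hypothetical: the variable names, the simulated values and the coding of male sex as 1 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
sex = rng.integers(0, 2, size=n).astype(float)   # assumed coding: 1 = male, 0 = female
age = rng.uniform(30, 70, size=n)
sbp = 100 + 5 * sex + 0.6 * age + rng.normal(0, 8, size=n)

# Fit Y = b0 + b1*sex + b2*age by least squares.
design = np.column_stack([np.ones(n), sex, age])
b, *_ = np.linalg.lstsq(design, sbp, rcond=None)

def predict(x_sex, x_age):
    """Predicted systolic blood pressure for given sex and age values."""
    return b[0] + b[1] * x_sex + b[2] * x_age

# Least squares mean: sex set to 1/k = 0.5, age set to its mean.
lsmean = predict(0.5, age.mean())
# Age-adjusted means for each sex.
male_adjusted = predict(1.0, age.mean())
female_adjusted = predict(0.0, age.mean())
print(round(lsmean, 2), round(male_adjusted, 2), round(female_adjusted, 2))
```

With X1 at 0.5 the least squares mean is the midpoint of the two age-adjusted sex-specific means, whatever the proportion of males in the sample.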

__Partial correlation__

The partial correlation coefficient for a predictor X_{k} describes the relationship of Y and X_{k} when all other X are held fixed, i.e. the correlation of Y and X_{k} after taking away the effects of all other predictors in the model. The r statistic displayed with the main regression results is the partial correlation. It is calculated from the t statistic for the predictor as:

r = t / √(t² + residual DF)

- where residual DF is the residual degrees of freedom from the analysis of variance for the regression.
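
As a numerical check, the sketch below assumes the standard identity between a t statistic and a partial correlation. It uses the t values and the 50 residual degrees of freedom reported in the worked example on this page; the function name `partial_r_from_t` is hypothetical.

```python
import math

def partial_r_from_t(t, residual_df):
    """Partial correlation implied by a coefficient's t statistic:
    r = t / sqrt(t^2 + residual degrees of freedom)."""
    return t / math.sqrt(t * t + residual_df)

# t statistics for X1 and X2 from the worked example (residual DF = 50).
r_x1 = partial_r_from_t(3.45194, 50)
r_x2 = partial_r_from_t(-2.371006, 50)
print(round(r_x1, 4), round(r_x2, 4))
```

These agree with the r values of 0.438695 and -0.317915 shown in the example output.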

__Multiple correlation__

The multiple correlation coefficient (R) is Pearson's product moment correlation between the predicted values and the observed values (Y' and Y). Just as r² is the proportion of the total variance (s²) of Y that can be explained by the linear regression of Y on x, R² is the proportion of the variance explained by the multiple regression. The significance of R is tested by the F statistic of the analysis of variance for the regression.

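An adjusted value of R² is given as Ra²:

Ra² = 1 - (1 - R²)(n - 1) / (n - p)

- where n is the number of observations and p is the number of parameters in the model, including the intercept.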

The adjustment allows comparison of Ra² between different regression models by compensating for the fact that R² is bound to increase with the number of predictors in the model.
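
The relationships between R, R², Ra² and F can be checked against the analysis of variance table in the worked example on this page. The sketch below uses the sums of squares and degrees of freedom from that table; the variable names are made up for illustration.

```python
# Figures from the analysis of variance table in the worked example.
ss_regression = 2783.220444
ss_residual = 11007.949367
df_regression = 2
df_residual = 50

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual       # n - 1

# R squared: proportion of the total sum of squares explained by the regression.
r_squared = ss_regression / ss_total
multiple_r = r_squared ** 0.5
# Adjusted R squared compares mean squares rather than sums of squares.
adjusted_r_squared = 1 - (ss_residual / df_residual) / (ss_total / df_total)
# Variance ratio for the overall regression.
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)

print(round(multiple_r, 6), round(r_squared, 6),
      round(adjusted_r_squared, 6), round(f_statistic, 6))
```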

The Durbin Watson test statistic can be used to test for certain types of serial correlation (autocorrelation). For example, if a critical instrument gradually drifted off scale during collection of the outcome variable then there would be correlations due to the drift; in time ordered data these may be detected by autocorrelation tests such as Durbin Watson. See Draper and Smith (1998) for more information, including critical values of the test statistic.
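
The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below assumes residuals supplied in time order; the function name and the two toy residual series are hypothetical.

```python
def durbin_watson(residuals):
    """d = sum over t of (e_t - e_{t-1})^2, divided by the sum of e_t^2."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# A steady drift gives positive serial correlation (d well below 2).
drifting = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
# Alternating signs give negative serial correlation (d well above 2).
alternating = [1.0, -1.0, 1.0, -1.0]

d_drifting = durbin_watson(drifting)
d_alternating = durbin_watson(alternating)
print(round(d_drifting, 4), d_alternating)
```

The statistic lies between 0 and 4; values near 2, such as the 1.888528 in the example on this page, are consistent with little first-order serial correlation.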

__Automatic selection of predictors__

There are a number of methods for selecting a subset of predictors that produce the "best" regression. Many statisticians discourage general use of these methods because they can detract from the real-world importance of predictors in a model. Examples of predictor selection methods are step-up selection, step-down selection, stepwise regression and best subset selection. The fact that there is no predominant method indicates that none of them is broadly satisfactory; a good discussion is given by Draper and Smith (1998). StatsDirect provides best subset selection by examination of all possible regressions. You have the option of either minimum Mallows' Cp or maximum overall F as base statistic for subset selection. You may also force the inclusion of variables in this selection procedure if you consider their exclusion to be illogical in "real world" terms. Subset selection is best performed under expert statistical guidance.
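
An all-possible-regressions search ranked by Mallows' Cp can be sketched as follows. It assumes the usual definition Cp = SSE_subset / s²_full - n + 2p, where p counts the intercept; this is an illustration, not StatsDirect's implementation. The function names, the `forced` argument and the simulated data are hypothetical.

```python
from itertools import combinations
import numpy as np

def residual_sum_of_squares(X, y, columns):
    """Fit y on an intercept plus the chosen columns; return (SSE, parameter count)."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid), design.shape[1]

def all_subsets_by_cp(X, y, forced=()):
    """Return (Cp, subset) pairs, smallest Cp first, for subsets containing `forced`."""
    n, k = X.shape
    forced = tuple(forced)
    sse_full, p_full = residual_sum_of_squares(X, y, range(k))
    s2_full = sse_full / (n - p_full)        # residual mean square of the full model
    optional = [j for j in range(k) if j not in forced]
    results = []
    for size in range(len(optional) + 1):
        for chosen in combinations(optional, size):
            subset = tuple(sorted(forced + chosen))
            sse, p = residual_sum_of_squares(X, y, subset)
            results.append((sse / s2_full - n + 2 * p, subset))
    return sorted(results)

# Simulated data: only columns 0 and 2 truly affect y.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=60)

# Force column 1 into every candidate model.
results = all_subsets_by_cp(X, y, forced=(1,))
best_cp, best_subset = results[0]
print(best_subset, round(best_cp, 2))
```

The search fits 2^m models for m optional predictors, so it is only practical for a modest number of candidates.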

__Weights for outcome observations__

StatsDirect can perform a general linear regression for which some outcome (Y) observations are given more weight in the model than others. An example of weights is 1/variance for each Y_{i} where Y_{i} is a mean of multiple observations. This sort of analysis makes strong assumptions and is thus best carried out only under expert statistical guidance. An unweighted analysis is performed if you press cancel when asked for a weight variable.
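
A weighted fit of this kind minimises the weighted sum of squared residuals, which can be done by scaling each row by the square root of its weight. The sketch below assumes each Y_{i} is a mean of several observations with a known common variance; the function name and the simulated data are hypothetical.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Minimise sum_i w_i * (y_i - b0 - b1*x_i1 - ...)^2."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    root_w = np.sqrt(np.asarray(weights, dtype=float))
    # Scaling rows by sqrt(w_i) turns the weighted problem into an ordinary one.
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef

rng = np.random.default_rng(3)
group_size = rng.integers(2, 20, size=25)    # observations behind each mean
x = rng.uniform(0, 10, size=25)
# Each y_i is a mean of group_size[i] observations, so its variance is 4 / group_size[i].
y = 5.0 + 1.5 * x + rng.normal(0, 2.0 / np.sqrt(group_size))
weights = group_size / 4.0                   # 1 / variance of each mean

b_weighted = weighted_least_squares(x, y, weights)
b_unweighted = weighted_least_squares(x, y, np.ones(25))
print(np.round(b_weighted, 3), np.round(b_unweighted, 3))
```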

__Technical Validation__

StatsDirect uses QR decomposition by Givens rotations to solve the linear equations to a high level of accuracy (Gentleman, 1974; Golub and Van Loan, 1983). Predictors that are highly correlated with other predictors are dropped from the model (you are warned of this in the results). If the QR method fails (rare) then StatsDirect will solve the system by singular value decomposition (Chan, 1982).
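
The idea of solving least squares by Givens rotations can be sketched as below. This is a simplified textbook version for illustration, not the Gentleman (1974) algorithm and not StatsDirect's code; it does no pivoting and does not handle rank-deficient designs. The function name and the simulated data are hypothetical.

```python
import numpy as np

def givens_least_squares(design, y):
    """Solve min ||design @ b - y|| by rotating design to upper triangular form."""
    R = np.array(design, dtype=float)
    z = np.array(y, dtype=float)
    n, p = R.shape
    for j in range(p):
        # Zero the entries below the diagonal in column j, working upwards.
        for i in range(n - 1, j, -1):
            a, b = R[i - 1, j], R[i, j]
            if b == 0.0:
                continue
            r = np.hypot(a, b)
            c, s = a / r, b / r
            # Rotate rows i-1 and i so that R[i, j] becomes zero.
            upper = c * R[i - 1, :] + s * R[i, :]
            lower = -s * R[i - 1, :] + c * R[i, :]
            R[i - 1, :], R[i, :] = upper, lower
            z[i - 1], z[i] = c * z[i - 1] + s * z[i], -s * z[i - 1] + c * z[i]
    # Back substitution on the p-by-p upper triangle.
    coef = np.zeros(p)
    for j in range(p - 1, -1, -1):
        coef[j] = (z[j] - R[j, j + 1:] @ coef[j + 1:]) / R[j, j]
    return coef

rng = np.random.default_rng(4)
design = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = design @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

coef = givens_least_squares(design, y)
print(np.round(coef, 4))
```

Because rotations are orthogonal, the method works on the design matrix directly and avoids forming X'X, which is the step that loses accuracy with ill-conditioned data.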

__Example__

From Armitage and Berry (1994, p. 316).

Test workbook (Regression worksheet: X1, X2, YY).

The following data are from a trial of a hypotensive drug used to lower blood pressure during surgery. The outcome/dependent variable (YY) is minutes taken to recover an acceptable (100 mmHg) systolic blood pressure, and the two predictor or explanatory variables are log dose of drug (X1) and mean systolic blood pressure during the induced hypotensive episode (X2).

| YY | X1 | X2 |
| --- | --- | --- |
| 7 | 2.26 | 66 |
| 10 | 1.81 | 52 |
| 18 | 1.78 | 72 |
| 4 | 1.54 | 67 |
| 10 | 2.06 | 69 |
| 13 | 1.74 | 71 |
| 21 | 2.56 | 88 |
| 12 | 2.29 | 68 |
| 9 | 1.80 | 59 |
| 65 | 2.32 | 73 |
| 20 | 2.04 | 68 |
| 31 | 1.88 | 58 |
| 23 | 1.18 | 61 |
| 22 | 2.08 | 68 |
| 13 | 1.70 | 69 |
| 9 | 1.74 | 55 |
| 50 | 1.90 | 67 |
| 12 | 1.79 | 67 |
| 11 | 2.11 | 68 |
| 8 | 1.72 | 59 |
| 26 | 1.74 | 68 |
| 16 | 1.60 | 63 |
| 23 | 2.15 | 65 |
| 7 | 2.26 | 72 |
| 11 | 1.65 | 58 |
| 8 | 1.63 | 69 |
| 14 | 2.40 | 70 |
| 39 | 2.70 | 73 |
| 28 | 1.90 | 56 |
| 12 | 2.78 | 83 |
| 60 | 2.27 | 67 |
| 10 | 1.74 | 84 |
| 60 | 2.62 | 68 |
| 22 | 1.80 | 64 |
| 21 | 1.81 | 60 |
| 14 | 1.58 | 62 |
| 4 | 2.41 | 76 |
| 27 | 1.65 | 60 |
| 26 | 2.24 | 60 |
| 28 | 1.70 | 59 |
| 15 | 2.45 | 84 |
| 8 | 1.72 | 66 |
| 46 | 2.37 | 68 |
| 24 | 2.23 | 65 |
| 12 | 1.92 | 69 |
| 25 | 1.99 | 72 |
| 45 | 1.99 | 63 |
| 72 | 2.35 | 56 |
| 25 | 1.80 | 70 |
| 28 | 2.36 | 69 |
| 10 | 1.59 | 60 |
| 25 | 2.10 | 51 |
| 44 | 1.80 | 61 |

To analyse these data in StatsDirect you must first enter them into three columns in the workbook appropriately labelled. Alternatively, open the test workbook using the file open function of the file menu. Then select Multiple Linear Regression from the Regression and Correlation section of the analysis menu. When you are prompted for regression options, tick the "calculate intercept" box (it is unusual to have reason not to calculate an intercept) and leave the "use weights" box unticked (regression with unweighted responses). Select the column "YY" when prompted for outcome and "X1" and "X2" when prompted for predictor data.

For this example:

__Multiple linear regression__

| Term | Coefficient | r | t | P |
| --- | --- | --- | --- | --- |
| Intercept | b0 = 23.010668 | | 1.258453 | 0.2141 |
| X1 | b1 = 23.638558 | 0.438695 | 3.45194 | 0.0011 |
| X2 | b2 = -0.714675 | -0.317915 | -2.371006 | 0.0216 |

yy = 23.010668 + 23.638558 x1 - 0.714675 x2

__Analysis of variance from regression__

| Source of variation | Sum Squares | DF | Mean Square |
| --- | --- | --- | --- |
| Regression | 2783.220444 | 2 | 1391.610222 |
| Residual | 11007.949367 | 50 | 220.158987 |
| Total (corrected) | 13791.169811 | 52 | |

Root MSE = 14.837755

F = 6.320933, P = 0.0036

Multiple correlation coefficient (R) = 0.449235

R² = 20.181177%

Ra² = 16.988424%

Durbin-Watson test statistic = 1.888528

The variance ratio, F, for the overall regression is highly significant, so we have very little reason to doubt that X1, X2 or both are associated with YY. The R² value shows that only 20% of the variance of YY is accounted for by the regression; the predictive value of this model is therefore low. The partial correlation coefficients are both statistically significant, but the intercept is not.

__Technical validation results__

The US National Institute of Standards and Technology (NIST) provides Statistical Reference Datasets for testing statistical software (McCullough and Wilson, 1999; http://www.itl.nist.gov/div898/strd/). The results below for the Longley data set (Longley, 1967) are given to 12 decimal places:

__Multiple linear regression__

| Term | Coefficient | r | t | P |
| --- | --- | --- | --- | --- |
| Intercept | b0 = -3482258.63459587 | | -3.910802918155 | 0.0036 |
| x1 | b1 = 15.061872271426 | 0.059022267544 | 0.177376028231 | 0.8631 |
| x2 | b2 = -0.035819179293 | -0.335803857852 | -1.069516317221 | 0.3127 |
| x3 | b3 = -2.020229803817 | -0.809509044959 | -4.136427355941 | 0.0025 |
| x4 | b4 = -1.033226867174 | -0.849083964187 | -4.821985310446 | 0.0009 |
| x5 | b5 = -0.051104105654 | -0.075137380464 | -0.226051144664 | 0.8262 |
| x6 | b6 = 1829.15146461358 | 0.801139716237 | 4.01588981271 | 0.003 |

y = -3482258.63459587 + 15.061872271426 x1 - 0.035819179293 x2 - 2.020229803817 x3 - 1.033226867174 x4 - 0.051104105654 x5 + 1829.15146461358 x6