Linearity with Replicates of the Outcome (Y)

Menu location: Analysis_Regression and Correlation_Grouped Linear_Linearity.

This function gives a test for linearity in a simple linear regression model when the response/outcome variable (Y) has been measured repeatedly.

The standard analysis of variance for simple (one predictor) linear regression tests whether the observed data show a straight-line trend (a non-zero slope), but it does not test whether a straight-line model is appropriate in the first place. This function can be used to test that assumption of linearity.

For studies that employ linear regression, it is worth collecting repeat Y observations. This enables you to run a test of linearity and thus justify or refute the use of linear regression in subsequent analysis (Armitage and Berry, 1994). Replicate Y observations should be entered in separate workbook columns (variables), one column for each observation in the predictor (x) variable. The number of Y replicate variables which you are prompted to select is the number of rows in the x variable.

Please note that the repeats of Y observations should be arranged in a worksheet so that each Y column contains repeats of a single Y observation that matches an x observation (i.e. row 4 of x should relate to column 4 of Y repeats). If your data are arranged so that each repeat of Y is in a separate column then use the Rotate Data Block function of the Data menu on your Y data before selecting them for this function.
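The rotation described above amounts to transposing a block of cells. The following is a minimal Python sketch of that idea, not the Rotate Data Block function itself; the cell labels (such as "x1 rep1") and the variable names are hypothetical placeholders, and empty strings stand in for empty worksheet cells.

```python
from itertools import zip_longest

# Hypothetical block: one worksheet row per x observation,
# with the Y repeats running across the columns.
repeats_across = [
    ["x1 rep1", "x1 rep2", "x1 rep3"],
    ["x2 rep1", "x2 rep2"],
    ["x3 rep1", "x3 rep2", "x3 rep3", "x3 rep4"],
]

# Transpose so that column j holds all repeats for row j of x.
# Shorter groups are padded with empty cells.
rotated = list(zip_longest(*repeats_across, fillvalue=""))

for worksheet_row in rotated:
    print("\t".join(worksheet_row))
```

After rotation, the first column contains only the repeats for the first x observation, the second column those for the second, and so on, which is the arrangement this function expects.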

Assumptions:

Y replicates are a random sample from a normal distribution
Deviations from the regression line (residuals) follow a normal distribution
Deviations from the regression line (residuals) have uniform variance

Generalisations of this method to models with more than one predictor are available (Draper and Smith, 1998). Generalised replicate analysis is best done as part of exploratory modelling by a statistician.

Technical Validation

Linearity is tested by analysis of variance for the linear regression of k outcome observations for each level of the predictor variable (Armitage and Berry, 1994).

The total sum of squares, SS_total, is partitioned into three parts: SS_regression, the sum of squares due to the regression of Y on x; SS_repeats, the part of the usual residual sum of squares that is due to variation within repeats of the outcome observations; and the remainder, which represents the sum of squares due to deviation of the means of repeated outcome observations from the regression. Y is the outcome variable, x is the predictor variable, N is the total number of Y observations and n_j is the number of Y repeats for the jth x observation.
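The partition can be written out explicitly. What follows is a standard textbook form of these sums of squares, offered as a sketch rather than a transcription of StatsDirect's own formulas. The symbols Y_{ij} (the ith repeat at the jth x observation), \bar{Y}_j (the mean of the repeats at x_j), \bar{Y} (the grand mean of Y), \bar{x} (the mean of x weighted by n_j) and m (the number of x observations) are notation assumed here.

```latex
\begin{align*}
SS_{\text{total}} &= \sum_{j=1}^{m} \sum_{i=1}^{n_j} \left(Y_{ij} - \bar{Y}\right)^2
  && \text{DF} = N - 1 \\
SS_{\text{regression}} &= \frac{\left[\sum_{j=1}^{m} n_j \left(x_j - \bar{x}\right)\left(\bar{Y}_j - \bar{Y}\right)\right]^2}
  {\sum_{j=1}^{m} n_j \left(x_j - \bar{x}\right)^2}
  && \text{DF} = 1 \\
SS_{\text{repeats}} &= \sum_{j=1}^{m} \sum_{i=1}^{n_j} \left(Y_{ij} - \bar{Y}_j\right)^2
  && \text{DF} = N - m \\
SS_{\text{deviation}} &= SS_{\text{total}} - SS_{\text{regression}} - SS_{\text{repeats}}
  && \text{DF} = m - 2
\end{align*}
```

Under this form, the variance ratio for each of the regression and deviation rows is its mean square divided by the mean square within repeats, SS_repeats / (N - m).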

Example

From Armitage and Berry (1994, p. 288).

Test workbook (Regression worksheet: Log Dose_Std, BD 1_Std, BD 2_Std, BD 3_Std).

A preparation of Vitamin D is tested for its effect on bones by feeding it to rats that have an induced lack of mineral in their bones. X-ray methods are used to test the re-mineralisation of bones in response to the Vitamin D.

Log dose of Vit D
0.544	0.845	1.146
Bone density score
0	1.5	2
0	2.5	2.5
1	5	5
2.75	6	4
2.75	4.25	5
1.75	2.75	4
2.75	1.5	2.5
2.25	3	3.5
2.25		3
2.5		2
		3
		4
		4

To analyse these data in StatsDirect you must first enter them into four columns in the workbook and label them appropriately. The first column is just three rows long and contains the three log doses of Vitamin D above (logarithms are taken because, from previous research, the relationship between bone re-mineralisation and Vitamin D is known to be log-linear). The next three columns hold the repeated measures of bone density, one column for each of the three levels of log dose, which are represented by the rows of the first column. Alternatively, open the test workbook using the file open function of the file menu.

Then select Linearity from the Grouped Linear section of the Regression and Correlation section of the Analysis menu. When you are prompted for the x variable, select the column marked "Log Dose_Std"; this contains the three log dose levels. Then select, in one action, the three Y columns "BD 1_Std", "BD 2_Std" and "BD 3_Std", which correspond in that order to the rows (levels) of the x variable, i.e. 0.544, 0.845 and 1.146.

For this example:

Linearity with replicates of Y

Source of variation	SSq	DF	MSq	VR
Due to regression	14.088629	1	14.088629	9.450512	P = .0047
Deviation of x means	2.903415	1	2.903415	1.947581	P = .1738
Within x residual	41.741827	28	1.49078
Total	58.733871	30

Regression slope is significant

Assumption of linearity supported

Thus the regression itself (meaning the slope) was statistically highly significant (P = 0.0047). If the deviations of the x means from the regression had been significant then we should have rejected our assumption of linearity; as it stands, they were not (P = 0.1738).
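The sums of squares and variance ratios in the table above can be checked by hand. The following is a minimal Python sketch using the bone density data from this example; the function name linearity_anova and the dictionary layout are illustrative choices, not StatsDirect's implementation.

```python
# Bone density scores keyed by log dose of Vitamin D (data from this example).
bone_density = {
    0.544: [0, 0, 1, 2.75, 2.75, 1.75, 2.75, 2.25, 2.25, 2.5],
    0.845: [1.5, 2.5, 5, 6, 4.25, 2.75, 1.5, 3],
    1.146: [2, 2.5, 5, 4, 5, 4, 2.5, 3.5, 3, 2, 3, 4, 4],
}


def linearity_anova(groups):
    """Partition the total sum of squares when Y is replicated at each x.

    `groups` maps each x value to the list of Y repeats observed there.
    """
    n_total = sum(len(ys) for ys in groups.values())
    n_levels = len(groups)
    y_mean = sum(sum(ys) for ys in groups.values()) / n_total
    x_mean = sum(x * len(ys) for x, ys in groups.items()) / n_total

    # Total variation of Y about the grand mean.
    ss_total = sum((y - y_mean) ** 2 for ys in groups.values() for y in ys)
    # Variation of the repeats about their own group mean.
    ss_repeats = sum(
        (y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys
    )
    # Variation explained by the fitted straight line.
    sxx = sum(len(ys) * (x - x_mean) ** 2 for x, ys in groups.items())
    sxy = sum(
        (x - x_mean) * (y - y_mean) for x, ys in groups.items() for y in ys
    )
    ss_regression = sxy ** 2 / sxx
    # Remainder: deviation of the group means from the line.
    ss_deviation = ss_total - ss_regression - ss_repeats

    df_deviation = n_levels - 2
    df_repeats = n_total - n_levels
    ms_repeats = ss_repeats / df_repeats

    return {
        "ss_regression": ss_regression,
        "ss_deviation": ss_deviation,
        "ss_repeats": ss_repeats,
        "ss_total": ss_total,
        "df_deviation": df_deviation,
        "df_repeats": df_repeats,
        "vr_regression": ss_regression / ms_repeats,
        "vr_deviation": (ss_deviation / df_deviation) / ms_repeats,
    }


table = linearity_anova(bone_density)
for name, value in table.items():
    print(f"{name}: {value:.6f}")
```

The sketch stops at the variance ratios to stay free of dependencies. If SciPy is available, scipy.stats.f.sf applied to each variance ratio with its degrees of freedom should give the corresponding P values.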
