Poisson regression

 

Menu location: Analysis_Regression & Correlation_Poisson

 

This function fits a Poisson regression model for multivariate analysis of numbers of uncommon events in cohort studies.

 

The multiplicative Poisson regression model is fitted as a log-linear regression (i.e. a log link and a Poisson error distribution), with an offset equal to the natural logarithm of person-time if person-time is specified (McCullagh and Nelder, 1989; Frome, 1983; Agresti, 2002). With the multiplicative Poisson model, the exponentiated coefficients are incidence rate ratios (relative risks). These baseline relative risks give values relative to the named covariates for the whole population. You can define the relative risk for a sub-population by multiplying that sub-population's baseline relative risk by the relative risks due to other covariate groupings, for example the relative risk of dying from lung cancer if you are a smoker who has lived in a high radon area. StatsDirect offers sub-population relative risks for dichotomous covariates.
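As a sketch of the model described above, the log-linear form with a person-time offset can be written out as follows. The symbols are assumptions introduced here for illustration: y_i is the event count, t_i the person-time, x_ij the covariates and beta_j the coefficients.

```latex
\ln E(y_i) = \ln(t_i) + \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
\qquad\Longleftrightarrow\qquad
\frac{E(y_i)}{t_i} = \exp(\beta_0) \prod_{j=1}^{p} \exp(\beta_j x_{ij})

% Incidence rate ratio for a one-unit increase in covariate j:
\mathrm{IRR}_j = \exp(\beta_j)

% Combined relative risk for two dichotomous covariates a and b both present:
\mathrm{IRR}_{a,b} = \exp(\beta_a)\,\exp(\beta_b) = \exp(\beta_a + \beta_b)
```

The last line is the multiplication of relative risks mentioned in the text: because the model is additive on the log scale, rate ratios multiply on the rate scale.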

 

The outcome/response variable is assumed to come from a Poisson distribution. Note that a Poisson distribution is the distribution of the number of events in a fixed time interval, provided that the events occur at random, independently in time and at a constant rate. Poisson distributions are used for modelling events per unit space as well as time, for example number of particles per square centimetre.
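The equality of mean and variance in a Poisson distribution, which the adequacy checks on this page rely on, can be verified numerically. The following is a minimal sketch using only the Python standard library; the function name and the rate of 3 events per interval are illustrative assumptions.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0            # illustrative expected number of events per interval
ks = range(0, 60)    # the tail beyond 60 is negligible for lam = 3

mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(mean, variance)  # both are numerically equal to lam
```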

 

Poisson regression can also be used for log-linear modelling of contingency table data, and for multinomial modelling. For contingency table counts you would create r + c indicator/dummy variables as the covariates, representing the r rows and c columns of the contingency table:

 

r1c1  r1c2  r1c3
r2c1  r2c2  r2c3
r3c1  r3c2  r3c3

 

Response  x_r1  x_r2  x_r3  x_c1  x_c2  x_c3
r1c1      1     0     0     1     0     0
r1c2      1     0     0     0     1     0
r1c3      1     0     0     0     0     1
r2c1      0     1     0     1     0     0
r2c2      0     1     0     0     1     0
r2c3      0     1     0     0     0     1
r3c1      0     0     1     1     0     0
r3c2      0     0     1     0     1     0
r3c3      0     0     1     0     0     1
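The indicator coding shown in the table can be generated mechanically. This is a minimal sketch in Python; the function name `indicator_rows` is hypothetical and not part of StatsDirect.

```python
def indicator_rows(n_rows, n_cols):
    """Return (cell label, indicator values) for every cell of an r x c table.

    The first n_rows indicators mark the row of the cell and the
    remaining n_cols indicators mark its column.
    """
    table = []
    for r in range(n_rows):
        for c in range(n_cols):
            row_part = [1 if i == r else 0 for i in range(n_rows)]
            col_part = [1 if j == c else 0 for j in range(n_cols)]
            table.append((f"r{r + 1}c{c + 1}", row_part + col_part))
    return table

design = indicator_rows(3, 3)
for label, indicators in design:
    print(label, *indicators)
```

Note that the row indicators sum to 1 for every cell, as do the column indicators, so these r + c columns are linearly dependent when used together with a constant term.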

 

Adequacy of the model

In order to assess the adequacy of the Poisson regression model you should first look at the basic descriptive statistics for the event count data. The mean and variance are equal in a Poisson distribution, so if the variance of the counts is much larger than their mean then the model is likely to be over-dispersed.
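A first descriptive check of this kind can be sketched as below. The counts are made-up illustrative values, not data from the test workbook, and the ratio is only a rough guide because it ignores the covariates.

```python
import statistics

# Hypothetical event counts, for illustration only.
counts = [0, 1, 1, 2, 3, 5, 8, 13]

count_mean = statistics.mean(counts)
count_variance = statistics.variance(counts)  # sample variance (n - 1 denominator)
ratio = count_variance / count_mean

# A ratio well above 1 suggests the counts vary more than a Poisson
# distribution with this mean would allow.
print(count_mean, count_variance, ratio)
```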

 

The model analysis option gives a scale parameter (sp) as a measure of over-dispersion; this is equal to the Pearson chi-square statistic divided by its degrees of freedom, i.e. the number of observations minus the number of parameters (covariates and intercept). The variances of the coefficients can be adjusted by multiplying by sp (equivalently, the standard errors by the square root of sp). The goodness of fit test statistics and residuals can be adjusted by dividing by sp. Using a quasi-likelihood approach sp could be integrated with the regression, but this would assume a known fixed value for sp, which is seldom the case. A better approach to over-dispersed Poisson models is to use a parametric alternative model, the negative binomial.
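The scaling arithmetic can be checked against the figures reported in the worked example later on this page. This is a minimal sketch; the number of observations (22) is inferred here from the reported residual degrees of freedom (10) plus the rank (12), and the variable names are illustrative.

```python
import math

# Figures taken from the example output on this page.
pearson_chi2 = 5.086063
deviance_gof = 5.217124
g2 = 2067.700372
se_constant = 0.204506

n_obs = 22       # assumption: residual df (10) + rank (12)
n_params = 12    # intercept plus 11 covariates

sp = pearson_chi2 / (n_obs - n_params)

scaled_se_constant = se_constant * math.sqrt(sp)   # variance multiplied by sp
scaled_deviance_gof = deviance_gof / sp
scaled_pearson_gof = pearson_chi2 / sp             # equals the degrees of freedom
scaled_g2 = g2 / sp

print(sp, scaled_se_constant, scaled_deviance_gof, scaled_g2)
```

In this example sp is below 1, so there is no sign of over-dispersion and the unscaled results would normally be reported.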

 

The deviance (likelihood ratio) test statistic, G², is the most useful summary of the adequacy of the fitted model. It represents the change in deviance between the fitted model and the model with a constant term and no covariates; therefore G² is not calculated if no constant is specified. If this test is significant then the covariates contribute significantly to the model.
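As a worked check, G² can be reproduced from the two deviances reported in the example output on this page. The variable names are illustrative; the degrees of freedom are taken as the number of parameters other than the constant.

```python
# Figures taken from the model analysis output in the example.
deviance_no_covariates = 2072.917496
deviance_all_covariates = 5.217124
rank = 12  # parameters in the fitted model, including the constant

g2 = deviance_no_covariates - deviance_all_covariates
g2_df = rank - 1  # the constant-only model has one parameter

print(round(g2, 6), g2_df)
```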

 

The deviance goodness of fit test reflects the fit of the data to a Poisson distribution in the regression. If this test is significant then a red asterisk is shown by the P value, and you should consider other covariates and/or other error distributions such as negative binomial.

 

StatsDirect does not exclude/drop covariates from its Poisson regression if they are highly correlated with one another. Models that are not of full rank (rank = number of parameters) are fully estimated in most circumstances, but you should usually consider combining or excluding variables, or possibly excluding the constant term. You should seek expert statistical advice if you find yourself in this situation.

 

Technical validation

The deviance function is:

\[ D = 2 \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{y_i}{\mu_i}\right) - (y_i - \mu_i) \right] \]

- where y is the number of events, n is the number of observations and μ is the fitted Poisson mean.

The log-likelihood function is:

\[ \ell = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i - \ln(y_i!) \right] \]

The maximum likelihood regression proceeds by iteratively re-weighted least squares, using singular value decomposition to solve the linear system at each iteration, until the change in deviance is within the specified accuracy.
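The iteration can be illustrated with a small standard-library sketch. It is not StatsDirect's implementation: for brevity it solves the weighted normal equations by Gaussian elimination rather than by singular value decomposition, the function names are hypothetical, and the four-observation data set is made up.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def poisson_deviance(y, mu):
    """Poisson deviance; the y*ln(y/mu) term is taken as 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total

def fit_poisson(X, y, offset=None, tol=1e-7, max_iter=50):
    """Fit a log-link Poisson model by iteratively re-weighted least squares."""
    n, p = len(y), len(X[0])
    offset = offset or [0.0] * n
    mu = [max(yi, 0.5) for yi in y]      # start from the data, kept away from zero
    eta = [math.log(m) for m in mu]
    beta, dev_old = [0.0] * p, float("inf")
    for _ in range(max_iter):
        # Working response (offset removed); the working weights equal mu.
        z = [e - o + (yi - m) / m for e, o, yi, m in zip(eta, offset, y, mu)]
        xtwx = [[sum(mu[i] * X[i][j] * X[i][k] for i in range(n))
                 for k in range(p)] for j in range(p)]
        xtwz = [sum(mu[i] * X[i][j] * z[i] for i in range(n)) for j in range(p)]
        beta = solve(xtwx, xtwz)
        eta = [o + sum(b * x for b, x in zip(beta, row)) for o, row in zip(offset, X)]
        mu = [math.exp(e) for e in eta]
        dev = poisson_deviance(y, mu)
        if abs(dev_old - dev) < tol:     # stop when the deviance has settled
            break
        dev_old = dev
    return beta, dev

# Made-up data: constant plus one dichotomous covariate, with person-time.
X = [[1, 0], [1, 0], [1, 1], [1, 1]]
y = [2, 4, 6, 10]
person_time = [1.0, 1.0, 2.0, 2.0]
beta, dev = fit_poisson(X, y, offset=[math.log(t) for t in person_time])
print(beta, dev)
```

For these data the fitted rates are simply the group totals divided by group person-time (6/2 and 16/4), so the coefficients should approach ln(3) and ln(4/3).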

 

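The Pearson chi-square residual is:

\[ r_{p,i} = \frac{y_i - \mu_i}{\sqrt{\mu_i}} \]

The Pearson goodness of fit test statistic is:

\[ \chi^2 = \sum_{i=1}^{n} \frac{(y_i - \mu_i)^2}{\mu_i} \]

The deviance residual is (Cook and Weisberg, 1982):

\[ r_{d,i} = \operatorname{sign}(y_i - \mu_i) \sqrt{2 \left[ y_i \ln\left(\frac{y_i}{\mu_i}\right) - (y_i - \mu_i) \right]} \]

The Freeman-Tukey, variance stabilized, residual is (Freeman and Tukey, 1950):

\[ r_{ft,i} = \sqrt{y_i} + \sqrt{y_i + 1} - \sqrt{4\mu_i + 1} \]

The standardized residual is:

\[ r_{s,i} = \frac{y_i - \mu_i}{\sqrt{\mu_i (1 - h_i)}} \]

- where μ is the fitted Poisson mean and h is the leverage (diagonal of the Hat matrix).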
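The residuals listed above can be sketched in Python using their usual textbook forms for a Poisson model, which are assumed here to correspond to the formulas on this page. The function names and the example values (9 events observed, 4 expected, leverage 0.75) are illustrative.

```python
import math

def pearson_residual(y, mu):
    """Raw residual divided by the Poisson standard deviation sqrt(mu)."""
    return (y - mu) / math.sqrt(mu)

def deviance_residual(y, mu):
    """Signed square root of the observation's contribution to the deviance."""
    term = y * math.log(y / mu) if y > 0 else 0.0
    d = 2.0 * (term - (y - mu))
    return math.copysign(math.sqrt(max(d, 0.0)), y - mu)

def freeman_tukey_residual(y, mu):
    """Variance-stabilized residual."""
    return math.sqrt(y) + math.sqrt(y + 1) - math.sqrt(4 * mu + 1)

def standardized_residual(y, mu, h):
    """Pearson residual adjusted for leverage h (0 <= h < 1)."""
    return (y - mu) / math.sqrt(mu * (1 - h))

y_obs, mu_fit, leverage = 9, 4.0, 0.75
print(pearson_residual(y_obs, mu_fit))
print(deviance_residual(y_obs, mu_fit))
print(freeman_tukey_residual(y_obs, mu_fit))
print(standardized_residual(y_obs, mu_fit, leverage))
```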

 

Example

From Armitage et al. (2001):

Test workbook (Regression worksheet: Cancers, Subject-years, Veterans, Age group).

 

To analyse these data using StatsDirect you must first open the test workbook using the file open function of the file menu. Next generate a set of dummy variables to represent the levels of the "Age group" variable using the Dummy Variables function of the Data menu. Then select Poisson from the Regression and Correlation section of the Analysis menu. Select the column marked "Cancers" when asked for the response. Then select "Subject-years" when asked for person-time. Then select "Veterans", "Age group (25-29)", "Age group (30-34)" etc. in one action when you are asked for predictors.

 

For this example:

 

Poisson regression

 

Deviance (likelihood ratio) chi-square = 2067.700372 df = 11 P < 0.0001

 

Intercept          b0 = -9.324832   z = -45.596773   P < 0.0001
Veterans           b1 = -0.003528   z = -0.063587    P = 0.9493
Age group (25-29)  b2 = 0.679314    z = 2.921869     P = 0.0035
Age group (30-34)  b3 = 1.371085    z = 6.297824     P < 0.0001
Age group (35-39)  b4 = 1.939619    z = 9.14648      P < 0.0001
Age group (40-44)  b5 = 2.034323    z = 9.413835     P < 0.0001
Age group (45-49)  b6 = 2.726551    z = 12.269534    P < 0.0001
Age group (50-54)  b7 = 3.202873    z = 14.515926    P < 0.0001
Age group (55-59)  b8 = 3.716187    z = 17.064363    P < 0.0001
Age group (60-64)  b9 = 4.092676    z = 18.801188    P < 0.0001
Age group (65-69)  b10 = 4.23621    z = 18.892791    P < 0.0001
Age group (70+)    b11 = 4.363717   z = 19.19183     P < 0.0001

 

log Cancers [offset log(Subject-years)] = -9.324832 -0.003528 Veterans +0.679314 Age group (25-29) +1.371085 Age group (30-34) +1.939619 Age group (35-39) +2.034323 Age group (40-44) +2.726551 Age group (45-49) +3.202873 Age group (50-54) +3.716187 Age group (55-59) +4.092676 Age group (60-64) +4.23621 Age group (65-69) +4.363717 Age group (70+)
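To show how the fitted equation is used, the sketch below evaluates it for the oldest age group. The coefficients are copied from the output above; the function name and the 10,000 subject-years of follow-up are hypothetical. The age group with no dummy variable in the model acts as the reference category.

```python
import math

# Coefficients copied from the fitted equation above (other age groups omitted).
coef = {
    "Constant": -9.324832,
    "Veterans": -0.003528,
    "Age group (70+)": 4.363717,
}

def fitted_rate(veteran, age_70_plus):
    """Fitted cancer rate per subject-year for the given 0/1 covariate values."""
    eta = (coef["Constant"]
           + coef["Veterans"] * veteran
           + coef["Age group (70+)"] * age_70_plus)
    return math.exp(eta)

rate_nonveteran_70 = fitted_rate(0, 1)
rate_veteran_70 = fitted_rate(1, 1)

subject_years = 10000.0  # hypothetical amount of follow-up
expected_cancers = subject_years * rate_veteran_70

print(rate_nonveteran_70, rate_veteran_70, expected_cancers)
```

The ratio of the two rates is exp(-0.003528) whichever age group is chosen, which is why the Veterans coefficient can be reported as a single incidence rate ratio.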

 

Poisson regression - incidence rate ratios

 

Inference population: whole study (baseline risk)

Parameter          Estimate    IRR         95% CI
Veterans           -0.003528   0.996479    0.89381 to 1.11094
Age group (25-29)  0.679314    1.972524    1.250616 to 3.111147
Age group (30-34)  1.371085    3.939622    2.571233 to 6.036256
Age group (35-39)  1.939619    6.956098    4.590483 to 10.540786
Age group (40-44)  2.034323    7.647073    5.006696 to 11.679905
Age group (45-49)  2.726551    15.280093   9.884869 to 23.620062
Age group (50-54)  3.202873    24.60311    15.96527 to 37.914362
Age group (55-59)  3.716187    41.107367   26.825601 to 62.992647
Age group (60-64)  4.092676    59.899957   39.096281 to 91.773558
Age group (65-69)  4.23621     69.145275   44.555675 to 107.305502
Age group (70+)    4.363717    78.54856    50.303407 to 122.653248

 

Poisson regression - model analysis

 

Accuracy = 1.00E-07

Log likelihood with all covariates = -66.006668

 

Deviance with all covariates = 5.217124, df = 10, rank = 12

Akaike information criterion = 29.217124

Schwarz information criterion = 45.400676

 

Deviance with no covariates = 2072.917496

Deviance (likelihood ratio, G²) = 2067.700372, df = 11, P < 0.0001

Pseudo (McFadden) R-square = 0.997483

Pseudo (likelihood ratio index) R-square = 0.939986

 

Pearson goodness of fit = 5.086063, df = 10, P = 0.8854

Deviance goodness of fit = 5.217124, df = 10, P = 0.8762

 

Over-dispersion scale parameter = 0.508606

Scaled G² = 4065.424363, df = 11, P < 0.0001

Scaled Pearson goodness of fit = 10, df = 10, P = 0.4405

Scaled Deviance goodness of fit = 10.257687, df = 10, P = 0.4182

 

Parameter          Coefficient   Standard Error
Constant           -9.324832     0.204506
Veterans           -0.003528     0.055478
Age group (25-29)  0.679314      0.232493
Age group (30-34)  1.371085      0.217708
Age group (35-39)  1.939619      0.212062
Age group (40-44)  2.034323      0.216099
Age group (45-49)  2.726551      0.222221
Age group (50-54)  3.202873      0.220645
Age group (55-59)  3.716187      0.217775
Age group (60-64)  4.092676      0.217682
Age group (65-69)  4.23621       0.224224
Age group (70+)    4.363717      0.227374

 

Parameter          Scaled Standard Error   Scaled Wald z   P
Constant           0.145847                -63.935674      P < 0.0001
Veterans           0.039565                -0.089162       P = 0.929
Age group (25-29)  0.165806                4.097037        P < 0.0001
Age group (30-34)  0.155262                8.830792        P < 0.0001
Age group (35-39)  0.151235                12.825169       P < 0.0001
Age group (40-44)  0.154115                13.200054       P < 0.0001
Age group (45-49)  0.158481                17.204308       P < 0.0001
Age group (50-54)  0.157357                20.354193       P < 0.0001
Age group (55-59)  0.15531                 23.927605       P < 0.0001
Age group (60-64)  0.155243                26.362975       P < 0.0001
Age group (65-69)  0.159909                26.491421       P < 0.0001
Age group (70+)    0.162155                26.910733       P < 0.0001

 

With 95% confidence you can infer that the incidence rate ratio for cancer in these veterans compared with non-veterans lies between 0.89 and 1.11. This interval includes 1, i.e. the effect is not statistically significant.
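The interval quoted above can be reproduced from the Veterans coefficient and its standard error in the example output. This is a minimal sketch assuming a Wald-type interval computed on the log scale and then exponentiated; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Veterans coefficient and (unscaled) standard error from the example output.
b_veterans = -0.003528
se_veterans = 0.055478

z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

irr = math.exp(b_veterans)
ci_lower = math.exp(b_veterans - z_crit * se_veterans)
ci_upper = math.exp(b_veterans + z_crit * se_veterans)

print(irr, ci_lower, ci_upper)
```

The results agree with the reported IRR of 0.996479 and interval of 0.89381 to 1.11094, up to rounding of the printed coefficient and standard error.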

 
