# Poisson Regression

This function fits a Poisson regression model for multivariate analysis of numbers of uncommon events in cohort studies.

The multiplicative Poisson regression model is fitted as a log-linear regression (i.e. a log link and a Poisson error distribution), with an offset equal to the natural logarithm of person-time if person-time is specified (McCullagh and Nelder, 1989; Frome, 1983; Agresti, 2002). With the multiplicative Poisson model, the exponents of coefficients are equal to the incidence rate ratio (relative risk). These baseline relative risks give values relative to named covariates for the whole population. You can define relative risks for a sub-population by multiplying that sub-population's baseline relative risk with the relative risks due to other covariate groupings, for example the relative risk of dying from lung cancer if you are a smoker who has lived in a high radon area. StatsDirect offers sub-population relative risks for dichotomous covariates.

The outcome/response variable is assumed to come from a Poisson distribution. Note that a Poisson distribution is the distribution of the number of events in a fixed time interval, provided that the events occur at random, independently in time and at a constant rate. Poisson distributions are used for modelling events per unit space as well as time, for example number of particles per square centimetre.

Poisson regression can also be used for log-linear modelling of contingency table data, and for multinomial modelling. For contingency table counts you would create r + c indicator/dummy variables as the covariates, representing the r rows and c columns of the contingency table:

 r1c1 r1c2 r1c3 r2c1 r2c2 r2c3 r3c1 r3c2 r3c3

 Response x_r1 x_r2 x_r3 x_c1 x_c2 x_c3 r1c1 1 0 0 1 0 0 r1c2 1 0 0 0 1 0 r1c3 1 0 0 0 0 1 r2c1 0 1 0 1 0 0 r2c2 0 1 0 0 1 0 r2c3 0 1 0 0 0 1 r3c1 0 0 1 1 0 0 r3c2 0 0 1 0 1 0 r3c3 0 0 1 0 0 1

In order to assess the adequacy of the Poisson regression model you should first look at the basic descriptive statistics for the event count data. If the count mean and variance are very different (equivalent in a Poisson distribution) then the model is likely to be over-dispersed.

The model analysis option gives a scale parameter (sp) as a measure of over-dispersion; this is equal to the Pearson chi-square statistic divided by the number of observations minus the number of parameters (covariates and intercept). The variances of the coefficients can be adjusted by multiplying by sp. The goodness of fit test statistics and residuals can be adjusted by dividing by sp. Using a quasi-likelihood approach sp could be integrated with the regression, but this would assume a known fixed value for sp, which is seldom the case. A better approach to over-dispersed Poisson models is to use a parametric alternative model, the negative binomial.

The deviance (likelihood ratio) test statistic, G², is the most useful summary of the adequacy of the fitted model. It represents the change in deviance between the fitted model and the model with a constant term and no covariates; therefore G² is not calculated if no constant is specified. If this test is significant then the covariates contribute significantly to the model.

The deviance goodness of fit test reflects the fit of the data to a Poisson distribution in the regression. If this test is significant then a red asterisk is shown by the P value, and you should consider other covariates and/or other error distributions such as negative binomial.

StatsDirect does not exclude/drop covariates from its Poisson regression if they are highly correlated with one another. Models that are not of full (rank = number of parameters) rank are fully estimated in most circumstances, but you should usually consider combining or excluding variables, or possibly excluding the constant term. You should seek expert statistical if you find yourself in this situation.

Technical validation

The deviance function is: - where y is the number of events, n is the number of observations and μ is the fitted Poisson mean.

The log-likelihood function is: The maximum likelihood regression proceeds by iteratively re-weighted least squares, using singular value decomposition to solve the linear system at each iteration, until the change in deviance is within the specified accuracy.

The Pearson chi-square residual is: The Pearson goodness of fit test statistic is: The deviance residual is (Cook and Weisberg, 1982): -where D(observation, fit) is the deviance and sgn(x) is the sign of x.

The Freeman-Tukey, variance stabilized, residual is (Freeman and Tukey, 1950): The standardized residual is: - where h is the leverage (diagonal of the Hat matrix).

Example

Test workbook (Regression worksheet: Cancers, Subject-years, Veterans, Age group).

To analyse these data using StatsDirect you must first open the test workbook using the file open function of the file menu. Next generate a set of dummy variables to represent the levels of the "Age group" variable using the Dummy Variables function of the Data menu. Then select Poisson from the Regression and Correlation section of the Analysis menu. Click on the option "Counts of events and exposure (person-time), and select the response data type as "Individual". Select the column marked "Cancers" when asked for the response. Then select "Subject-years" when asked for person-time. Then select "Veterans", "Age group (25-29)" , "Age group (30-34)" etc. in one action when you are asked for predictors.

For this example:

Poisson regression

Deviance (likelihood ratio) chi-square = 2067.700372 df = 11 P < 0.0001

 Intercept b0 = -9.324832 z = -45.596773 P < 0.0001 Veterans b1 = -0.003528 z = -0.063587 P = 0.9493 Age group (25-29) b2 = 0.679314 z = 2.921869 P = 0.0035 Age group (30-34) b3 = 1.371085 z = 6.297824 P < 0.0001 Age group (35-39) b4 = 1.939619 z = 9.14648 P < 0.0001 Age group (40-44) b5 = 2.034323 z = 9.413835 P < 0.0001 Age group (45-49) b6 = 2.726551 z = 12.269534 P < 0.0001 Age group (50-54) b7 = 3.202873 z = 14.515926 P < 0.0001 Age group (55-59) b8 = 3.716187 z = 17.064363 P < 0.0001 Age group (60-64) b9 = 4.092676 z = 18.801188 P < 0.0001 Age group (65-69) b10 = 4.23621 z = 18.892791 P < 0.0001 Age group (70+) b11 = 4.363717 z = 19.19183 P < 0.0001

log Cancers [offset log(Veterans)] = -9.324832 -0.003528 Veterans +0.679314 Age group (25-29) +1.371085 Age group (30-34) +1.939619 Age group (35-39) +2.034323 Age group (40-44) +2.726551 Age group (45-49) +3.202873 Age group (50-54) +3.716187 Age group (55-59) +4.092676 Age group (60-64) +4.23621 Age group (65-69) +4.363717 Age group (70+)

Poisson regression - incidence rate ratios

Inference population: whole study (baseline risk)

 Parameter Estimate IRR 95% CI Veterans -0.003528 0.996479 0.89381 to 1.11094 Age group (25-29) 0.679314 1.972524 1.250616 to 3.111147 Age group (30-34) 1.371085 3.939622 2.571233 to 6.036256 Age group (35-39) 1.939619 6.956098 4.590483 to 10.540786 Age group (40-44) 2.034323 7.647073 5.006696 to 11.679905 Age group (45-49) 2.726551 15.280093 9.884869 to 23.620062 Age group (50-54) 3.202873 24.60311 15.96527 to 37.914362 Age group (55-59) 3.716187 41.107367 26.825601 to 62.992647 Age group (60-64) 4.092676 59.899957 39.096281 to 91.773558 Age group (65-69) 4.23621 69.145275 44.555675 to 107.305502 Age group (70+) 4.363717 78.54856 50.303407 to 122.653248

Poisson regression - model analysis

Accuracy = 1.00E-07

Log likelihood with all covariates = -66.006668

Deviance with all covariates = 5.217124, df = 10, rank = 12

Akaike information criterion = 29.217124

Schwartz information criterion = 45.400676

Deviance with no covariates = 2072.917496

Deviance (likelihood ratio, G²) = 2067.700372, df = 11, P < 0.0001

Pseudo (likelihood ratio index) R-square = 0.939986

Pearson goodness of fit = 5.086063, df = 10, P = 0.8854

Deviance goodness of fit = 5.217124, df = 10, P = 0.8762

Over-dispersion scale parameter = 0.508606

Scaled G² = 4065.424363, df = 11, P < 0.0001

Scaled Pearson goodness of fit = 10, df = 10, P = 0.4405

Scaled Deviance goodness of fit = 10.257687, df = 10, P = 0.4182

 Parameter Coefficient Standard Error Constant -9.324832 0.204506 Veterans -0.003528 0.055478 Age group (25-29) 0.679314 0.232493 Age group (30-34) 1.371085 0.217708 Age group (35-39) 1.939619 0.212062 Age group (40-44) 2.034323 0.216099 Age group (45-49) 2.726551 0.222221 Age group (50-54) 3.202873 0.220645 Age group (55-59) 3.716187 0.217775 Age group (60-64) 4.092676 0.217682 Age group (65-69) 4.23621 0.224224 Age group (70+) 4.363717 0.227374

 Parameter Scaled Standard Error Scaled Wald z Constant 0.145847 -63.935674 P < 0.0001 Veterans 0.039565 -0.089162 P = 0.929 Age group (25-29) 0.165806 4.097037 P < 0.0001 Age group (30-34) 0.155262 8.830792 P < 0.0001 Age group (35-39) 0.151235 12.825169 P < 0.0001 Age group (40-44) 0.154115 13.200054 P < 0.0001 Age group (45-49) 0.158481 17.204308 P < 0.0001 Age group (50-54) 0.157357 20.354193 P < 0.0001 Age group (55-59) 0.15531 23.927605 P < 0.0001 Age group (60-64) 0.155243 26.362975 P < 0.0001 Age group (65-69) 0.159909 26.491421 P < 0.0001 Age group (70+) 0.162155 26.910733 P < 0.0001

With 95% confidence you can infer that the risk of cancer in these veterans compared with non-veterans lies between 0.89 and 1.11, i.e. a statistically non-significant effect.