Agreement of Categorical Measurements


Menu locations:
Analysis_Agreement_Categorical (Kappa)
Analysis_Chi-square_Kappa and Maxwell
Analysis_Clinical Epidemiology_Kappa and Maxwell


Agreement Analysis

For two raters, this function gives Cohen's kappa (weighted and unweighted), Scott's pi and Gwet's AC1 as measures of inter-rater agreement on categorical assessments (Fleiss, 1981; Fleiss, 1969; Altman, 1991; Scott, 1955). For three or more raters, it gives extensions of Cohen's kappa due to Fleiss and Cuzick (1979) for two possible responses per rater, and Fleiss, Nee and Landis (1979) for the general case of three or more responses per rater.


If you have only two categories then Scott's pi statistic (with confidence interval constructed by the Donner-Eliasziw (1992) method) for inter-rater agreement (Zwick, 1988) is more reliable than kappa.


Gwet's AC1 is the statistic of choice for the case of two raters (Gwet, 2008). Gwet's agreement coefficient can be used in more contexts than kappa or pi because it does not depend upon the assumption of independence between raters.


Weighted kappa partly compensates for a problem with unweighted kappa, namely that it is not adjusted for the degree of disagreement. Disagreement is weighted in decreasing priority from the top left (origin) of the table. StatsDirect uses the following definitions for weight (1 is the default):

  1. w(ij)=1-abs(i-j)/(g-1)
  2. w(ij)=1-[(i-j)/(g-1)]²
  3. User defined (this is only available via workbook data entry)

g = number of categories

w = weight

i = category for one observer (from 1 to g)

j = category for the other observer (from 1 to g)
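As an illustration, the two built-in weighting schemes can be tabulated for a given number of categories. The sketch below is in Python and is not StatsDirect code; the function name agreement_weights and its scheme argument are hypothetical names chosen for this example.

```python
def agreement_weights(g, scheme=1):
    """Return a g x g list of agreement weights w[i][j].

    scheme 1: w = 1 - abs(i - j) / (g - 1)        (linear, the default)
    scheme 2: w = 1 - ((i - j) / (g - 1)) ** 2    (quadratic)
    Indices are 0-based here; only the difference i - j matters.
    """
    if g < 2:
        raise ValueError("need at least two categories")
    if scheme not in (1, 2):
        raise ValueError("scheme must be 1 or 2")
    weights = []
    for i in range(g):
        row = []
        for j in range(g):
            d = abs(i - j) / (g - 1)  # scaled distance between categories
            row.append(1 - d if scheme == 1 else 1 - d ** 2)
        weights.append(row)
    return weights


linear = agreement_weights(5, scheme=1)
quadratic = agreement_weights(5, scheme=2)
print(linear[0])     # [1.0, 0.75, 0.5, 0.25, 0.0]
print(quadratic[0])  # [1.0, 0.9375, 0.75, 0.4375, 0.0]
```

Both schemes give full credit (1) on the diagonal and none (0) for the most distant pair of categories; the quadratic scheme penalises near misses less than the linear one.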


In broad terms a kappa below 0.2 indicates poor agreement and a kappa above 0.8 indicates very good agreement beyond chance.


Guide (Landis and Koch, 1977):

Kappa          Strength of agreement
< 0.2          Poor
> 0.2 ≤ 0.4    Fair
> 0.4 ≤ 0.6    Moderate
> 0.6 ≤ 0.8    Good
> 0.8 ≤ 1      Very good


N.B. You cannot reliably compare kappa values from different studies because kappa is sensitive to the prevalence of the different categories. If one category is observed more commonly in one study than in another, kappa may indicate a difference in inter-rater agreement that is not due to the raters.


Agreement analysis with more than two raters is a complex and controversial subject; see Fleiss (1981, p. 225).


Disagreement Analysis

StatsDirect uses the methods of Maxwell (1970) to test for differences between the ratings of the two raters (or k nominal responses with paired observations).


Maxwell's chi-square statistic tests for overall disagreement between the two raters. The general McNemar statistic tests for asymmetry in the distribution of subjects about which the raters disagree, i.e. disagreement more over some categories of response than others.


Data preparation

You may present your data for the two-rater methods as a fourfold table in the interactive screen data entry menu option. Otherwise, you may present your data as responses/ratings in columns and rows in a worksheet, where the columns represent raters and the rows represent subjects rated. If you have more than two raters then you must present your data in the worksheet column (rater) row (subject) format. Cells may be left as missing data where a rater did not rate a subject.


Technical validation

All formulae for kappa statistics and their tests are as per Fleiss (1981):

For two raters (m=2) and two categories (k=2):

- where n is the number of subjects rated, w is the weight for agreement or disagreement, po is the observed proportion of agreement, pe is the expected proportion of agreement, pij is the fraction of ratings i by the first rater and j by the second rater, and so is the standard error for testing that the kappa statistic equals zero.
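For reference, in that notation the weighted kappa and its null standard error are usually written as below. This is the standard form attributed to Fleiss, Cohen and Everitt (Fleiss, 1969), given here as an assumption about what the text refers to rather than as StatsDirect's exact implementation. The marginal proportions p_{i.} and p_{.j} and the averaged weights \bar{w} are symbols introduced for this sketch. Unweighted kappa is the special case w_{ij} = 1 when i = j and 0 otherwise.

```latex
\begin{aligned}
p_o &= \sum_{i}\sum_{j} w_{ij}\,p_{ij}, \qquad
p_e = \sum_{i}\sum_{j} w_{ij}\,p_{i.}\,p_{.j} \\[4pt]
\hat{\kappa}_w &= \frac{p_o - p_e}{1 - p_e} \\[4pt]
\bar{w}_{i.} &= \sum_{j} p_{.j}\,w_{ij}, \qquad
\bar{w}_{.j} = \sum_{i} p_{i.}\,w_{ij} \\[4pt]
s_o &= \frac{1}{(1 - p_e)\sqrt{n}}
\sqrt{\sum_{i}\sum_{j} p_{i.}\,p_{.j}
\left[\,w_{ij} - \left(\bar{w}_{i.} + \bar{w}_{.j}\right)\right]^{2} - p_e^{2}}
\end{aligned}
```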


For three or more raters (m>2) and two categories (k =2):

- where xi is the number of positive ratings out of mi raters for subject i of n subjects, and so is the standard error for testing that the kappa statistic equals zero.
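For reference, the Fleiss and Cuzick form of this estimator is usually written as below. It is offered as a reconstruction from the standard presentation, not as a transcription of StatsDirect's code. The mean number of raters \bar{m}, the harmonic mean \bar{m}_H, and the overall proportions \bar{p} and \bar{q} are symbols introduced for this sketch.

```latex
\begin{aligned}
\bar{m} &= \frac{1}{n}\sum_{i=1}^{n} m_i, \qquad
\bar{m}_H = \frac{n}{\sum_{i=1}^{n} 1/m_i}, \qquad
\bar{p} = \frac{\sum_{i=1}^{n} x_i}{n\,\bar{m}}, \qquad
\bar{q} = 1 - \bar{p} \\[4pt]
\hat{\kappa} &= 1 - \frac{\sum_{i=1}^{n} x_i\,(m_i - x_i)/m_i}
{n\,(\bar{m} - 1)\,\bar{p}\,\bar{q}} \\[4pt]
s_o &= \frac{1}{(\bar{m} - 1)\sqrt{n\,\bar{m}_H}}
\sqrt{2\,(\bar{m}_H - 1) +
\frac{(\bar{m} - \bar{m}_H)\,(1 - 4\,\bar{p}\,\bar{q})}{\bar{m}\,\bar{p}\,\bar{q}}}
\end{aligned}
```

When every subject has the same number of raters m, the two means coincide and the standard error reduces to \sqrt{2 / (n\,m\,(m-1))}.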


For three or more raters and categories (m>2, k>2):

- where soj is the standard error for testing kappa equal to zero for each rating category separately, and so bar is the standard error for testing kappa equal to zero for the overall kappa across the k categories. Kappa hat is calculated as for the m>2, k=2 method shown above.
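For reference, the Fleiss, Nee and Landis quantities are usually written as below, assuming the same number of raters m for every subject. This is a reconstruction from the standard presentation and should be read as an assumption, not as StatsDirect's exact formulae. Here x_{ij} (the number of raters assigning subject i to category j), \bar{p}_j and \bar{q}_j are symbols introduced for this sketch.

```latex
\begin{aligned}
\bar{p}_j &= \frac{\sum_{i=1}^{n} x_{ij}}{n\,m}, \qquad
\bar{q}_j = 1 - \bar{p}_j \\[4pt]
\hat{\kappa}_j &= 1 - \frac{\sum_{i=1}^{n} x_{ij}\,(m - x_{ij})}
{n\,m\,(m - 1)\,\bar{p}_j\,\bar{q}_j}, \qquad
\bar{\kappa} = \frac{\sum_{j=1}^{k} \bar{p}_j\,\bar{q}_j\,\hat{\kappa}_j}
{\sum_{j=1}^{k} \bar{p}_j\,\bar{q}_j} \\[4pt]
s_{oj} &= \sqrt{\frac{2}{n\,m\,(m - 1)}} \\[4pt]
\bar{s}_o &= \frac{\sqrt{2}}
{\left(\sum_{j} \bar{p}_j\,\bar{q}_j\right)\sqrt{n\,m\,(m - 1)}}
\sqrt{\left(\sum_{j} \bar{p}_j\,\bar{q}_j\right)^{2}
- \sum_{j} \bar{p}_j\,\bar{q}_j\,(\bar{q}_j - \bar{p}_j)}
\end{aligned}
```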



From Altman (1991).


Altman quotes the results of Brostoff et al. in a comparison not of two human observers but of two different methods of assessment. These methods are RAST (radioallergosorbent test) and MAST (multi-RAST) for testing the sera of individuals for specifically reactive IgE in the diagnosis of allergies. Five categories of result were recorded using each method:


                    RAST:
                    negative  weak  moderate  high  very high
MAST:  negative     86        3     14        0     2
       weak         26        0     10        4     0
       moderate     20        2     22        4     1
       high         11        1     37        16    14
       very high    3         0     15        24    48


To analyse these data in StatsDirect select Categorical from the Agreement section of the Analysis menu. Choose the default 95% confidence interval. Enter the above frequencies as directed on the screen and select the default method for weighting.


For this example:


General agreement over all categories (2 raters)


Cohen's kappa (unweighted)

Observed agreement = 47.38%

Expected agreement = 22.78%

Kappa = 0.318628 (se_0 = 0.026776, se = 0.030423)

95% confidence interval = 0.259 to 0.378256

z (for k = 0) = 11.899574

P < 0.0001


Cohen's kappa (weighted by 1-Abs(i-j)/(k-1))

Observed agreement = 80.51%

Expected agreement = 55.81%

Kappa = 0.558953 (se_0 = 0.038019, se = 0.028507)

95% confidence interval = 0.503081 to 0.614826

z (for kw = 0) = 14.701958

P < 0.0001
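The kappa figures above can be checked by hand. The following Python sketch is not StatsDirect code: it recomputes observed agreement, expected agreement, kappa and the null standard error (se_0) from the example table, using the linear weights defined earlier and the standard Fleiss, Cohen and Everitt expression for se_0. The function name cohen_kappa and the variable names are hypothetical, and the table orientation (rows MAST, columns RAST) is assumed from the example text.

```python
from math import sqrt

# Example frequencies; rows assumed to be MAST, columns RAST,
# categories ordered negative, weak, moderate, high, very high.
table = [
    [86, 3, 14, 0, 2],
    [26, 0, 10, 4, 0],
    [20, 2, 22, 4, 1],
    [11, 1, 37, 16, 14],
    [3, 0, 15, 24, 48],
]


def cohen_kappa(counts, weights):
    """Return (observed, expected, kappa, se0) for a square table of counts."""
    g = len(counts)
    n = sum(map(sum, counts))
    p = [[c / n for c in row] for row in counts]
    row_p = [sum(p[i]) for i in range(g)]
    col_p = [sum(p[i][j] for i in range(g)) for j in range(g)]
    cells = [(i, j) for i in range(g) for j in range(g)]
    po = sum(weights[i][j] * p[i][j] for i, j in cells)
    pe = sum(weights[i][j] * row_p[i] * col_p[j] for i, j in cells)
    kappa = (po - pe) / (1 - pe)
    # Weights averaged over the other rater's marginal distribution.
    w_row = [sum(col_p[j] * weights[i][j] for j in range(g)) for i in range(g)]
    w_col = [sum(row_p[i] * weights[i][j] for i in range(g)) for j in range(g)]
    spread = sum(
        row_p[i] * col_p[j] * (weights[i][j] - (w_row[i] + w_col[j])) ** 2
        for i, j in cells
    )
    se0 = sqrt(spread - pe ** 2) / ((1 - pe) * sqrt(n))
    return po, pe, kappa, se0


g = len(table)
identity = [[1.0 if i == j else 0.0 for j in range(g)] for i in range(g)]
linear = [[1 - abs(i - j) / (g - 1) for j in range(g)] for i in range(g)]

for label, w in (("unweighted", identity), ("linear weights", linear)):
    po, pe, kappa, se0 = cohen_kappa(table, w)
    print(f"{label}: observed={po:.4f} expected={pe:.4f} "
          f"kappa={kappa:.6f} se0={se0:.6f} z={kappa / se0:.4f}")
```

Passing identity weights makes the same function return the unweighted kappa, so one routine covers both cases.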


Scott's pi

Observed agreement = 47.38%

Expected agreement = 24.07%

Pi = 0.30701


Gwet's AC1

Observed agreement = 47.38%

Chance-independent agreement = 18.98%

AC1 = 0.350552 (se = 0.033046)

95% confidence interval = 0.285782 to 0.415322
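Scott's pi and Gwet's AC1 differ from kappa only in how chance agreement is defined; both start from the proportion of subjects on the diagonal of the table. The Python sketch below recomputes the two coefficients for the example using their usual textbook definitions. It is not StatsDirect code; the function name pi_and_ac1 is hypothetical and the table orientation is assumed from the example text.

```python
# Example frequencies; rows assumed to be MAST, columns RAST.
table = [
    [86, 3, 14, 0, 2],
    [26, 0, 10, 4, 0],
    [20, 2, 22, 4, 1],
    [11, 1, 37, 16, 14],
    [3, 0, 15, 24, 48],
]


def pi_and_ac1(counts):
    """Return (observed, pi_chance, pi, ac1_chance, ac1) for a square table."""
    g = len(counts)
    n = sum(map(sum, counts))
    po = sum(counts[i][i] for i in range(g)) / n
    row_tot = [sum(counts[i]) for i in range(g)]
    col_tot = [sum(counts[i][j] for i in range(g)) for j in range(g)]
    # Share of each category, averaged over the two raters' marginals.
    share = [(row_tot[c] + col_tot[c]) / (2 * n) for c in range(g)]
    pe_pi = sum(s ** 2 for s in share)                  # Scott's chance term
    pe_ac1 = sum(s * (1 - s) for s in share) / (g - 1)  # Gwet's chance term
    pi = (po - pe_pi) / (1 - pe_pi)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return po, pe_pi, pi, pe_ac1, ac1


po, pe_pi, pi, pe_ac1, ac1 = pi_and_ac1(table)
print(f"observed={po:.4f}")
print(f"Scott: chance={pe_pi:.4f} pi={pi:.5f}")
print(f"Gwet:  chance={pe_ac1:.4f} AC1={ac1:.6f}")
```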


Disagreement over any category and asymmetry of disagreement (2 raters)

Marginal homogeneity (Maxwell) chi-square = 73.013451, df = 4, P < 0.0001

Symmetry (generalised McNemar) chi-square = 79.076091, df = 10, P < 0.0001
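Both disagreement statistics can be recomputed from the example table. The Python sketch below assumes that the marginal homogeneity statistic takes the Stuart-Maxwell quadratic form and that the generalised McNemar statistic is the Bowker sum over off-diagonal pairs; these are the usual textbook forms, not a transcription of StatsDirect's code. The function names are hypothetical, and a small pure-Python linear solver stands in for a matrix library.

```python
# Example frequencies; rows assumed to be MAST, columns RAST.
table = [
    [86, 3, 14, 0, 2],
    [26, 0, 10, 4, 0],
    [20, 2, 22, 4, 1],
    [11, 1, 37, 16, 14],
    [3, 0, 15, 24, 48],
]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def marginal_homogeneity(counts):
    """Stuart-Maxwell chi-square and degrees of freedom."""
    g = len(counts)
    row = [sum(counts[i]) for i in range(g)]
    col = [sum(counts[i][j] for i in range(g)) for j in range(g)]
    k = g - 1  # drop the last category: the differences sum to zero
    d = [row[i] - col[i] for i in range(k)]
    s = [[row[i] + col[i] - 2 * counts[i][i] if i == j
          else -(counts[i][j] + counts[j][i])
          for j in range(k)] for i in range(k)]
    x = solve(s, d)  # x = S^-1 d, so the statistic is d' S^-1 d
    return sum(di * xi for di, xi in zip(d, x)), k


def symmetry(counts):
    """Bowker symmetry chi-square and degrees of freedom."""
    g = len(counts)
    stat = 0.0
    for i in range(g):
        for j in range(i + 1, g):
            total = counts[i][j] + counts[j][i]
            if total > 0:  # an empty pair contributes nothing
                stat += (counts[i][j] - counts[j][i]) ** 2 / total
    # df counts every off-diagonal pair, matching the output quoted above.
    return stat, g * (g - 1) // 2


print("Marginal homogeneity: chi-square=%.6f, df=%d" % marginal_homogeneity(table))
print("Symmetry: chi-square=%.6f, df=%d" % symmetry(table))
```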


Note that for calculation of standard errors for the kappa statistics, StatsDirect uses a more accurate method than that which is quoted in most textbooks (e.g. Altman, 1991).


The statistically highly significant z tests indicate that we should reject the null hypothesis that the ratings are independent (i.e. kappa = 0) and accept the alternative that agreement is better than one would expect by chance. Do not put too much emphasis on the kappa statistic test; it makes a lot of assumptions and falls into error with small numbers.


Note that Gwet's agreement coefficient does not depend upon the hypothesis of independence between raters, therefore you can use it to reflect the extent of agreement in more contexts than kappa.


The statistically highly significant Maxwell test statistic above indicates that the raters disagree significantly in at least one category. The generalised McNemar statistic indicates the disagreement is not spread evenly.

