# Diversity of Classes

Menu location: **Analysis_Nonparametric_Diversity of Classes**

This function calculates measures of diversity and an estimate of the number of classes in the population given a list of counts of observations in each class from a sample of the population.

Most of the statistical theory used here originates from work in economics (Gini, 1912) and information science (Shannon, 1948), and has been developed further in ecology and genetics. Ecological applications usually involve studies of biodiversity, so the classes are species or other taxa (singular taxon, a group used by a taxonomist). In genetics the classes could be alleles (any of two or more alternative forms of a gene occupying the same chromosomal locus). These principles have been applied to other areas of study such as microbiology (Hunter and Gaston, 1988; Grundmann et al., 2001), and could potentially be applied to many more, such as community development.

This area of study is fraught with potential confusion over terms used to describe concepts. Diversity (or heterogeneity) includes both richness (the number of classes) and evenness (the distribution of individuals among classes). The most useful descriptions of diversity, therefore, present both measures of richness and evenness.

Two commonly used measures are Simpson's index D_{s} and Shannon's index H'. There are many more indices and none is best for all applications (Hurlbert, 1971; Smith, 2002; Kempton, 2002; Brower et al., 1998; Krebs, 1989; Mouillot and Leprêtre, 1999). Common weaknesses of some of these indices are dependence upon a model of class abundance that you don't know in advance, variation with sample size, poor discriminatory ability for specific applications, or poor theoretical justification.

Simpson's index D_{s} (equal to one minus Simpson's original measure of dominance, l, and later proposed by Hurlbert as PIE, the probability of inter-specific encounter) is the most meaningful measure of evenness. D_{s} is the probability that two randomly sampled individuals are from two different classes. This is equivalent to the genetic calculation of heterozygosity, H, being the probability that two alleles are not identical by descent. It follows that 1-D_{s}, or dominance l, is the probability that two randomly sampled individuals are from the same class.

$$D_{s} = 1 - \frac{\sum_{i=1}^{s} n_{i}(n_{i}-1)}{N(N-1)}$$

- where s is the number of classes observed, n_{i} is the number observed from the ith class and N is the total number of individuals observed in the sample. Note that Hurlbert (1971) gives a different form of this equation and that the one above is better because it reduces rounding error by reducing the amount of intermediate division.
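As a minimal sketch, D_{s} can be computed directly from a list of class counts. The function name `simpson_index` is hypothetical (it is not part of StatsDirect), and the counts are those of the Staphylococcus aureus example later on this page.

```python
# Counts from the Grundmann et al. (2001) example on this page:
# CR1 to CR13, followed by CR14-26 with one isolate each.
counts = [30, 13, 9, 8, 7, 7, 7, 6, 6, 5, 2, 2, 2] + [1] * 13


def simpson_index(counts):
    """Return D_s = 1 - sum(n_i * (n_i - 1)) / (N * (N - 1))."""
    n_total = sum(counts)
    # Number of ordered pairs of individuals that share a class.
    same_class_pairs = sum(n * (n - 1) for n in counts)
    return 1 - same_class_pairs / (n_total * (n_total - 1))


d_s = simpson_index(counts)
print(round(d_s, 6))      # 0.899352
print(round(1 - d_s, 6))  # 0.100648, the dominance l
```

Both values agree with the Simpson line of the example output.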

The variance for D_{s} can be estimated as:

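$$\mathrm{var}(D_{s}) = \frac{4}{N}\left[\sum_{i=1}^{s} p_{i}^{3} - \left(\sum_{i=1}^{s} p_{i}^{2}\right)^{2}\right]$$

$$\mathrm{var}(D_{s}) = \frac{4N(N-1)(N-2)\sum_{i=1}^{s} p_{i}^{3} + 2N(N-1)\sum_{i=1}^{s} p_{i}^{2} - 2N(N-1)(2N-3)\left(\sum_{i=1}^{s} p_{i}^{2}\right)^{2}}{\left[N(N-1)\right]^{2}}$$

- where p_{i} = n_{i}/N is the observed proportion in the ith class. The second formula above gives better variance estimates for small samples than does the first (Simpson, 1949; Brower et al., 1998).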
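The sketch below evaluates a large sample and a small sample standard error for D_{s}. It assumes the large sample variance (4/N)[Σp³ - (Σp²)²] and Simpson's small sample variance; these forms reproduce the two standard errors shown in the example output on this page. The function name is hypothetical.

```python
import math

# Counts from the Staphylococcus aureus example on this page.
counts = [30, 13, 9, 8, 7, 7, 7, 6, 6, 5, 2, 2, 2] + [1] * 13


def simpson_standard_errors(counts):
    """Return (large sample SE, small sample SE) for Simpson's D_s."""
    n_total = sum(counts)
    sum_p2 = sum((n / n_total) ** 2 for n in counts)
    sum_p3 = sum((n / n_total) ** 3 for n in counts)

    # Assumed large sample form: (4 / N) * [sum p^3 - (sum p^2)^2]
    var_large = 4 * (sum_p3 - sum_p2 ** 2) / n_total

    # Assumed small sample form, with the common factor N(N-1) cancelled
    # once from numerator and denominator.
    var_small = (
        4 * (n_total - 2) * sum_p3
        + 2 * sum_p2
        - 2 * (2 * n_total - 3) * sum_p2 ** 2
    ) / (n_total * (n_total - 1))

    return math.sqrt(var_large), math.sqrt(var_small)


se_large, se_small = simpson_standard_errors(counts)
print(round(se_large, 6))  # 0.016826
print(round(se_small, 6))  # 0.017173
```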

Shannon's index of diversity H' is derived from information theory, originally in the context of information in telephone systems (Shannon, 1948). It combines both evenness and richness in a single measure. H' has no intuitive interpretation in terms of probability and is sensitive to sample size. H' was once thought to be a measure of entropy, but this is no longer supported (Hurlbert, 1971; Goodman, 1975). H' can lead to confounded comparisons in which the investigator cannot infer whether differences in H' are due to differences in richness, diversity or just sampling differences. StatsDirect calculates H' solely for consistency because it has been used widely in the past.

$$H' = -\sum_{i=1}^{s} \frac{n_{i}}{N}\ln\frac{n_{i}}{N} = \frac{N\ln N - \sum_{i=1}^{s} n_{i}\ln n_{i}}{N}$$

- where s is the number of classes observed, n_{i} is the number observed from the ith class and N is the total number of individuals observed in the sample. Note that some authors use different bases for the logarithms, giving differently scaled results, but it makes no difference which is used provided you are consistent. If you want to convert the natural log results of StatsDirect to log (base 10) results then simply multiply H' by 0.4343.
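A minimal sketch of the calculation in natural logarithms, followed by the base 10 conversion described above. The function name `shannon_index` is hypothetical, and the counts are those of the Staphylococcus aureus example on this page.

```python
import math

# Counts from the Staphylococcus aureus example on this page.
counts = [30, 13, 9, 8, 7, 7, 7, 6, 6, 5, 2, 2, 2] + [1] * 13


def shannon_index(counts):
    """Return H' = (N ln N - sum(n_i ln n_i)) / N using natural logs."""
    n_total = sum(counts)
    # Empty classes contribute nothing, so they are skipped.
    sum_n_ln_n = sum(n * math.log(n) for n in counts if n > 0)
    return (n_total * math.log(n_total) - sum_n_ln_n) / n_total


h_nat = shannon_index(counts)
print(round(h_nat, 6))  # 2.656515

# Converting to base 10 multiplies by log10(e), which is about 0.4343.
h_log10 = h_nat * math.log10(math.e)
print(round(h_log10, 4))  # 1.1537
```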

The variance for H' can be estimated as:

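$$\mathrm{var}(H') = \frac{\sum_{i=1}^{s} p_{i}(\ln p_{i})^{2} - \left(\sum_{i=1}^{s} p_{i}\ln p_{i}\right)^{2}}{N}$$

$$\mathrm{var}(H') = \frac{\sum_{i=1}^{s} p_{i}(\ln p_{i})^{2} - \left(\sum_{i=1}^{s} p_{i}\ln p_{i}\right)^{2}}{N} + \frac{s-1}{2N^{2}}$$

- where p_{i} = n_{i}/N is the observed proportion in the ith class. The second formula above gives better variance estimates for small samples than does the first (Shannon, 1948; Nayak, 1985; Pardo et al., 1997). Note that there is an error in the second formula in Brower et al. (1998).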
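The sketch below evaluates both standard errors for H'. It assumes the large sample variance [Σp(ln p)² - (Σp ln p)²]/N and a small sample version that adds (s-1)/(2N²); these forms reproduce the two standard errors shown in the example output on this page. The function name is hypothetical.

```python
import math

# Counts from the Staphylococcus aureus example on this page.
counts = [30, 13, 9, 8, 7, 7, 7, 6, 6, 5, 2, 2, 2] + [1] * 13


def shannon_standard_errors(counts):
    """Return (large sample SE, small sample SE) for Shannon's H'."""
    n_total = sum(counts)
    probs = [n / n_total for n in counts if n > 0]
    s = len(probs)  # number of classes observed

    sum_p_ln_p = sum(p * math.log(p) for p in probs)
    sum_p_ln_p_sq = sum(p * math.log(p) ** 2 for p in probs)

    # Assumed large sample form.
    var_large = (sum_p_ln_p_sq - sum_p_ln_p ** 2) / n_total
    # Assumed small sample adjustment: add (s - 1) / (2 N^2).
    var_small = var_large + (s - 1) / (2 * n_total ** 2)

    return math.sqrt(var_large), math.sqrt(var_small)


se_large, se_small = shannon_standard_errors(counts)
print(round(se_large, 6))  # 0.095839
print(round(se_small, 6))  # 0.10049
```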

The large sample variance estimates above are used to calculate confidence intervals for D_{s} and H'. These asymptotic estimates of variance do not perform well with small samples, which can be compensated for using the small sample adjustments shown above. A better approach is to use bootstrap confidence intervals in order to get as much information as possible out of your sample. StatsDirect calculates two types of bootstrap confidence interval for diversity indices. These are the bootstrap refinement of the normal asymptotic interval (Mills and Zandvakili, 1997; Dixon et al., 1987; Efron and Tibshirani, 1997):

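$$SE_{b} = \sqrt{\frac{1}{k-1}\sum_{j=1}^{k}\left(g^{*}_{j}-\bar{g}^{*}\right)^{2}}, \qquad CI = g \pm t \cdot SE_{b}$$

- where g is either the Simpson or Shannon statistic calculated from the observed sample, k is the number of bootstrap resamples, g^{*}_{j} is the statistic of interest calculated from the jth bootstrap sample, \bar{g}^{*} is the mean of the k bootstrap statistics, SE_{b} is the bootstrap estimate of standard error and t is a quantile of the Student t distribution.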
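A hedged sketch of this interval, given a list of bootstrap statistics that has already been generated. It assumes the interval is centred on the observed statistic g (as the example output on this page suggests) and that SE_{b} is the usual standard deviation of the bootstrap statistics with divisor k-1. The function name is hypothetical, and the t quantile is passed in by the caller because the Python standard library has no Student t quantile function; with a large number of resamples it is close to 1.96 for a 95% interval.

```python
import statistics


def bootstrap_normal_interval(g, replicates, t_quantile):
    """Return (bias, SE_b, (lower, upper)) for an observed statistic g.

    replicates holds the statistic recalculated on each bootstrap sample.
    """
    k = len(replicates)
    # Bootstrap bias estimate: mean of the replicates minus the observed value.
    bias = sum(replicates) / k - g
    # Assumed SE_b: sample standard deviation of the replicates (divisor k - 1).
    se_b = statistics.stdev(replicates)
    half_width = t_quantile * se_b
    return bias, se_b, (g - half_width, g + half_width)


# Toy replicates chosen so the arithmetic is easy to follow.
bias, se_b, interval = bootstrap_normal_interval(2.5, [1.0, 2.0, 3.0], 2.0)
print(bias, se_b, interval)  # -0.5 1.0 (0.5, 4.5)
```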

…and the symmetrized bootstrap-t interval (Vives et al., 2002; Hall, 1988):

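$$CI = g \pm \hat{G}^{-1}(1-\alpha) \cdot SE$$

- where G is the estimated bootstrap distribution of the absolute value of the studentized sample diversity index, SE is the standard error estimate for g and 1-α is the confidence level.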
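A minimal sketch of a symmetrized bootstrap-t interval in the sense of Hall (1988), given statistics and standard errors already calculated on each bootstrap sample. The choice of standard error used for studentizing and the empirical quantile rule are assumptions, since they are not specified here, and the function name is hypothetical.

```python
import math


def symmetrized_bootstrap_t(g, se, star_stats, star_ses, alpha=0.05):
    """Return (lower, upper) = g -/+ q * se.

    q is the empirical (1 - alpha) quantile of |g*_j - g| / se*_j, the
    absolute studentized statistic over the bootstrap samples.
    """
    abs_t = sorted(
        abs(g_star - g) / se_star
        for g_star, se_star in zip(star_stats, star_ses)
    )
    # Assumed quantile rule: the smallest ordered value with at least a
    # (1 - alpha) share of the bootstrap values at or below it.
    index = min(len(abs_t) - 1, math.ceil((1 - alpha) * len(abs_t)) - 1)
    q = abs_t[index]
    return g - q * se, g + q * se


# Toy inputs: the absolute studentized values are 0, 1, 1, 2, 2.
stats = [9.0, 10.0, 11.0, 12.0, 8.0]
ses = [1.0, 1.0, 1.0, 1.0, 1.0]
print(symmetrized_bootstrap_t(10.0, 2.0, stats, ses, alpha=0.5))  # (8.0, 12.0)
print(symmetrized_bootstrap_t(10.0, 2.0, stats, ses))             # (6.0, 14.0)
```

The interval is symmetric about g by construction, which is the property that distinguishes it from the equal-tailed bootstrap-t interval.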

The resampling scheme used for the bootstrap intervals above is the allocation of one observation to each of s classes followed by allocation at random of the remaining N-s observations to the s classes. This scheme keeps a constant number of classes in each bootstrap sample.
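The resampling scheme can be sketched as follows. How the remaining N-s observations are allocated "at random" is not specified above; this sketch assumes each one goes to a class with probability proportional to that class's observed count. The function name `resample_counts` is hypothetical.

```python
import random

# Counts from the Staphylococcus aureus example on this page.
counts = [30, 13, 9, 8, 7, 7, 7, 6, 6, 5, 2, 2, 2] + [1] * 13


def resample_counts(counts, rng):
    """Draw one bootstrap sample of class counts that keeps every class."""
    s = len(counts)
    n_total = sum(counts)
    # Step 1: allocate one observation to each of the s classes.
    new_counts = [1] * s
    # Step 2: allocate the remaining N - s observations at random.
    # ASSUMPTION: probabilities are proportional to the observed counts.
    for index in rng.choices(range(s), weights=counts, k=n_total - s):
        new_counts[index] += 1
    return new_counts


rng = random.Random(12345)  # fixed seed so the run is repeatable
sample = resample_counts(counts, rng)
print(sum(sample), len(sample), min(sample) >= 1)  # 117 26 True
```

Every bootstrap sample therefore has the same total N and the same number of classes s as the observed sample.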

Vives et al. (2002) showed that percentile methods (including BCa) do not perform well for Shannon's index; they advise using a symmetrized bootstrap-t confidence interval.

StatsDirect also extrapolates the richness (number of classes) in your sample in order to give an estimate of the number of classes in the population. There are different approaches to this extrapolation; a well-founded method that does not assume a model of class abundance is that of Chao (1984), as discussed by Colwell and Coddington (1994).

$$S = s + \frac{a^{2}}{2b}$$

- where S is the estimate of the total number of classes in the population, s is the number of classes observed in the sample, a is the number of classes with exactly one individual (singletons), b is the number of classes with exactly two individuals (doubletons), and where 1 is substituted for a or b if there are no singletons or no doubletons respectively.
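A minimal sketch of the Chao (1984) estimate S = s + a²/(2b), including the substitution of 1 for a or b when there are no singletons or doubletons. The function name `chao_estimate` is hypothetical, and the counts are those of the Staphylococcus aureus example on this page.

```python
# Counts from the Staphylococcus aureus example on this page.
counts = [30, 13, 9, 8, 7, 7, 7, 6, 6, 5, 2, 2, 2] + [1] * 13


def chao_estimate(counts):
    """Return the Chao (1984) estimate of the number of classes."""
    s = sum(1 for n in counts if n > 0)   # classes observed
    a = sum(1 for n in counts if n == 1)  # singletons
    b = sum(1 for n in counts if n == 2)  # doubletons
    # Substitute 1 when there are no singletons or no doubletons.
    a = a or 1
    b = b or 1
    return s + a * a / (2 * b)


s_hat = chao_estimate(counts)
print(round(s_hat, 2))  # 54.17
print(round(s_hat))     # 54
```

Here s = 26, a = 13 and b = 3, giving 26 + 169/6, which rounds to the 54 shown in the example output.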

__Example__

Test workbook (Nonparametric worksheet: Community (RAPD)).

Consider the following counts of numbers of types of Staphylococcus aureus strains found in hospital samples (Grundmann et al., 2001).

| Type | Count | Type | Count |
| --- | --- | --- | --- |
| CR1 | 30 | CR8 | 6 |
| CR2 | 13 | CR9 | 6 |
| CR3 | 9 | CR10 | 5 |
| CR4 | 8 | CR11 | 2 |
| CR5 | 7 | CR12 | 2 |
| CR6 | 7 | CR13 | 2 |
| CR7 | 7 | CR14-26 | 1 each |

To test these data for diversity using StatsDirect you must first prepare them in a workbook column. Alternatively, open the test workbook using the file open function of the file menu. Then select the Diversity of Classes item from the Nonparametric section of the analysis menu. Select the column marked "Community (RAPD)" when prompted for data.

For this example:

__Analysis for Community (RAPD)__

Total number of counts = 117

Number of classes observed (richness) = 26

Estimated total number of classes = 54

Standard error (large sample) = 16.196498

Normal (large sample) 95% CI = 22 to 86

**Simpson Ds (Hurlburt PIE)** = 0.899352, (dominance l = 0.100648, ds = 9.935578)

Standard error (large sample) = 0.016826

Standard error (small sample) = 0.017173

Normal (large sample) 95% CI = 0.866373 to 0.932331

Re-samples = 2000, bias = -0.022018, standard error (bootstrap) = 0.011455

Normal (bootstrap) 95% CI = 0.876886 to 0.921817

**Shannon H' (base e)** = 2.656515

Standard error (large sample) = 0.095839

Standard error (small sample) = 0.10049

Normal (large sample) 95% CI = 2.468673 to 2.844356

Re-samples = 2000, bias = -0.16601, standard error (bootstrap) = 0.065119

Normal (bootstrap) 95% CI = 2.528806 to 2.784223