Univariate summary

 

Menu locations:

Analysis_Descriptive_Univariate Summary;

Analysis_Descriptive_Weighted Univariate Summary.

 

This function provides measures of location and dispersion that describe the data in a worksheet column. For each selected variable you are given the number of observations, arithmetic mean, sum, variance, standard deviation, standard error of the arithmetic mean, variance coefficient, confidence interval for the arithmetic mean, geometric mean, coefficient of skewness, coefficient of kurtosis, maximum, upper quartile, median, lower quartile, minimum and range. You can also choose to calculate an additional quantile, which is appended to the results listed above. Incalculable results are displayed as missing data using an asterisk (*).

 

If you select more than one column of data to describe then you are given an option to save the results to worksheet columns. Each saved column holds one statistic (mean, median, etc.), and each row corresponds to one of the variables (columns) that you selected to describe.

 

Confidence limits (boundaries of the confidence interval) are given for the arithmetic mean. Please see quantile confidence interval for confidence intervals for the median and other measures of location.

 


Please refer to one of the general textbooks listed in the reference section for discussion of the application and relative merits of individual descriptive statistics.

 

Definitions

Valid data and missing data:

For each worksheet column that you select, the number of valid data is the number of cells that can be interpreted as numbers. The remaining cells, which cannot be interpreted as numbers (e.g. empty cell, asterisk or text label), are counted as missing. The sample size used in the calculations below is the number of valid data.
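The valid/missing rule above can be sketched in a few lines of Python. This is only an illustration, not StatsDirect's own parser: the function name is hypothetical, Python's float() stands in for "can be interpreted as a number", and treating non-finite values (nan, inf) as missing is an added assumption.

```python
import math

def count_valid_and_missing(cells):
    """Split worksheet cells into numeric values and a count of missing cells.

    Hypothetical helper: float() is used as the test for "interpretable as a
    number", which is an assumption about how cells are parsed.
    """
    valid = []
    missing = 0
    for cell in cells:
        try:
            value = float(cell)
        except (TypeError, ValueError):
            # Empty cell, asterisk, text label or None: counted as missing.
            missing += 1
            continue
        if math.isfinite(value):
            valid.append(value)
        else:
            # Assumption: nan/inf are not usable observations.
            missing += 1
    return valid, missing

cells = ["299.85", "", "*", "n/a", 299.74, None]
valid, missing = count_valid_and_missing(cells)
print(len(valid), missing)
```

The length of the returned list is the sample size n used by the calculations that follow.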

 

Sum, mean, variance, standard deviation, standard error and variance coefficient:

image\STAT0134_wmf.gif

image\STAT0135_wmf.gif

image\STAT0136_wmf.gif

image\STAT0137_wmf.gif

image\STAT0138_wmf.gif

- where Σ is the summation for all observations (xi) in a sample, x bar is the sample (arithmetic) mean, n is the sample size, s² is the sample variance, s is the sample standard deviation, sem is the standard error of the sample mean, upper and lower CL are the confidence limits of the confidence interval for the mean, t(α, n-1) is the (100*α)% two tailed quantile from the Student t distribution with n-1 degrees of freedom, and vc is the variance coefficient.
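As a rough illustration of the quantities defined above, the following Python sketch computes them for a small sample. It assumes the usual n-1 divisor for the sample variance and vc = s / mean, both of which agree with the example results in this topic. The function name is hypothetical, and the Student t quantile is passed in as a parameter (t_crit) rather than computed, because the Python standard library has no t distribution.

```python
import math

def summarise(x, t_crit):
    """Sum, mean, variance, sd, sem, confidence limits and variance coefficient.

    t_crit is the two tailed Student t quantile for n-1 degrees of freedom,
    supplied by the caller (e.g. looked up in a table).
    """
    n = len(x)
    total = sum(x)
    mean = total / n
    # Assumption: sample variance uses the n-1 divisor.
    variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    sd = math.sqrt(variance)
    sem = sd / math.sqrt(n)
    return {
        "n": n,
        "sum": total,
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "sem": sem,
        "lower_cl": mean - t_crit * sem,
        "upper_cl": mean + t_crit * sem,
        # Assumption: variance coefficient = sd / mean.
        "vc": sd / mean,
    }

# 2.364624 is the 95% two tailed t quantile for 7 degrees of freedom.
result = summarise([2, 4, 4, 4, 5, 5, 7, 9], t_crit=2.364624)
for name, value in result.items():
    print(name, value)
```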

 

Skewness and kurtosis:

image\STAT0139_wmf.gif

 

image\STAT0140_wmf.gif

 

- where Σ is the summation for all observations (xi) in a sample, x bar is the sample mean and n is the sample size. Note that there are other definitions of these coefficients used by some other statistical software. StatsDirect uses the standard definitions for which critical values are published in standard statistical tables (Pearson and Hartley, 1970; Stuart and Ord, 1994).
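The formula images are not reproduced here, so the following Python sketch rests on an assumption: that the coefficients are the classical moment ratios tabulated by Pearson and Hartley, i.e. skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where mk is the k-th central moment computed with divisor n. Under this definition a normal distribution has kurtosis 3, not 0. The function name is hypothetical.

```python
def skewness_kurtosis(x):
    """Moment-based coefficients of skewness and kurtosis.

    Assumption: central moments use divisor n, and kurtosis is NOT
    reduced by 3 (so a normal sample gives a value near 3).
    """
    n = len(x)
    mean = sum(x) / n
    m2 = sum((xi - mean) ** 2 for xi in x) / n
    m3 = sum((xi - mean) ** 3 for xi in x) / n
    m4 = sum((xi - mean) ** 4 for xi in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

skew, kurt = skewness_kurtosis([1, 2, 3, 4, 5])
print(skew, kurt)
```

A symmetric sample such as the one above gives a skewness of zero; a sample with a long right tail gives a positive value.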

 

Geometric mean:

The geometric mean is a useful measure of central tendency for samples that are log-normally distributed (i.e. the logarithms of the observations are from an approximately normal distribution). The geometric mean is not calculated for samples that contain negative values.

image\STAT0141_wmf.gif

- where Σ is the summation for all observations (xi) in a sample, ln is the natural (base e) logarithm, exp is the exponent (anti-logarithm for base e), gm is the sample geometric mean and n is the sample size.
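The definition above, gm = exp(Σ ln(xi) / n), can be sketched in Python as follows. The function name is hypothetical. Returning None stands in for the "incalculable" asterisk, and treating zeros the same way as negative values is an assumption (the logarithm of zero is undefined).

```python
import math

def geometric_mean(x):
    """Geometric mean via the mean of natural logarithms.

    Returns None when the sample contains a negative value (as described
    in the text) or a zero (assumption: ln(0) is undefined).
    """
    if any(xi <= 0 for xi in x):
        return None
    return math.exp(sum(math.log(xi) for xi in x) / len(x))

print(geometric_mean([2, 8]))      # close to 4
print(geometric_mean([1, -2, 3]))  # None
```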

 

Weights:

If weights are selected then the weights that you supply are first normalised so that they sum to the total number of observations n:

image\STAT0142_wmf.gif

- where vi is a user supplied weight and wi is the normalised weight.
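The normalisation step can be illustrated with a short Python sketch. It assumes the natural reading of the description above, wi = n * vi / Σvi, so that the normalised weights sum to n. The function name is hypothetical.

```python
def normalise_weights(v):
    """Rescale user supplied weights so that they sum to the sample size n."""
    n = len(v)
    total = sum(v)
    return [n * vi / total for vi in v]

w = normalise_weights([1, 1, 2])
print(w, sum(w))
```

Because only the ratios between weights matter after this step, supplying weights of 1, 1, 2 or 10, 10, 20 gives the same result.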

 

The following formulae replace the mean, variance and moments calculations defined above when weights are used:

image\STAT0143_wmf.gif

 

Median, quartiles and range:

For samples that are not from an approximately normal distribution, for example when data are censored to remove very large and/or very small values, the following nonparametric statistics should be used in place of the arithmetic mean, its variance and the other parametric measures above.

 

Median (50th centile, quantile 0.5), lower quartile (25th centile, quantile 0.25) and upper quartile (75th centile, quantile 0.75) are defined generally as quantiles:

 

Two different quantile definitions (Weisberg, 1992; Gleason, 1997; Stuart and Ord, 1994) are used in the summary statistics. The first allows for weights; the second is the conventional quantile that is also used in the quantile confidence interval function:

Type 1

image\STAT0144_wmf.gif

- where p is a proportion, Q is the pth quantile (e.g. median is Q(0.5)), u is an observation from a sample after it has been ordered from smallest to largest value, n is the sample size, w is a weight normalised so that it sums to n and

 

Type 2

image\STAT0145_wmf.gif

- where p is a proportion, Q is the pth quantile (e.g. median is Q(0.5)), fix is the integer part of a real number, h is the fractional part of order statistic i, u is an observation from a sample after it has been ordered from smallest to largest value and n is the sample size.
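The Type 2 formula image is not reproduced here, so the Python sketch below rests on an assumption: that the order statistic is i = p(n + 1), a common convention for the conventional quantile, with linear interpolation between neighbouring ordered observations using the fractional part h. The function name and the clamping to the minimum and maximum at the extremes are also assumptions.

```python
def quantile_type2(x, p):
    """Conventional quantile with linear interpolation.

    Assumption: order statistic i = p * (n + 1); fix(i) is its integer
    part and h its fractional part, as in the description above.
    """
    u = sorted(x)          # u[0] is the smallest observation
    n = len(u)
    i = p * (n + 1)
    k = int(i)             # fix(i)
    h = i - k              # fractional part of i
    if k < 1:
        return u[0]        # assumption: clamp to the minimum
    if k >= n:
        return u[-1]       # assumption: clamp to the maximum
    # u[k - 1] is the k-th ordered observation (1-based), u[k] the next one.
    return (1 - h) * u[k - 1] + h * u[k]

data = [5, 1, 4, 2, 3]
print(quantile_type2(data, 0.25), quantile_type2(data, 0.5), quantile_type2(data, 0.75))
```

Under this convention the median of an even-sized sample is the average of the two middle observations.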

 

Technical validation

The computational methods used in StatsDirect univariate summary statistics, including this function, provide 15 decimal places of precision. This is tested against known standards such as the reference data set used in the example below.

 

Example

Test workbook (Parametric worksheet: Michelson).

 

The data are 100 measurements of the speed of light in air (millions of meters per second) recorded by Michelson in 1879 (Dorsey, 1944). The US National Institute of Standards and Technology (NIST) uses these data as part of its Statistical Reference Datasets for testing statistical software (McCullough and Wilson, 1999; http://www.nist.gov/itl/div898/strd).

 

Open the test workbook and select the "Michelson" column. Choose descriptive report from the descriptive section of the analysis menu and click on OK when you see a list of descriptive statistics options.

 

Results from StatsDirect (with decimal places in Analysis_Options set to 12 and centile type 2 selected):

 

Descriptive statistics

Variables                 Michelson
Valid data                100
Missing data              0
Sum                       29985.24
Mean                      299.8524
Variance                  0.006242666667
Standard deviation        0.079010547819
Variance coefficient      0.000263498134
Standard error of mean    0.007901054782
Upper 95% CL of mean      299.868077406834
Lower 95% CL of mean      299.836722593166
Geometric mean            299.852389694496
Skewness                  -0.01825961396
Kurtosis                  3.263530532311
Maximum                   300.07
Upper quartile            299.895
Median                    299.85
Lower quartile            299.805
Minimum                   299.62
Range                     0.45
Centile 95                299.98
Centile 5                 299.73
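The reported figures can be cross-checked against one another. The Python sketch below copies values from the results above and tests the relationships between them (mean = sum / n, sd = √variance, sem = sd / √n, vc = sd / mean, range = maximum - minimum). It also recovers the t multiplier implied by the confidence limits. The variable names are illustrative, and the relationships for the variance coefficient and confidence limits are assumptions that the figures happen to satisfy.

```python
import math

# Values copied from the StatsDirect results for the Michelson column.
n = 100
total = 29985.24
mean = 299.8524
variance = 0.006242666667
sd = 0.079010547819
sem = 0.007901054782
vc = 0.000263498134
upper_cl = 299.868077406834
lower_cl = 299.836722593166
maximum, minimum, value_range = 300.07, 299.62, 0.45

checks = {
    "mean = sum / n": math.isclose(total / n, mean, abs_tol=1e-9),
    "sd = sqrt(variance)": math.isclose(math.sqrt(variance), sd, abs_tol=1e-9),
    "sem = sd / sqrt(n)": math.isclose(sd / math.sqrt(n), sem, abs_tol=1e-9),
    "vc = sd / mean": math.isclose(sd / mean, vc, abs_tol=1e-9),
    "range = max - min": math.isclose(maximum - minimum, value_range, abs_tol=1e-9),
}
for name, ok in checks.items():
    print(name, "OK" if ok else "MISMATCH")

# Half-width of the interval divided by sem gives the implied t multiplier,
# which should match the 95% two tailed t quantile for 99 degrees of freedom.
implied_t = (upper_cl - lower_cl) / (2 * sem)
print("implied t multiplier:", round(implied_t, 4))
```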