Analysis_Descriptive_Weighted Univariate Summary.
This function provides measures of location and dispersion that describe the data in a worksheet column. For each selected variable you are given the number of valid data, arithmetic mean, sum, variance, standard deviation, standard error of the arithmetic mean, variance coefficient, confidence interval for the arithmetic mean, geometric mean, coefficient of skewness, coefficient of kurtosis, maximum, upper quartile, median, lower quartile, minimum and range. You can also choose to calculate an additional quantile, which is appended to the results listed above. Incalculable results are displayed as missing data using an asterisk (*).
If you select more than one column of data to describe then you are given an option to save the results to worksheet columns. Saved columns of results represent the statistics, mean, median etc., and their rows represent the variables/columns you selected to describe.
Confidence limits (boundaries of the confidence interval) are given for the arithmetic mean. Please see quantile confidence interval for confidence intervals for the median and other measures of location.
Some related topics:
- central tendency
- variance, standard deviation and spread
- normal distribution
- quantile confidence intervals
Please refer to one of the general textbooks listed in the reference section for discussion of the application and relative merits of individual descriptive statistics.
Valid data and missing data
For each worksheet column that you select, the number of valid data is the number of cells that can be interpreted as numbers. The remaining cells, which cannot be interpreted as numbers (e.g. an empty cell, an asterisk or a text label), are counted as missing. The sample size used in the calculations below is the number of valid data.
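As a rough illustration of the rule above, the following Python sketch splits a column of raw cell contents into valid and missing entries. The function name `split_valid` is hypothetical, and treating non-finite values such as `nan` as missing is an assumption of this example rather than documented StatsDirect behaviour.

```python
import math

def split_valid(cells):
    """Return (valid_numbers, missing_count) for a list of raw cell contents."""
    valid = []
    missing = 0
    for cell in cells:
        try:
            value = float(str(cell).strip())
        except ValueError:
            missing += 1  # empty cell, asterisk or text label
            continue
        if math.isfinite(value):
            valid.append(value)
        else:
            missing += 1  # assumption: nan/inf are not usable observations
    return valid, missing

column = ["299.85", "", "*", "n/a", "299.74", 299.9]
values, n_missing = split_valid(column)
n = len(values)  # sample size used in the later calculations
```

Here three cells are read as numbers and three are counted as missing, so the calculations would use n = 3.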
Sum, mean, variance, standard deviation, standard error and variance coefficient
- where Σ is the summation for all observations (xi) in a sample, x bar is the sample (arithmetic) mean, n is the sample size, s² is the sample variance, s is the sample standard deviation, SEM is the standard error of the sample mean, upper and lower CL are the confidence limits of the confidence interval for the mean, t(α, n-1) is the (100*α)% two-tailed quantile from the Student t distribution with n-1 degrees of freedom, and VC is the variance coefficient.
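The quantities defined above can be sketched in Python as follows. This is a minimal illustration using the usual textbook formulae (variance with an n-1 denominator, SEM = s/√n, CL = mean ± t × SEM). The function name `parametric_summary` is hypothetical, the t quantile is passed in by the caller rather than computed, and expressing VC as a percentage of the mean is an assumption.

```python
import math

def parametric_summary(x, t_crit):
    """Sum, mean, variance, SD, SEM, confidence limits and variance coefficient.

    x      : list of valid observations (needs at least two values)
    t_crit : two-tailed Student t quantile for n-1 degrees of freedom,
             supplied by the caller (e.g. looked up in tables)
    """
    n = len(x)
    total = math.fsum(x)
    mean = total / n
    variance = math.fsum((xi - mean) ** 2 for xi in x) / (n - 1)
    sd = math.sqrt(variance)
    sem = sd / math.sqrt(n)
    return {
        "n": n,
        "sum": total,
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "sem": sem,
        "lower_cl": mean - t_crit * sem,
        "upper_cl": mean + t_crit * sem,
        # assumption: VC expressed as a percentage of the mean
        "vc": 100 * sd / mean,
    }

# t_crit = 2.776 is the tabulated 95% two-tailed value for 4 degrees of freedom
stats = parametric_summary([2.0, 4.0, 4.0, 5.0, 5.0], t_crit=2.776)
```

For this small sample the mean is 4 and the variance is 1.5, so the confidence limits lie symmetrically either side of 4.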
Skewness and kurtosis
- where Σ is the summation for all observations (xi) in a sample, x bar is the sample mean and n is the sample size. Note that there are other definitions of these coefficients used by some other statistical software. StatsDirect uses the standard definitions for which critical values are published in standard statistical tables (Pearson and Hartley, 1970; Stuart and Ord, 1994).
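For illustration only, the sketch below computes moment-ratio coefficients of skewness and kurtosis. It assumes the definitions m3/m2^1.5 and m4/m2² built from central moments with an n denominator, which is the form usually tabulated as √b1 and b2. That choice is an assumption of this example, and the function name `moment_coefficients` is hypothetical.

```python
import math

def moment_coefficients(x):
    """Return (skewness, kurtosis) as ratios of central moments.

    Assumed definitions: skewness = m3 / m2**1.5, kurtosis = m4 / m2**2,
    where mk is the k-th central moment with an n denominator.
    """
    n = len(x)
    mean = math.fsum(x) / n
    m2 = math.fsum((xi - mean) ** 2 for xi in x) / n
    m3 = math.fsum((xi - mean) ** 3 for xi in x) / n
    m4 = math.fsum((xi - mean) ** 4 for xi in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

skewness, kurtosis = moment_coefficients([1.0, 2.0, 3.0, 4.0, 5.0])
```

Under these definitions a symmetric sample has skewness 0 and a normal distribution has kurtosis 3. Software that reports "excess" kurtosis subtracts 3, which is one reason results can differ between packages.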
The geometric mean is a useful measure of central tendency for samples that are log-normally distributed (i.e. the logarithms of the observations are from an approximately normal distribution). The geometric mean is not calculated for samples that contain negative values.
- where Σ is the summation for all observations (xi) in a sample, ln is the natural (base e) logarithm, exp is the exponent (anti-logarithm for base e), GM is the sample geometric mean and n is the sample size.
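The definition above, the exponent of the mean of the natural logarithms, can be sketched in Python as follows. The function name `geometric_mean` is hypothetical. Returning `None` for zero as well as negative values is an assumption of this example, made because the logarithm is undefined for both.

```python
import math

def geometric_mean(x):
    """exp(mean of ln(xi)); returns None when the sample has non-positive values."""
    if any(xi <= 0 for xi in x):
        return None  # not calculated: ln is undefined for these values
    return math.exp(math.fsum(math.log(xi) for xi in x) / len(x))

gm = geometric_mean([1.0, 10.0, 100.0])   # approximately 10
skipped = geometric_mean([-1.0, 2.0])     # None
```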
If weights are selected then the weights that you supply are first normalised so that they sum to the total number of observations n:
- where vi is a user supplied weight and wi is the normalised weight.
The following formulae replace the mean, variance and moments calculations defined above when weights are used:
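As a hedged sketch of the weighting step, the Python below normalises user-supplied weights so that they sum to n and then forms a weighted mean. The function names are hypothetical. The weighted mean shown is the usual Σ(wi × xi) / Σwi, given here as an assumption rather than as the exact formula used by the software; weighted variance and moments are not shown.

```python
import math

def normalise_weights(v):
    """Rescale user-supplied weights vi so that the result sums to n."""
    n = len(v)
    total = math.fsum(v)
    return [n * vi / total for vi in v]

def weighted_mean(x, w):
    """Assumed form: sum(wi * xi) / sum(wi)."""
    return math.fsum(wi * xi for wi, xi in zip(w, x)) / math.fsum(w)

x = [10.0, 20.0, 30.0]
v = [1.0, 1.0, 2.0]          # user-supplied weights
w = normalise_weights(v)     # [0.75, 0.75, 1.5], which sums to n = 3
wm = weighted_mean(x, w)     # 22.5
```

Because the normalisation only rescales the weights, the weighted mean is the same whether it is computed from vi or wi.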
Median, quartiles and range
For samples that are not from an approximately normal distribution, for example when data are censored to remove very large and/or very small values, the following nonparametric statistics should be used in place of the arithmetic mean, its variance and the other parametric measures above.
Median (50th centile, quantile 0.5), lower quartile (25th centile, quantile 0.25) and upper quartile (75th centile, quantile 0.75) are defined generally as quantiles:
Two different quantile definitions (Weisberg, 1992; Gleason, 1997; Stuart and Ord, 1994) are used in the summary statistics (see also: quantiles). The first is the conventional quantile that is also used in the quantile confidence interval function; the second allows for weights:
- where p is a proportion, Q is the pth quantile (e.g. median is Q(0.5)), fix is the integer part of a real number, h is the fractional part of order statistic i, u is an observation from a sample after it has been ordered from smallest to largest value and n is the sample size.
- where p is a proportion, Q is the pth quantile (e.g. median is Q(0.5)), u is an observation from a sample after it has been ordered from smallest to largest value, n is the sample size, w is a weight normalised so that it sums to n and
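For illustration, an unweighted quantile by linear interpolation between order statistics can be sketched as below. The position rule i = p(n+1) is an assumption of this example, chosen because it is a common conventional definition; it may not match the rule used by either centile type in the software. The function name `quantile` is hypothetical and the weighted definition is not attempted.

```python
def quantile(sample, p):
    """Interpolated quantile Q(p) from the ordered sample u[1..n].

    Assumed position rule: i = p * (n + 1), j = fix(i), h = i - j,
    Q = u[j] + h * (u[j+1] - u[j]), clamped to the smallest/largest value.
    """
    u = sorted(sample)
    n = len(u)
    i = p * (n + 1)
    j = int(i)        # fix: integer part of i
    h = i - j         # fractional part of i
    if j < 1:
        return u[0]
    if j >= n:
        return u[-1]
    # u is 0-indexed in Python, so order statistic j is u[j - 1]
    return u[j - 1] + h * (u[j] - u[j - 1])

median = quantile([4.0, 1.0, 3.0, 2.0], 0.5)                 # 2.5
lower_quartile = quantile([1.0, 2.0, 3.0, 4.0, 5.0], 0.25)   # 1.5
```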
The computational methods used in StatsDirect univariate summary statistics, including this function, provide 15 decimal places of precision. This is tested against known standards such as the reference data set used in the example below.
Test workbook (Parametric worksheet: Michelson).
The data are 100 measurements of the speed (millions of meters per second) of light in air recorded by Michelson in 1879 (Dorsey, 1944). The American National Institute of Standards and Technology use these data as part of the Statistical Reference Datasets for testing statistical software (McCullough and Wilson, 1999; http://www.itl.nist.gov/div898/strd/).
Open the test workbook and select the "Michelson" column. Choose Univariate Summary from the Descriptive section of the analysis menu and click on OK when you see a list of descriptive statistics options.
Results from StatsDirect (with decimal places in Analysis_Options set to 12 and centile type 2 selected):
|Standard error of mean||0.007901054782|
|Upper 95% CL of mean||299.868077406834|
|Lower 95% CL of mean||299.836722593166|
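As a quick consistency check on the three values listed above, the short Python sketch below recovers the sample mean as the midpoint of the confidence limits and the multiplier applied to the standard error. The variable names are chosen for this example only.

```python
sem = 0.007901054782
upper_cl = 299.868077406834
lower_cl = 299.836722593166

mean = (upper_cl + lower_cl) / 2          # midpoint of the interval
half_width = (upper_cl - lower_cl) / 2    # t quantile multiplied by SEM
t_multiplier = half_width / sem

print(f"mean = {mean:.4f}, t multiplier = {t_multiplier:.4f}")
```

The implied multiplier is about 1.984, in line with the two-tailed 95% Student t quantile for n-1 = 99 degrees of freedom.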