Transforming Data


Practical issues



Transformation is a mathematical operation that changes the measurement scale of a variable. This is usually done to make a set of data usable with a particular statistical test or method.


Many statistical methods require data that follow a particular kind of distribution, usually a normal distribution: all of the observations must come from a population that follows a normal distribution, and groups of observations must come from populations that have the same variance or standard deviation. Conveniently, transformations that normalize a distribution commonly also make the variance more uniform, and vice versa.


If a population with a normal distribution is sampled at random, then the means of the samples will not be correlated with the standard deviations of the samples. This partly explains why normalizing transformations also make variances uniform. The Central Limit Theorem (the means of a large number of samples follow a normal distribution) is key to understanding this situation.


Many biomedical observations will be a product of different influences; for example, the resistance of blood vessels and output from the heart are two of the influences most closely related to blood pressure. In mathematical terms these influences usually multiply together to give an overall influence, so if we take the logarithm of the overall influence then this is the sum of the individual influences [log(A * B) = log(A) + log(B)]. The Central Limit Theorem thus dictates that the logarithm of the product of several influences approximately follows a normal distribution.
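A quick simulation illustrates the point. Here each "observation" is the product of eight made-up positive influences (the uniform range, the number of influences, and the seed are arbitrary for illustration); the raw products are strongly right-skewed, while their logs, being sums, are close to symmetric.

```python
import math
import random

random.seed(2)

def skewness(xs):
    """Sample skewness: third central moment divided by SD cubed."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum((x - m) ** 3 for x in xs) / (n * sd ** 3)

# Each observation is the PRODUCT of several independent positive influences.
products = []
for _ in range(5000):
    influences = [random.uniform(0.5, 2.0) for _ in range(8)]
    products.append(math.prod(influences))

# Taking logs turns the product into a SUM [log(A*B) = log(A) + log(B)],
# so the Central Limit Theorem pulls the distribution towards normality.
logs = [math.log(p) for p in products]

print(round(skewness(products), 2), round(skewness(logs), 2))
```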


Another general rule is that any relationship between mean and variance is usually simple: variance proportional to group mean, to mean squared, to mean to the power x, etc. A transformation is used to cancel out this relationship and thus make the variance independent of the mean. The most common situation is for the variance to be proportional to the square of the mean (i.e. the standard deviation is proportional to the mean); here log transformation is used (e.g. serum cholesterol). Square root transformation is used when the variance is proportional to the mean, for example with Poisson distributed data. Observations that are counted in time and/or space (e.g. cases of meningococcal meningitis in a city in a year) often follow a Poisson distribution; here the mean is equal to the variance. With highly variable quantities such as serum creatinine, the standard deviation is often proportional to the square of the mean (i.e. the variance is proportional to the mean to the power of 4); here the reciprocal transformation (1/X) is used.
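The most common case above (standard deviation proportional to the mean) can be demonstrated with simulated data. In this sketch the group means, the coefficient of variation, and the multiplicative error model are made up for illustration: on the raw scale the group SDs grow with the group means, while after log transformation the SDs are roughly constant.

```python
import math
import random
import statistics

random.seed(3)

def group(mean, cv=0.3, n=500):
    """Simulate a group whose SD is roughly proportional to its mean."""
    # Multiplicative error X = mean * exp(sigma * Z) keeps SD/mean constant.
    sigma = math.sqrt(math.log(1 + cv ** 2))
    return [mean * random.lognormvariate(0.0, sigma) for _ in range(n)]

groups = [group(m) for m in (2.0, 10.0, 50.0)]

raw_sds = [statistics.stdev(g) for g in groups]
log_sds = [statistics.stdev([math.log(x) for x in g]) for g in groups]

# Raw SDs track the group means; log-scale SDs are nearly equal.
print([round(s, 2) for s in raw_sds])
print([round(s, 2) for s in log_sds])
```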


Transformations that cancel out the relationship between variance and mean also usually normalize the distribution of the data. Common statistical methods can then be used on the transformed data. Only some of the results of such tests, however, can be converted back to the original measurement scale of the data; the rest must be expressed in terms of the transformed variable(s) (e.g. log(serum triglyceride) as a predictor in a regression model). An example of a back-transformed statistic is the geometric mean and its confidence interval: the antilog of the mean of log-transformed data is the geometric mean, and its confidence interval is the antilog of the confidence interval for the mean of the log-transformed data.
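The geometric mean calculation can be sketched directly. The triglyceride-like values below are made up, and 1.96 is the normal approximation to the t critical value (for a real analysis the t distribution with n-1 degrees of freedom would be used):

```python
import math
import statistics

# Made-up serum triglyceride-like values (mmol/L).
data = [0.8, 1.1, 1.3, 0.9, 2.4, 1.7, 1.0, 3.1, 1.5, 2.0]

# Work on the log scale: mean and standard error of the logs.
logs = [math.log(x) for x in data]
n = len(logs)
m = statistics.mean(logs)
se = statistics.stdev(logs) / math.sqrt(n)

# Back-transform (antilog) the mean and its CI to the original scale.
geometric_mean = math.exp(m)
ci_low, ci_high = math.exp(m - 1.96 * se), math.exp(m + 1.96 * se)

print(round(geometric_mean, 3), round(ci_low, 3), round(ci_high, 3))
```

Note that the back-transformed interval is not symmetric about the geometric mean, which is expected for skewed data.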


If you are unsure about the use of a transformation then take the advice of a statistician. There are exploratory statistical techniques (e.g. Box-Cox plots, Q-Q plots) that statisticians can use to help find an optimal transformation for your data. Proper application of such techniques requires specialist statistical knowledge and skills.
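To give a flavour of the Box-Cox approach, the sketch below does a crude grid search for the power lambda that maximizes the Box-Cox profile log-likelihood (the data values and the grid are made up for illustration; this is not a substitute for proper statistical software or advice):

```python
import math
import statistics

def boxcox(x, lam):
    """Box-Cox transform of a single positive value."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def boxcox_loglik(data, lam):
    """Profile log-likelihood of lambda (up to an additive constant)."""
    n = len(data)
    y = [boxcox(x, lam) for x in data]
    return (-n / 2 * math.log(statistics.pvariance(y))
            + (lam - 1) * sum(math.log(x) for x in data))

def best_lambda(data, grid=None):
    """Crude grid search for the lambda maximizing the log-likelihood."""
    if grid is None:
        grid = [step / 10 for step in range(-20, 21)]  # -2.0 to 2.0 by 0.1
    return max(grid, key=lambda lam: boxcox_loglik(data, lam))

# Strongly right-skewed (roughly geometric) made-up data: the search
# should settle near lambda = 0, i.e. close to a log transformation.
data = [1.1, 1.4, 2.0, 2.7, 3.9, 5.5, 8.1, 11.0, 16.4, 24.8]
print(best_lambda(data))
```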