Reliability

Reliability in scientific investigation usually means the stability and repeatability of measures, or the ability of a test to produce the same results under the same conditions.

For a test to be valid it must first be reliable.

Stability is determined by random and systematic errors of the measure and the way the measure is applied in a study.

If a measure has a large random error, i.e. it is very noisy, it cannot reliably discriminate differences that may be important.

If a measure has a large systematic error, for example weighing scales that always weigh 1 kg too light, then it is biased or inaccurate. Biased measures can be very precise and repeatable, but in such situations they are precisely and repeatably under-measuring or over-measuring the true value, so they are inaccurate and thus invalid. Biases can be subtle, for example the use of a Cattell B intelligence quotient test in a population where English is not the first language may under-measure IQ in those who do not communicate primarily in English.
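The distinction between random and systematic error can be illustrated with a small sketch. The readings below are invented for illustration (they are not data from any study), and the true weight of 70 kg and the function name bias_and_spread are assumptions made only for this example. The mean error estimates the systematic error (bias) and the standard deviation of repeated readings estimates the random error (noise).

```python
from statistics import mean, stdev

TRUE_WEIGHT_KG = 70.0  # hypothetical true value, assumed known for the example

# Hypothetical repeated readings of the same object on two different scales.
noisy_scale = [68.1, 72.4, 69.0, 71.6, 67.9, 71.0]    # large random error, little bias
biased_scale = [69.0, 69.1, 68.9, 69.0, 69.1, 68.9]   # small random error, reads about 1 kg light


def bias_and_spread(readings, true_value):
    """Return (mean error, sample standard deviation) for repeated readings."""
    return mean(readings) - true_value, stdev(readings)


noisy_bias, noisy_sd = bias_and_spread(noisy_scale, TRUE_WEIGHT_KG)
biased_bias, biased_sd = bias_and_spread(biased_scale, TRUE_WEIGHT_KG)

print(f"noisy scale:  bias = {noisy_bias:+.2f} kg, sd = {noisy_sd:.2f} kg")
print(f"biased scale: bias = {biased_bias:+.2f} kg, sd = {biased_sd:.2f} kg")
```

In this made-up example the second scale is far more repeatable than the first, yet every reading it gives is wrong by roughly the same amount.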

In the context of questionnaires, it may be difficult to strike an acceptable balance between stability and practicality. This is because investigators often seek to find out as much information as possible whilst they have the attention of someone completing a questionnaire. This results in questionnaires that are multi-dimensional, i.e. they measure more than one construct/concept. The responses to the elements (questions) in such questionnaires might not correlate well with one another. On grounds of reliability, a test should ideally be one-dimensional. Whether you are aiming for a one-dimensional test or accepting the weaknesses of a multi-dimensional test, there is a balance to strike between how well the elements of the test hang together and the potential increase in validity that is gained by measuring a construct in multiple ways.

Ideally we would measure the reliability of a test by repeating it many times on a large group of individuals, and analysing the results using statistical methods for measuring agreement. Usually, however, this is not possible because of practical difficulties; changes over time in what is being measured, which differ from one individual to another; and confounding effects such as individuals remembering their responses from previous applications of the test. To overcome these problems, statisticians developed the split-half study design, where the elements of a test are split at random into two halves and the test score for an individual is calculated twice, once with one half of the test elements and once with the other half. Tests with strong internal consistency show strong correlation between the scores calculated from the two halves. Cronbach extended this idea to consider every possible way of splitting the test into its component elements, resulting in Cronbach's alpha coefficient for scale reliability. By convention, alpha of 0.8 or more is considered acceptable in questionnaire design.
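The split-half correlation and Cronbach's alpha can be sketched in a few lines. The response matrix below is invented for illustration (six hypothetical respondents answering four questions), and the function names are assumptions made for this example only. Two simplifications are worth noting: the split uses alternate questions rather than a random split, so that the result is reproducible, and alpha is computed with the usual variance form, k/(k-1) multiplied by (1 minus the sum of the item variances divided by the variance of the total score), where k is the number of questions.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_correlation(rows):
    """Correlate half-test scores; the halves are alternate questions (not a random split)."""
    half_a = [sum(r[0::2]) for r in rows]  # questions 1, 3, ...
    half_b = [sum(r[1::2]) for r in rows]  # questions 2, 4, ...
    return pearson(half_a, half_b)


def cronbach_alpha(rows):
    """Cronbach's alpha from a respondents-by-questions table of scores."""
    k = len(rows[0])                                   # number of questions
    item_vars = [variance(col) for col in zip(*rows)]  # variance of each question
    total_var = variance([sum(r) for r in rows])       # variance of the total score
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: one row per respondent, one column per question.
scores = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
]

r_halves = split_half_correlation(scores)
alpha = cronbach_alpha(scores)
print(f"split-half correlation = {r_halves:.3f}")
print(f"Cronbach's alpha       = {alpha:.3f}")
```

With real questionnaire data the same functions would be applied to the full table of responses; a sample of six respondents is far too small for a meaningful estimate and is used here only to keep the example short.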