LOESS Curve Fitting (Local Polynomial Regression)

Menu location: Analysis_LOESS.

This is a method for fitting a smooth curve between two variables, or fitting a smooth surface between an outcome and up to four predictor variables.

The procedure originated as LOWESS (LOcally WEighted Scatter-plot Smoother). Since then it has been extended as a modelling tool because it has some useful statistical properties (Cleveland, 1998).

This is a nonparametric method because the linearity assumptions of conventional regression methods have been relaxed. Instead of estimating parameters like m and c in y = mx + c, a nonparametric regression focuses on the fitted curve. The fitted points and their standard errors are estimated with respect to the whole curve rather than a particular parameter estimate. The overall uncertainty is therefore measured as how well the estimated curve fits the population curve.

It is called local regression because the fitting at, say, point x is weighted toward the data nearest to x. The distance from x that is considered near to it is controlled by the span setting, α. When α is less than 1 it represents the proportion of the data that is considered to be neighbouring x, and the weighting that is used is proportional to (1 - (distance/maximum distance)^3)^3, which is known as tricubic. When α is greater than 1 all of the points are used and the maximum distance is taken as α^(1/p) times the observed maximum distance for p predictors. The default span is α = 0.75. If you choose a span that is too small then there will be insufficient data near x for an accurate fit, resulting in a large variance. If the span is too large then the regression will be over-smoothed, resulting in a loss of information and hence a large bias.
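The weighting described above can be sketched in a few lines of Python. This is only an illustration for a single predictor, not the code that StatsDirect or R runs: the helper names tricube and local_weights are hypothetical, and the rule used to turn the span into a number of neighbouring points (rounding up) is an assumption.

```python
import math


def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside the neighbourhood, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_weights(xs, x0, span=0.75, n_predictors=1):
    """Weights for a local fit at x0 with one predictor (hypothetical helper)."""
    dists = [abs(x - x0) for x in xs]
    if span < 1.0:
        # Assumption: the neighbourhood is the ceil(span * n) nearest points.
        q = max(1, math.ceil(span * len(xs)))
        max_dist = sorted(dists)[q - 1]
    else:
        # All points are used; the maximum distance is scaled by span^(1/p).
        max_dist = max(dists) * span ** (1.0 / n_predictors)
    if max_dist == 0:
        raise ValueError("all neighbours coincide with x0; weights are undefined")
    return [tricube(d / max_dist) for d in dists]


# Ten equally spaced points, fitting at the left-hand boundary.
xs = list(range(10))
narrow = local_weights(xs, x0=0, span=0.5)  # only the nearest half get weight
wide = local_weights(xs, x0=0, span=2.0)    # every point gets some weight
```

With the narrower span, points beyond the neighbourhood receive zero weight, whereas with a span above 1 even the furthest point contributes to the fit.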

The trade-off between bias and variance also depends on the degree of the polynomial selected. A high degree will provide a better approximation of the population mean, so less bias, but there are more terms to estimate in the model, resulting in greater variance. The default degree is 2 (quadratic). Higher degrees don't improve the fit much. The lower degree (i.e. 1, linear) has more bias but reduces variance at the boundaries.
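To make the role of the polynomial degree concrete, the following Python sketch fits a tricube-weighted polynomial around a single point and returns the fitted value there. It is a minimal illustration under stated assumptions, not the cloess algorithm: it handles one predictor, a span below 1, and distinct x values, and the names loess_point and solve are hypothetical.

```python
import math


def tricube(u):
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def loess_point(xs, ys, x0, span=0.75, degree=2):
    """Fitted value at x0 from a tricube-weighted polynomial of the given degree."""
    dists = [abs(x - x0) for x in xs]
    # Assumption: neighbourhood size is ceil(span * n), with enough points to fit.
    q = min(len(xs), max(degree + 2, math.ceil(span * len(xs))))
    h = sorted(dists)[q - 1]
    w = [tricube(d / h) for d in dists]
    # Centre the predictor on x0 so that the intercept is the fitted value at x0.
    dx = [x - x0 for x in xs]
    k = degree + 1
    a = [[sum(wi * d ** (r + c) for wi, d in zip(w, dx)) for c in range(k)]
         for r in range(k)]
    b = [sum(wi * yi * d ** r for wi, yi, d in zip(w, ys, dx)) for r in range(k)]
    return solve(a, b)[0]


# Noise-free quadratic data: degree 2 recovers the curve, degree 1 is biased.
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x + 0.5 * x * x for x in xs]
quadratic_fit = loess_point(xs, ys, 5.0, degree=2)
linear_fit = loess_point(xs, ys, 5.0, degree=1)
```

On curved data like this, the local quadratic fit tracks the true value closely while the local linear fit is pulled away from it, which is the bias referred to above. With noisy data the extra term in the quadratic fit would in turn add variance.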

There is no substitute for thinking carefully about what you are plotting, testing different settings of span and polynomial degree, and selecting the most plausible fit by eye. The summary statistics also give an indication of how well the model fits.

The concept of degrees of freedom for nonparametric models is complex. Nonparametric methods approximate the parametric concept of degrees of freedom empirically, and the resulting numbers are not necessarily integers.
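One common empirical definition for smoothers whose fitted values are linear in y is the trace of the smoother matrix, i.e. the sum over observations of the weight each fitted value places on its own observation. The Python sketch below computes that trace for a local-linear (degree 1) tricube fit, chosen because it has a short closed form. Whether the "Equivalent Number of Parameters" reported by the software is computed in exactly this way is an assumption, and the function name equivalent_parameters is hypothetical.

```python
import math


def tricube(u):
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def equivalent_parameters(xs, span=0.75):
    """Trace of the smoother matrix for a local-linear tricube fit (one predictor)."""
    n = len(xs)
    # Assumption: neighbourhood size is ceil(span * n), and at least 3 points.
    q = min(n, max(3, math.ceil(span * n)))
    trace = 0.0
    for xi in xs:
        dists = [abs(x - xi) for x in xs]
        h = sorted(dists)[q - 1]
        w = [tricube(d / h) for d in dists]
        s0 = sum(w)
        s1 = sum(wj * (x - xi) for wj, x in zip(w, xs))
        s2 = sum(wj * (x - xi) ** 2 for wj, x in zip(w, xs))
        # Weight that the fitted value at xi places on its own observation.
        trace += s2 / (s0 * s2 - s1 ** 2)
    return trace


xs = [float(i) for i in range(30)]
smooth = equivalent_parameters(xs, span=0.9)  # wide span, few effective parameters
wiggly = equivalent_parameters(xs, span=0.3)  # narrow span, more effective parameters
```

A straight-line fit by ordinary least squares would give a trace of exactly 2; the local fit gives a larger, generally non-integer value that grows as the span shrinks.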

The assumptions are:

Around point x the mean of y can be approximated by a small class of parametric functions in polynomial regression.
The errors in estimating y are independent and randomly distributed with a mean of zero.
Bias and variance are traded off by the choices for the settings of span and degree of polynomial.

Technical Validation

StatsDirect calls the loess function in the {stats} package of R. You must have R installed on the computer from which you are running StatsDirect. You can download and install R from the R Project website. The R implementation is based on the cloess algorithm, for which the original authors have a NETLIB site.

Example

Test workbook (Nonparametric worksheet: Hgb, eGFR).

The data are haemoglobin concentrations (Hgb) and kidney function estimates (eGFR) taken from a cohort of type 2 diabetics. The hypothesis is that, as diabetes causes chronic kidney disease, patients become anaemic because their kidneys send less of a signal to their bone marrow to produce red blood cells.

Open the test workbook using the file open function of the file menu. Then select LOESS from the Nonparametric section of the analysis menu. Select the columns marked "Hgb" and "eGFR" when prompted for outcome and predictor variables respectively.

Accept the default span of 0.75 and polynomial degree of 2. Check the box for the plot, then set the title to "Anaemia vs. Kidney Function in Diabetics", the vertical axis label to "Haemoglobin Concentration (g/L)", and the horizontal axis label to "Estimated Glomerular Filtration Rate (ml/min/1.73m²)".

For this example:

Hgb vs. eGFR

Number of Observations: 804

Polynomial degree: 2, Span: 0.75

Equivalent Number of Parameters: 5.464141

Residual Standard Error: 15.159968