Robust Fitting and Complex Models
Most researchers are familiar with standard kinetics, Michaelis-Menten and dose response curves, but many more modern analysis techniques are available that extract greater value from data. This article discusses the methods used in curve fitting today, including Iteratively Re-weighted Least Squares (IRLS), also known as robust fitting. The constraints of this technique are also explored, along with the reasons why robust fitting is more widely accepted and used today than when it was introduced some 20 years ago. The principles behind complex models, and how they can be applied, are also discussed.
Quick introduction to weights
By default, equal weight is given to every data point in a curve fit. To determine the best fit, the Levenberg-Marquardt algorithm (LVM) minimizes the sum of squares of the residuals, that is, the vertical distances between the observed data and the fitted curve.
By default, LVM minimizes:
Σ (Ydata – Yfit)²
Unequal weighting can be assigned according to any scheme.
If weight values are assigned, LVM minimizes:
Σ [(Ydata – Yfit)/Weight]²
Because each residual is divided by its weight value, the lower the weight value (closer to 0), the greater that point's bearing on the fit. A point is therefore down-weighted by assigning it a larger weight value.
Unequal weights can be assigned to data points according to a tolerance, so that all points are included in the analysis but down-weighted points have less bearing on the final result.
For example, an instrument may guarantee a high level of accuracy within a certain data range, with its accuracy decreasing once the limits of that range are exceeded. In this scenario, more bearing can be given to the data points within the instrument’s tolerances. Data points outside that range are still included in the analysis, but they have less bearing on the final result.
A set of weighting values can be applied that will reflect that assumption and reduce the impact of any outliers outside the tolerance ranges in the fitting process.
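The effect of the weight values can be illustrated with a minimal sketch. It assumes a straight-line model, which has a closed-form solution, rather than the Levenberg-Marquardt algorithm described above; the function names and the data are hypothetical and chosen only for illustration.

```python
def weighted_sum_of_squares(y_data, y_fit, weights=None):
    """Return sum(((Ydata - Yfit) / Weight) ** 2); every weight defaults to 1."""
    if weights is None:
        weights = [1.0] * len(y_data)
    return sum(((yd - yf) / w) ** 2 for yd, yf, w in zip(y_data, y_fit, weights))


def fit_line(x, y, weights=None):
    """Fit y = intercept + slope * x by minimizing the weighted sum of squares."""
    if weights is None:
        weights = [1.0] * len(x)
    # Dividing a residual by w is the same as multiplying its square by 1 / w**2.
    omega = [1.0 / (w * w) for w in weights]
    s = sum(omega)
    sx = sum(o * xi for o, xi in zip(omega, x))
    sy = sum(o * yi for o, yi in zip(omega, y))
    sxx = sum(o * xi * xi for o, xi in zip(omega, x))
    sxy = sum(o * xi * yi for o, xi, yi in zip(omega, x, y))
    slope = (s * sxy - sx * sy) / (s * sxx - sx * sx)
    intercept = (sy - slope * sx) / s
    return intercept, slope


# Hypothetical data: the first five points follow y = 1 + 2x exactly,
# the last point lies outside the instrument's trusted range.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0, 20.0]

equal = fit_line(x, y)
down_weighted = fit_line(x, y, weights=[1, 1, 1, 1, 1, 10])
print("equal weights:      intercept=%.3f slope=%.3f" % equal)
print("last point / 10:    intercept=%.3f slope=%.3f" % down_weighted)
```

With equal weights the last point pulls the slope well away from 2. Giving it a weight value of 10 keeps it in the analysis but brings the slope back close to that of the other five points.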
IRLS (Robust Fitting)
Standard regression analysis is very sensitive to outliers, and even a single outlier can affect results considerably, as shown in Fig 1 below. Knocking out an individual outlier improves the curve fit considerably, as shown in Fig 2.
Fig 1: Even one outlying data point can significantly affect the quality of a fit
Fig 2: Knocking out the outlier considerably improves results
Robust fitting is an extension of standard regression (standard non-linear Least Squares Fitting (LSF)) that can even out individual outliers in a data set and neutralize their effect on the ultimate result.
Robust fitting was introduced about 20 years ago but was not widely accepted at first because of the many competing techniques available at the time and a lack of understanding of the most appropriate way to use it. Another reason for the general reluctance to adopt IRLS was its computationally intensive nature. Standard non-linear LSF could be worked through by hand using standard mathematical techniques, but the more robust technique of IRLS was much harder to perform in the same way. Early curve-fitting software packages were not able to employ robust fitting, making the technique and its algorithms mostly unavailable to the mainstream.
A fitting process is iterative and, on each iteration, the fitting algorithm changes parameter values based on the data set provided in order to converge on best results.
Robust fitting introduces another variable to the fitting process by varying the weights of individual data points as well as the parameter values. On each cycle of the iteration, the weighting value for each data point is changed to enable the fit to converge on the best fit for the data. If there is an outlier in the data set, it will be significantly down-weighted to achieve a more robust and better fit for the rest of the data set.
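The re-weighting cycle described above can be sketched as follows. This is a minimal illustration under several assumptions, not the implementation used by any particular package: it fits a straight line (so each weighted fit has a closed form), uses Tukey's biweight with the commonly quoted tuning constant 4.685, and estimates the residual scale from the median absolute residual. Note that the IRLS weights here multiply the squared residuals, so a weight of 0 removes a point entirely. All names and data are hypothetical.

```python
import statistics


def weighted_line_fit(x, y, omega):
    """Weighted least-squares line; omega[i] multiplies the squared residual of point i."""
    s = sum(omega)
    sx = sum(o * xi for o, xi in zip(omega, x))
    sy = sum(o * yi for o, yi in zip(omega, y))
    sxx = sum(o * xi * xi for o, xi in zip(omega, x))
    sxy = sum(o * xi * yi for o, xi, yi in zip(omega, x, y))
    slope = (s * sxy - sx * sy) / (s * sxx - sx * sx)
    return (sy - slope * sx) / s, slope


def tukey_weight(u):
    """Tukey's biweight for a scaled residual u; zero once |u| reaches 1."""
    return (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0


def irls_line(x, y, c=4.685, max_iter=50, tol=1e-8):
    """Alternate between fitting the parameters and re-weighting the points."""
    omega = [1.0] * len(x)                      # start from equal weights
    a, b = weighted_line_fit(x, y, omega)
    for _ in range(max_iter):
        residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate from the median absolute residual.
        scale = statistics.median([abs(r) for r in residuals]) / 0.6745
        if scale == 0.0:
            break                               # at least half the points fit exactly
        omega = [tukey_weight(r / (c * scale)) for r in residuals]
        new_a, new_b = weighted_line_fit(x, y, omega)
        converged = abs(new_a - a) < tol and abs(new_b - b) < tol
        a, b = new_a, new_b
        if converged:
            break
    return a, b, omega


# Hypothetical data: roughly y = 1 + 2x with one obvious outlier at x = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.1, 4.9, 7.2, 30.0, 10.8, 13.1, 15.0, 16.9, 19.2, 20.9]

intercept, slope, weights = irls_line(x, y)
print("intercept=%.3f slope=%.3f" % (intercept, slope))
print("final weights:", [round(w, 2) for w in weights])
```

For a non-linear model the closed-form line fit would be replaced by a non-linear least-squares step on each cycle; the re-weighting logic stays the same.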
There are many IRLS techniques available, but the six most commonly used include:
- Tukey’s Biweight*
- Andrew’s Sine*
*Undefined over complete error space resulting in outliers being ‘removed’
Tukey’s Biweight and Andrew’s Sine are the most commonly used. Because they are not defined over the whole error space, these two techniques differ slightly from the other four. For example, when Tukey’s Biweight or Andrew’s Sine is employed, a data point whose weighting falls low enough is construed as an outlier and removed from the data set. This occurs in curve-fitting applications such as XLfit when a user chooses to automatically remove outlying points from a data set.
Note: The other four techniques down-weight outlying points so that they have no bearing on the fit at all, which is equivalent to knocking them out. It is possible to combine IRLS with manual outlier knock-out where appropriate.
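The two named weighting functions can be written down directly. The sketch below uses their textbook forms with the commonly quoted default tuning constants (4.685 for Tukey's Biweight, 1.339 for Andrew's Sine); these constants, and the assumption that the residual has already been divided by a robust scale estimate, are not taken from this article and may differ from what a given package uses.

```python
import math


def tukey_biweight(u, c=4.685):
    """Weight for a scaled residual u; exactly 0 once |u| reaches c."""
    if abs(u) >= c:
        return 0.0
    return (1.0 - (u / c) ** 2) ** 2


def andrews_sine(u, a=1.339):
    """Weight for a scaled residual u; exactly 0 once |u| reaches pi * a."""
    if u == 0.0:
        return 1.0                  # limit of sin(z) / z as z -> 0
    if abs(u) >= math.pi * a:
        return 0.0
    z = u / a
    return math.sin(z) / z


for u in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0):
    print("u=%.1f  tukey=%.3f  andrews=%.3f" % (u, tukey_biweight(u), andrews_sine(u)))
```

Both functions give a weight of 1 to a point lying on the curve, reduce the weight smoothly as the residual grows, and return exactly 0 beyond a cut-off, which is why a point past that cut-off is effectively removed from the fit.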
In the IRLS fitting scenarios below in Fig 3, Tukey’s Biweight is applied to three different sets of data, which are all well defined but contain easily identifiable outliers. IRLS has removed these data points from the set, making manual intervention unnecessary because the fit is of sufficient quality to give confidence in the results produced.
Note: These data sets are well defined and complete, with a reasonably high number of data points. Applying IRLS to well-formed data sets enables the analysis to be of high quality and the process to produce accurate results. Much like standard non-linear LSF, robust fitting does not work if there are any errors in the X values.
Fig 3: IRLS fitting improves the accuracy of fit results when applied to three different data sets
For a data set with a large amount of scatter, the process re-weights every point on each iteration, and it is very difficult for the fit to converge on a single best fit for such data. IRLS requires that the data fitted is of sufficient quality, otherwise it is prone to failure. It is recommended that IRLS is always used in conjunction with other data quality checks to ensure good results.
The graph below illustrates how the IRLS process works. The horizontal axis shows the residual value (the distance of a point from the fitted curve) and the vertical axis shows the level of impact that point has on the curve fit. The blue line represents standard least squares: it continues to infinity, so the further a point is from the fitted line, the higher its outlier status and the greater its impact on the fit. The red line is the IRLS fit. For a given outlier in the data set, as its outlier status increases, its impact on the fit decreases and eventually reaches 0.
Fig 4: The impact of IRLS on an outlier compared to standard least-squares regression
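The shape of the two lines can be reproduced numerically. The sketch below assumes the usual definition of impact as the residual multiplied by its weight, and uses Tukey's biweight with the commonly quoted constant 4.685 for the IRLS case; both choices are assumptions for illustration and the figure may have been produced differently.

```python
def impact_least_squares(r):
    """Standard least squares: impact grows without limit as the residual grows."""
    return r


def impact_tukey(r, c=4.685):
    """IRLS with Tukey's biweight: impact rises, then falls back to exactly 0 at |r| = c."""
    if abs(r) >= c:
        return 0.0
    return r * (1.0 - (r / c) ** 2) ** 2


for r in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0):
    print("residual=%5.1f  least squares=%5.1f  IRLS=%.3f"
          % (r, impact_least_squares(r), impact_tukey(r)))
```

For small residuals the two columns are nearly identical, so well-behaved points are treated almost as they would be in a standard fit; only points far from the curve lose their influence.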
Data fitting and analysis is not confined to basic Michaelis-Menten and dose response models. Complex modeling can be used to analyze different types of data using standard non-linear LSF. The example below in Fig 5 shows time-controlled drug delivery with a number of different parameters being measured, while a drug is administered in pulses at different time points. The graph shows absorption of the drug into the blood stream over time, indicated by the wave-like fit, allowing the researcher to determine the cycle and rate at which the drug is distributed.
Fig 5: Analyzing the cycle and rate at which drug absorption occurs
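A wave-like curve of this kind can arise from adding together the contributions of repeated doses. The sketch below is one possible form only: it assumes first-order absorption and elimination for each dose and simple superposition of doses, which is not necessarily the model used for Fig 5. The parameter names (ka, ke, dose_times) and the values are hypothetical.

```python
import math


def single_dose(t, dose, ka, ke):
    """Contribution at time t of one dose given at t = 0 (arbitrary units, ka != ke)."""
    if t <= 0.0:
        return 0.0
    return dose * ka / (ka - ke) * (math.exp(-ke * t) - math.exp(-ka * t))


def pulsed_concentration(t, dose_times, dose, ka, ke):
    """Sum the contributions of every dose administered so far."""
    return sum(single_dose(t - td, dose, ka, ke) for td in dose_times)


# Hypothetical schedule: one dose every 8 time units.
dose_times = [0.0, 8.0, 16.0]
for t in (0.0, 2.0, 8.0, 10.0, 16.0, 18.0):
    c = pulsed_concentration(t, dose_times, dose=1.0, ka=1.0, ke=0.2)
    print("t=%4.1f  concentration=%.3f" % (t, c))
```

In a fit, ka and ke would be the parameters adjusted by the non-linear least-squares algorithm, while the dosing times are fixed by the experiment.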
Composite models such as those shown in Fig 6 allow us to analyze a data set using two different models. For example, the researcher fits the first model up to the point in time at which the data points start to go back down, and then changes the model to analyze a different phase within the data. Although this is a complex model, it allows the researcher to fit results with a high degree of confidence.
Fig 6: Fitting composite data
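A composite model of this kind can be expressed as a single function that switches form at a given time. The sketch below assumes an exponential rise followed by an exponential decay, joined so that the curve is continuous at the switch point; the actual models in Fig 6 are not specified in this article, and all parameter names and values are hypothetical.

```python
import math


def composite_model(t, t_switch, plateau, k_rise, k_decay):
    """Rising phase up to t_switch, then a decaying phase starting from the value reached."""
    if t <= t_switch:
        return plateau * (1.0 - math.exp(-k_rise * t))
    peak = plateau * (1.0 - math.exp(-k_rise * t_switch))   # value at the switch point
    return peak * math.exp(-k_decay * (t - t_switch))


for t in (0.0, 2.0, 5.0, 7.0, 10.0):
    y = composite_model(t, t_switch=5.0, plateau=100.0, k_rise=0.8, k_decay=0.3)
    print("t=%4.1f  y=%.2f" % (t, y))
```

Because both phases are part of one function, a single fit can estimate the parameters of each phase, and the switch time itself can either be fixed or treated as a further parameter.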
Fig 7 below shows a common scenario where data is fitted to a standard dose response curve but the data points start decreasing at the end of the measurements. The researcher can set up a technique to remove those final data points, such as applying an IRLS fitting technique to eliminate those points that start to drop off.
Alternatively, the researcher can use a model that has been constructed to tackle this kind of scenario. A bell-shaped dose response model fits both the rising and the falling portions of the data, so that parameters C1 and C2 can be extracted as the EC50 values for the two linked dose response curves, with measurable slope factors for both curves. Bell-shaped models provide an effective means of analyzing and interpreting a whole set of data, as opposed to having to reject data points.
A scenario such as this comes up frequently in standard dose response analysis. If the last six points of this example were knocked out and a standard dose response curve fitted to the remaining data, the results would be similar or identical to those for the first curve in the bell-shaped model.
Fig 7: A bell-shaped dose response model producing two fit results
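One way to build a bell-shaped model is to multiply a rising and a falling sigmoid term. The sketch below is an assumed form for illustration and is not necessarily the equation used for Fig 7; c1 and c2 play the role of the C1 and C2 parameters mentioned above, and h1 and h2 are the two slope factors. A standard four-parameter dose response curve is included for comparison.

```python
def dose_response(x, bottom, top, c, h):
    """Standard four-parameter dose response curve (x > 0)."""
    return bottom + (top - bottom) / (1.0 + (c / x) ** h)


def bell_shaped(x, bottom, top, c1, h1, c2, h2):
    """Rising curve with midpoint c1 multiplied by a falling curve with midpoint c2 (x > 0)."""
    rising = 1.0 / (1.0 + (c1 / x) ** h1)
    falling = 1.0 / (1.0 + (x / c2) ** h2)
    return bottom + (top - bottom) * rising * falling


# Hypothetical parameters: response rises around x = 1 and drops off around x = 1000.
for x in (0.01, 1.0, 30.0, 1000.0, 100000.0):
    bell = bell_shaped(x, 0.0, 100.0, c1=1.0, h1=1.0, c2=1000.0, h2=1.0)
    std = dose_response(x, 0.0, 100.0, c=1.0, h=1.0)
    print("x=%9.2f  bell=%6.2f  standard=%6.2f" % (x, bell, std))
```

At concentrations well below c2 the two curves are almost indistinguishable, which is consistent with the observation that fitting a standard curve to the rising portion alone gives results similar to the first curve of the bell-shaped model.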
IRLS provides an advanced technique for reducing and neutralizing the effects of outliers in a fit. By weighting individual data points, IRLS can increase the accuracy of fit results compared to those achieved using standard regression (standard non-linear LSF). Both techniques, however, must be applied to a well defined and complete data set in order to produce quality results.
Curve fitting is a flexible process offering a range of data analysis types, and researchers do not have to be constrained by standard analysis techniques. Complex models provide a variety of innovative ways of applying data analysis to extract the required results in varying scenarios. They extend data fitting and analysis beyond basic Michaelis-Menten and dose response models and can be used in a wide range of applications.