Outliers

Handling Outliers in Python

In this post, we will discuss:

How to identify outliers
How to handle the outliers

Outliers are abnormal values, either too large or too small. Possible causes of outliers are:

Mistake in recording, entry or processing
Observational error
A true observation

If we can identify the cause of an outlier, we can decide on the next course of action. If it is due to a mistake, we can try to obtain the true values for those observations.

If it is due to observational error, we can again try to find the true value through calibration or averaging. The table below summarizes the key concepts of observational error, which is the sum of two components: systematic error and random error.

Observational error/measurement error


	Systematic error	Random error (statistical error)
Example	A balance that shows a non-zero value even when no weight is placed on it. If it shows 0.5 kg instead of 0, we can find the true weight by deducting 0.5 kg from the reading.	A measurement affected by the surrounding environment.
Nature	Predictable: constant or proportional to the true value.	Not predictable: varies from one observation to another.
Can we eliminate it?	Possible to eliminate.	Always present in a measurement. Can be reduced by taking multiple observations and averaging them.
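The two corrections in the table can be illustrated with a small simulation. This is only a sketch: the true weight, the 0.5 kg zero offset taken from the balance example, the noise level, and the helper name `estimate_true_weight` are all assumptions made up for the illustration.

```python
import numpy as np

# Hypothetical setup: a balance that reads 0.5 kg with nothing on it
# (systematic error) plus small random noise on every reading.
TRUE_WEIGHT_KG = 10.0
ZERO_OFFSET_KG = 0.5

rng = np.random.default_rng(seed=0)
noise = rng.normal(loc=0.0, scale=0.05, size=1000)  # random error
readings = TRUE_WEIGHT_KG + ZERO_OFFSET_KG + noise


def estimate_true_weight(readings, zero_offset):
    """Subtract the known offset, then average to damp the random error."""
    corrected = np.asarray(readings, dtype=float) - zero_offset
    return corrected.mean()


estimate = estimate_true_weight(readings, ZERO_OFFSET_KG)
print(f"one corrected reading: {readings[0] - ZERO_OFFSET_KG:.3f} kg")
print(f"average of {readings.size} corrected readings: {estimate:.3f} kg")
```

Subtracting the offset handles the systematic part exactly, while averaging only shrinks the random part as more readings are taken.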

How to detect outliers

Histogram

A histogram displays outliers clearly: they show up as isolated bars far away from the main body of the distribution.
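As a rough illustration, the bin counts behind a histogram can be computed with `numpy.histogram` and printed as a text chart. The sample values and the choice of 8 bins are assumptions for this sketch, not data from the post.

```python
import numpy as np

# Hypothetical sample: values clustered around 50, plus one extreme value.
values = np.array([48, 49, 50, 50, 51, 51, 52, 53, 120], dtype=float)

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(values, bins=8)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:6.1f} to {right:6.1f} | {'#' * int(count)}")
```

The long run of empty bins between the first and the last bar is what makes the extreme value stand out.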

Scatter Plot

If there is more than one variable, a scatter plot is also useful for detecting outliers visually.
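A minimal sketch of such a plot with matplotlib is shown here. The data are synthetic: a linear trend with noise and one injected point that lies off the trend, all invented for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (an assumption for illustration): y roughly equals 2 * x.
rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=50)
y = 2 * x + rng.normal(0, 1, size=50)

# Inject one point that is ordinary in x and in y separately,
# but far from the trend when the two are viewed together.
x = np.append(x, 9.0)
y = np.append(y, 2.0)

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatter plot with one injected outlier")
# Call plt.show() or fig.savefig("scatter.png") to view the figure.
```

The injected point would be hard to spot in a histogram of either variable alone, which is why the scatter plot is worth drawing when variables are related.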

Handling Outliers

If we can’t rectify the outliers, we may consider some of the following methods to handle them.

Doing nothing

Deleting/Trimming

Be careful, as this may lead to sampling bias. As Professor Patrick Breheny points out, throwing away outliers may be the simplest method, but it threatens scientific integrity and objectivity. He cites the example of how NASA missed detecting the hole in the ozone layer because the readings were thought to be outliers.

After deleting the outliers, we should be careful not to run the outlier detection test again. Because the IQR and standard deviation change once the outliers are removed, a second run may wrongly flag some of the remaining values as new outliers.
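The effect of re-running the test can be seen in a small sketch. It assumes the common 1.5 × IQR rule as the detection test; the helper name `iqr_outlier_mask` and the data are made up for this example.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical data with two obvious outliers (-20 and 40).
data = np.array([-20, 4.5, 7, 8, 8, 9, 9, 10, 10, 11, 13.5, 40])

# First pass: flag and trim.
first_mask = iqr_outlier_mask(data)
trimmed = data[~first_mask]
print("flagged on the first pass:", data[first_mask])

# Second pass on the trimmed data: the quartiles have moved,
# so values that were acceptable before are now flagged.
second_mask = iqr_outlier_mask(trimmed)
print("flagged if the test is re-run:", trimmed[second_mask])
```

On this data the first pass flags -20 and 40. Re-running on the trimmed data then flags 4.5 and 13.5, which were inside the fences the first time.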

Winsorizing

Unlike trimming, here we replace the outliers with other values. A common choice is to replace outliers on the upper side with the 95th percentile value and outliers on the lower side with the 5th percentile value.
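One possible way to do this with NumPy is to compute the two percentiles and clip the data to them. The function name `winsorize` and the sample data are assumptions for this sketch. Note that clipping caps every value beyond the two percentiles, not only values flagged by a separate outlier test.

```python
import numpy as np


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles at those percentiles."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lower, upper), lower, upper


# Hypothetical data: 1..99 plus one extreme value.
data = np.arange(1, 101, dtype=float)
data[-1] = 1000.0

winsorized, lower, upper = winsorize(data)
print(f"5th percentile: {lower:.2f}, 95th percentile: {upper:.2f}")
print(f"max before: {data.max():.2f}, max after: {winsorized.max():.2f}")
```

The number of observations stays the same, which is the main difference from trimming.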

Transformation
Binning
Use robust estimators
Imputing, using the same techniques as for handling missing values
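The four options listed above can be sketched briefly as follows. Everything specific here is an assumption chosen for illustration: the data, the log transform, the hand-picked bin edges, the median as the robust estimator, the `data > 100` flag, and median imputation.

```python
import numpy as np

# Hypothetical data with one extreme value.
data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 500.0])

# Transformation: a log transform compresses large values.
log_data = np.log1p(data)

# Binning: replace each value with the index of its bin.
# Values of 10 or more all fall into the last, open-ended bin.
bin_edges = np.array([2.0, 4.0, 10.0])
bin_index = np.digitize(data, bin_edges)

# Robust estimators: the median is barely affected by the extreme value,
# while the mean is pulled far towards it.
mean = data.mean()
median = np.median(data)

# Imputing: treat flagged values like missing values and fill them in,
# here with the median of the remaining observations.
is_outlier = data > 100  # hypothetical flag from an earlier detection step
imputed = data.copy()
imputed[is_outlier] = np.median(data[~is_outlier])

print("log-transformed max:", round(float(log_data.max()), 3))
print("bin index:", bin_index)
print("mean:", mean, "median:", median)
print("imputed:", imputed)
```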
