Handling Outliers in Python
In this post, we will discuss:
- How to identify outliers
- How to handle the outliers
Outliers are abnormal values: either too large or too small compared with the rest of the data. Possible causes of outliers are:
- A mistake in recording, entry or processing
- Observational error
- A true observation
If we can identify the cause of the outliers, we can decide the next course of action. If an outlier is due to a mistake, we can try to obtain the true value for that observation.
If it is due to observational error, we can again try to find the true value through calibration or averaging. The table below summarizes the important concepts of observational error, which is the sum of two additive errors: systematic error and random error.
Observational error/measurement error
| | Systematic error | Random error (statistical error) |
|---|---|---|
| Example | A balance showing a non-zero value even when no weight is placed on it. If it shows 0.5 kg instead of 0, we can find the true weight by deducting 0.5 kg from the actual reading. | A measurement affected by the surrounding environment. |
| Nature | Predictable. Constant or proportional to the true value. | Not predictable. Varies from one observation to another. |
| Can we eliminate? | Possible to eliminate. | Always present in a measurement. Can be reduced by taking multiple observations and then averaging. |
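
To make the two error types concrete, here is a minimal sketch of calibration followed by averaging. The balance, its 0.5 kg zero offset, the true weight and the noise level are all made-up assumptions for illustration, not measurements from this post.

```python
import numpy as np

# Hypothetical setup: a balance with a known zero offset (systematic error)
# plus random noise on every reading. All numbers are assumed for illustration.
TRUE_WEIGHT = 10.0   # kg, the value we are trying to recover
ZERO_OFFSET = 0.5    # kg, reading shown when nothing is on the balance
NOISE_SD = 0.1       # kg, spread of the random error

rng = np.random.default_rng(seed=0)
readings = TRUE_WEIGHT + ZERO_OFFSET + rng.normal(0.0, NOISE_SD, size=1000)

# Calibration: subtract the known offset to remove the systematic error.
calibrated = readings - ZERO_OFFSET

# Averaging: the mean of many readings reduces the random error.
estimate = calibrated.mean()

print(f"single raw reading : {readings[0]:.3f} kg")
print(f"calibrated average : {estimate:.3f} kg")
```

Averaging alone would not help with the offset: the mean of the raw readings stays about 0.5 kg too high, which is why the systematic part has to be handled by calibration.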
How to detect outliers
- Box Plot
To illustrate, I have created a data set called data with a single column, Height. It includes two deliberately extreme values: one that is too large (209) and one that is too small (-200), while the mean height is 14.77. The box plot detects both of these outliers.
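
A box plot along these lines can be sketched with matplotlib. The heights below are stand-in values I made up, with the two extreme values from the text appended. They are not the original data set, so the mean differs from 14.77.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up heights standing in for the post's data set (not the original data),
# with the two extreme values from the text appended.
heights = np.array([12, 13, 14, 14, 15, 15, 16, 16, 17, 18] * 5 + [209, -200],
                   dtype=float)

fig, ax = plt.subplots()
# whis=1.5 is matplotlib's default: whiskers reach at most 1.5*IQR past the box.
result = ax.boxplot(heights, whis=1.5)
ax.set_ylabel("Height")

# Points drawn beyond the whiskers ("fliers") are the suspected outliers.
outliers = np.sort(result["fliers"][0].get_ydata())
print(outliers)
```

Call `plt.show()` afterwards to display the figure in an interactive session.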

- Interquartile Range (IQR) based method
This method uses the same concept as the box plot. With IQR = Q3 - Q1, we identify outliers as values less than Q1 - (1.5*IQR) or greater than Q3 + (1.5*IQR).
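
The rule can be sketched with NumPy as follows. The heights are made-up stand-in values (not the original data set) with the two extreme values from the text appended.

```python
import numpy as np

# Made-up heights standing in for the post's data set, plus the two extremes.
heights = np.array([12, 13, 14, 14, 15, 15, 16, 16, 17, 18] * 5 + [209, -200],
                   dtype=float)

# Quartiles and interquartile range.
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (heights < lower) | (heights > upper)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: {lower} to {upper}")
print("outliers:", heights[is_outlier])
```

The multiplier 1.5 is the conventional choice. A larger multiplier flags fewer points.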

- Standard Deviation based method
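In this method, we use the mean and standard deviation to detect outliers: a value is flagged if it lies more than a chosen number of standard deviations (commonly three) from the mean.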
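
Here is a minimal sketch of this method, assuming a cut-off of three standard deviations. Both the threshold and the stand-in heights are my assumptions, not values from the original data set.

```python
import numpy as np

# Made-up heights standing in for the post's data set, plus the two extremes.
heights = np.array([12, 13, 14, 14, 15, 15, 16, 16, 17, 18] * 5 + [209, -200],
                   dtype=float)

THRESHOLD = 3.0  # assumed cut-off, in standard deviations

mean = heights.mean()
std = heights.std()  # population standard deviation (ddof=0)

# z-score: distance from the mean measured in standard deviations.
z_scores = (heights - mean) / std
is_outlier = np.abs(z_scores) > THRESHOLD

print(f"mean={mean:.2f}, std={std:.2f}")
print("outliers:", heights[is_outlier])
```

Keep in mind that the outliers themselves inflate the mean and standard deviation, so on very small samples this rule can fail to flag even obvious extremes.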

- Histogram
A histogram also displays these outliers clearly.
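
As a rough illustration, the sketch below computes histogram counts with NumPy and prints them as a text chart. The heights are made-up stand-in values and the choice of 10 bins is an assumption. As far as I know, `plt.hist(heights, bins=10)` in matplotlib draws the same bins graphically.

```python
import numpy as np

# Made-up heights standing in for the post's data set, plus the two extremes.
heights = np.array([12, 13, 14, 14, 15, 15, 16, 16, 17, 18] * 5 + [209, -200],
                   dtype=float)

# Ten equal-width bins spanning the full range of the data.
counts, edges = np.histogram(heights, bins=10)

# Text rendering: one row per bin, one '#' per observation.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:8.1f} to {right:8.1f} | {'#' * int(count)}")
```

The bulk of the data lands in a single bin, while the two extreme values sit alone in the first and last bins, separated by empty bins.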

- Scatter Plot
If there is more than one variable, a scatter plot is also useful for detecting outliers visually.
Handling Outliers
- Doing nothing
- Deleting/Trimming
- Winsorizing
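
Trimming and winsorizing can be sketched as below. The heights are made-up stand-in values, and the cut-offs (the 1.5*IQR fences for trimming, the 5th and 95th percentiles for winsorizing) are assumptions chosen for illustration.

```python
import numpy as np

# Made-up heights standing in for the post's data set, plus the two extremes.
heights = np.array([12, 13, 14, 14, 15, 15, 16, 16, 17, 18] * 5 + [209, -200],
                   dtype=float)

# Trimming: drop every value outside the 1.5*IQR fences.
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1
keep = (heights >= q1 - 1.5 * iqr) & (heights <= q3 + 1.5 * iqr)
trimmed = heights[keep]

# Winsorizing: keep every observation, but cap extreme values at the
# 5th and 95th percentiles (assumed cut-offs).
low, high = np.percentile(heights, [5, 95])
winsorized = np.clip(heights, low, high)

print("trimmed    :", len(trimmed), "values, range", trimmed.min(), "to", trimmed.max())
print("winsorized :", len(winsorized), "values, range", winsorized.min(), "to", winsorized.max())
```

Trimming shrinks the sample, while winsorizing keeps the sample size and only pulls the extremes in to the chosen percentiles.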

- Transformation: Use a transformation such as the log transformation for a right-tailed distribution.
- Binning: Binning, or discretization of continuous data into groups such as low, medium and high, turns each outlier into one more count in its group instead of an extreme value.
- Use robust estimators: Robust estimators, such as the median for measuring central tendency and decision trees for classification tasks, handle outliers better.
- Imputing: Another method is to treat the outliers as missing values and then impute them using methods similar to those that we saw while handling missing values.
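
The last four options can be sketched briefly with NumPy. All data values, the bin cut points (14 and 16) and the choice of median imputation are assumptions for illustration. Decision trees are not shown here.

```python
import numpy as np

# Transformation: hypothetical right-tailed values (made up for illustration).
incomes = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 1000.0])
log_incomes = np.log1p(incomes)  # log(1 + x) compresses the long right tail

# Made-up heights standing in for the post's data set, plus the two extremes.
heights = np.array([12, 13, 14, 14, 15, 15, 16, 16, 17, 18] * 5 + [209, -200],
                   dtype=float)

# Binning: three groups with assumed cut points at 14 and 16.
# np.digitize returns 0 for x < 14, 1 for 14 <= x < 16, 2 for x >= 16.
labels = np.array(["low", "medium", "high"])
groups = labels[np.digitize(heights, bins=[14, 16])]
names, group_counts = np.unique(groups, return_counts=True)

# Robust estimator: the median barely moves when outliers are present.
mean_height = heights.mean()
median_height = np.median(heights)

# Imputing: mark outliers (1.5*IQR rule) as missing, then fill them with
# the median of the remaining values.
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1
is_outlier = (heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)
with_missing = np.where(is_outlier, np.nan, heights)
fill_value = np.nanmedian(with_missing)
imputed = np.where(np.isnan(with_missing), fill_value, with_missing)

print("log-transformed:", np.round(log_incomes, 2))
print("bin counts     :", dict(zip(names.tolist(), group_counts.tolist())))
print(f"mean={mean_height:.2f}, median={median_height:.2f}")
print("imputed range  :", imputed.min(), "to", imputed.max())
```

Note that the log transformation only applies to positive values (or non-negative values with `log1p`), so it would not work directly on a column containing -200.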