How to find outliers in a dataset?
statistics — 2
What is an outlier?
An outlier is an observation that lies an abnormal distance from the other values in a random sample from a population. Outliers typically sit at the extreme ends of a dataset.
How to identify outliers?
There are multiple ways to identify outliers. A few of them are:
1. Percentiles:
A percentile is the value below which a given percentage of the values in a dataset fall.
- We fix the percentile limits in advance, and any value that falls outside them is considered an outlier.
- By the empirical rule (68–95–99.7), about 68%, 95%, and 99.7% of normally distributed data lie within 1, 2, and 3 standard deviations (SD) of the mean.
- If 3 SD is the threshold, data points outside the central 99.7% of the distribution (roughly above the 99.85th percentile or below the 0.15th percentile) are considered outliers.
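- Calculating a percentile: value = percentile × (n + 1) / 100
- Here, value is the index (position) in the sorted dataset and n is the number of data points.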
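The percentile approach can be sketched in Python as follows. This is only an illustration, assuming the position formula percentile × (n + 1) / 100 that this article also applies to the quartiles, with linear interpolation when the position falls between two data points. The function names, the 20/80 limits, and both small datasets are hypothetical and chosen only for demonstration.

```python
import math
import statistics


def percentile_value(data, percentile):
    """Value at a percentile, using position = percentile * (n + 1) / 100."""
    ordered = sorted(data)
    n = len(ordered)
    position = percentile * (n + 1) / 100  # 1-based position in the sorted data
    lower = math.floor(position)
    fraction = position - lower
    if lower < 1:        # position falls before the first value
        return ordered[0]
    if lower >= n:       # position falls at or after the last value
        return ordered[-1]
    # Interpolate between the two neighbouring values.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def percentile_outliers(data, low_pct, high_pct):
    """Values outside the predetermined percentile limits."""
    low = percentile_value(data, low_pct)
    high = percentile_value(data, high_pct)
    return [x for x in data if x < low or x > high]


def three_sd_outliers(data, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population SD; an assumption of this sketch
    if sd == 0:
        return []
    return [x for x in data if abs(x - mean) > threshold * sd]


# Hypothetical data, not taken from the article.
scores = [-10, 3, 4, 6, 7, 9, 11, 12, 50]
print(percentile_value(scores, 25))         # 3.5
print(percentile_outliers(scores, 20, 80))  # [-10, 50]

readings = [10] * 19 + [100]
print(three_sd_outliers(readings))          # [100]
```

Note that fixed percentile limits flag values near both ends of any dataset, so the limits should be chosen with the data in mind. The 3 SD rule, in turn, is only meaningful when the data is roughly normal and the sample is reasonably large.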
2. Quartiles
Five-number summary used to find and remove outliers:
- Minimum
- First Quartile (Q1)
- Median
- Third Quartile (Q3)
- Maximum
Lower fence = Q1 - 1.5 × IQR (anything below it is an outlier)
Upper fence = Q3 + 1.5 × IQR (anything above it is an outlier)
IQR = interquartile range = Q3 - Q1
Position of Q1 in the sorted data = percentile × (n + 1) / 100, i.e., 25(n + 1)/100
Position of Q3 in the sorted data = 75(n + 1)/100
Example:
Therefore, in the above example, anything below -8.5 or above 23.5 is considered an outlier.
Hence, it is clear that -10 and 50 are outliers in the data.
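To make the fence calculation concrete, here is a minimal Python sketch. The dataset is hypothetical: it was chosen only so that the fences come out to the -8.5 and 23.5 quoted above, and it is not necessarily the data from the original example. The sketch relies on `statistics.quantiles` from the standard library, whose default "exclusive" method places the quartiles using the same (n + 1) positioning as the Q1 and Q3 formulas. The helper names `iqr_fences` and `iqr_outliers` are illustrative.

```python
import statistics


def iqr_fences(data, k=1.5):
    """Return (lower_fence, upper_fence) as Q1 - k*IQR and Q3 + k*IQR."""
    # With n=4 and the default method="exclusive", the cut points sit at
    # positions i * (len(data) + 1) / 4 in the sorted data.
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(data, k=1.5):
    """Values that fall below the lower fence or above the upper fence."""
    lower, upper = iqr_fences(data, k)
    return [x for x in data if x < lower or x > upper]


# Hypothetical data, not taken from the article.
data = [-10, 3, 4, 6, 7, 9, 11, 12, 50]

lower_fence, upper_fence = iqr_fences(data)
print(lower_fence, upper_fence)  # -8.5 23.5
print(iqr_outliers(data))        # [-10, 50]
```

Here Q1 is 3.5 and Q3 is 11.5, so the IQR is 8 and the fences sit 12 units beyond each quartile. Other quartile conventions (for example, the default in many plotting libraries) can give slightly different fences on the same data.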
3. Box plot
A box plot gives us a visual representation of the data, which makes outliers easy to spot.
Values below the plot's "minimum" (the end of the lower whisker) or above its "maximum" (the end of the upper whisker) are outliers.