How to find outliers in a dataset?

Sowjanya Sadashiva
2 min read · May 31, 2023

statistics — 2

What is an outlier?

Outliers are values at the extreme ends of a dataset. An outlier is an observation that lies an abnormal distance from the other values in a random sample from a population.

How to identify outliers?

There are multiple ways to identify outliers. A few of them are:

1. Percentiles:

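A percentile is a value below which a given percentage of the values in a dataset fall.

  • We choose the percentile cutoffs in advance, and any value that falls outside them is considered an outlier.
  • By the empirical rule (68–95–99.7), about 68%, 95%, and 99.7% of normally distributed values lie within 1, 2, and 3 standard deviations (SD) of the mean.
  • If 3 SD is the threshold, the middle 99.7% of the data is kept, so data points above roughly the 99.85th percentile or below roughly the 0.15th percentile are considered outliers.
  • Calculating a percentile: position = percentile × (n + 1) / 100, where n is the number of values.
  • The result is the position (index) of that percentile's value in the sorted dataset.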
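The 3 SD threshold from the empirical rule can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function name `three_sigma_outliers` and the sample data are hypothetical, and the rule is only a reasonable guide when the data is roughly normally distributed.

```python
import statistics

def three_sigma_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    # Population SD; statistics.stdev would give the sample SD instead.
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical data: 21 values close to 10 and one extreme value.
data = [9, 10, 11] * 7 + [60]
print(three_sigma_outliers(data))  # prints [60]
```

Note that with very small datasets no point can be 3 SD away from the mean, so this check needs a reasonable number of observations to be useful.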

2. Quartiles

The five-number summary used to find outliers:

  • Minimum
  • First quartile (Q1)
  • Median
  • Third quartile (Q3)
  • Maximum

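Lower fence = Q1 - 1.5 × IQR (anything below it is an outlier)

Upper fence = Q3 + 1.5 × IQR (anything above it is an outlier)

IQR = interquartile range = Q3 - Q1

Position of Q1 = percentile × (n + 1) / 100, i.e., 25(n + 1)/100

Position of Q3 = 75(n + 1)/100

Here n is the number of values, and each position is counted in the sorted data.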

Example:

Therefore, in this example, any value below -8.5 or above 23.5 is considered an outlier.

Hence, it is clear that -10 and 50 are outliers in the data.
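The fence calculation can be sketched in Python as follows. The data below is hypothetical: it is not the dataset from the original example, only a small sample chosen so that the formulas give the same fences (-8.5 and 23.5) and the same outliers (-10 and 50). The helper names `value_at_percentile` and `iqr_fences` are also assumptions made for illustration, and fractional positions are handled by interpolating between neighbouring values.

```python
def value_at_percentile(sorted_values, percentile):
    """Value at 1-based position percentile * (n + 1) / 100, interpolating if fractional."""
    n = len(sorted_values)
    position = percentile * (n + 1) / 100
    position = min(max(position, 1), n)  # clamp to a valid position
    lower = int(position)                # whole part of the position
    fraction = position - lower          # fractional part of the position
    if lower == n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)

def iqr_fences(values):
    """Return (lower_fence, upper_fence) using Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    ordered = sorted(values)
    q1 = value_at_percentile(ordered, 25)
    q3 = value_at_percentile(ordered, 75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data, n = 9: Q1 sits at position 2.5 and Q3 at position 7.5.
data = [-10, 3, 4, 5, 7, 9, 11, 12, 50]
lower_fence, upper_fence = iqr_fences(data)
outliers = [v for v in data if v < lower_fence or v > upper_fence]
print(lower_fence, upper_fence, outliers)  # -8.5 23.5 [-10, 50]
```

For this sample, Q1 = 3.5 and Q3 = 11.5, so IQR = 8 and the fences are 3.5 - 12 = -8.5 and 11.5 + 12 = 23.5.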

3. Box plot

A box plot gives a visual representation of the data, which makes outliers easy to spot.

Values plotted below the lower whisker (the "minimum") or above the upper whisker (the "maximum") are outliers.
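The numbers a box plot is drawn from can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `box_plot_summary` and the data are hypothetical, the whiskers are assumed to end at the most extreme values that still lie within the 1.5 × IQR fences, and quartiles come from the standard library's `statistics.quantiles` with `method="exclusive"`, which uses (n + 1)-based positions.

```python
import statistics

def box_plot_summary(values):
    """Compute the box, whisker ends, and outlier points for a box plot."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values that are still inside the fences.
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    return {
        "whisker_min": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_max": inside[-1],
        "outliers": [v for v in ordered if v < lower_fence or v > upper_fence],
    }

# Hypothetical data with one low and one high extreme value.
data = [-10, 3, 4, 5, 7, 9, 11, 12, 50]
summary = box_plot_summary(data)
print(summary["whisker_min"], summary["whisker_max"], summary["outliers"])
# 3 12 [-10, 50]
```

A plotting library such as matplotlib can draw the plot itself (for example with `matplotlib.pyplot.boxplot`), though its quartile calculation may differ slightly from the formula used here.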

