Handling Missing values — part 2

Sowjanya Sadashiva
May 22, 2023


There are various techniques to handle missing values.

  1. Mode/Median/Mean value replacement
  2. Remove entire record
  3. Most frequent value replacement
  4. Model based imputation
  5. Interpolation/Extrapolation
  6. Forward filling/Backward filling
  7. Hot deck imputation

Null value replacement

There are multiple reasons for missing values, and they are often represented as NaN (Not a Number) or Null. These missing values need to be handled before modeling the data.

Which technique to use completely depends on the dataset.

1. Mode/Median/Mean value replacement

  • df.isnull().sum() prints the number of missing values in each column.
  • sklearn.impute.SimpleImputer is used for missing values imputation using mean, median, mode.
  • Plots such as box plots and density plots are very handy for deciding which technique to use.
  • Mean: It is often not a good measure when the data is skewed, since an outlier can change the mean significantly. Use it when the variable is approximately normal and outliers are highly unlikely. It cannot be used for categorical features.
  • Median: It is less sensitive to outliers than the mean, so it is preferred when the data is skewed.
  • Mode: It is suitable for categorical variables or numerical variables with a small number of unique values.
#for mean replacement

import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_mean.transform(X).ravel()

# ravel() is used to unravel the imputed array into a vector
#for median replacement

import numpy as np
from sklearn.impute import SimpleImputer

imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imp_median.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_median.transform(X).ravel()
  • The missing values are replaced with mean, median or mode.
  • Standard deviation and variance go down, so the variability of the data is underestimated.
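The shrinking variance is easy to demonstrate. A minimal sketch (the column values below are assumed for illustration): filling every gap with the mean places the new points exactly at the center, so the spread of the filled column is smaller than that of the observed values.

```python
import numpy as np

col = np.array([7.0, np.nan, 4.0, np.nan, 10.0, 5.0])
observed = col[~np.isnan(col)]

# replace each NaN with the mean of the observed values
filled = np.where(np.isnan(col), observed.mean(), col)

print(observed.std())  # spread of the observed values only
print(filled.std())    # smaller: imputed points add no spread
```

The means of the two arrays are identical, but the standard deviation drops, which is exactly the lower-variability effect described above.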

2. Remove entire record

  • Remove every record that has a missing value.
  • With this technique we can lose plenty of data, including important values.
  • Training on fewer records also makes the model more prone to overfitting.
  • Use the pandas drop() and dropna() functions.
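A minimal sketch of how much data dropna() can discard, on an assumed toy DataFrame where missing values are spread across rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan]})

print(len(df))           # 3 rows before
print(len(df.dropna()))  # only 1 complete row survives
```

Here each NaN sits in a different row, so dropping records removes two thirds of the dataset even though only two cells were missing.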

3. Most frequent value replacement

  • Simple Imputer can be used to replace the missing values with the most frequently occurring values.
  • It works with categorical features.
  • It does not account for correlations between features.
  • It can introduce bias into the model by favoring most repeating values in the dataset.
#for most frequent replacement

import numpy as np
from sklearn.impute import SimpleImputer

imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mf.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_mf.transform(X).ravel()

4. Model based imputation

  • Uses models such as K-Nearest Neighbors, regression, or deep learning.
  • KNN: A simple yet effective method that uses ‘feature similarity’ to predict missing values.
  • Each sample’s missing values are imputed using the mean value from the n_neighbors nearest neighbors found in the training set.
# missingpy offers a KNNImputer whose API mirrors scikit-learn's
from missingpy import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights='uniform')
imputer.fit_transform(attribute).ravel()
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)
# array([[1. , 2. , 4. ],
#        [3. , 4. , 3. ],
#        [5.5, 6. , 5. ],
#        [8. , 8. , 7. ]])
  • Regression: Predictors of the variable with missing values are identified via a correlation matrix. The best predictors are selected and used as independent variables in a regression equation, with the variable that has missing data as the target variable.
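One way to sketch regression-based imputation is scikit-learn's IterativeImputer (still flagged experimental, hence the enable import), which regresses each feature with missing values on the other features; the toy data below is assumed for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# second column is roughly twice the first, so the missing
# first-column entry can be predicted from that relation
X = [[1, 2], [3, 6], [4, 8], [np.nan, 3]]

imp = IterativeImputer(max_iter=10, random_state=0)
print(imp.fit_transform(X))
```

The imputer fills the NaN with a regression prediction rather than a column-wide constant, so the filled value respects the correlation between the two features.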
  • Deep Learning: works well with categorical and non-numeric variables.

5. Interpolation/Extrapolation

  • Interpolation: Estimating a value that lies between two known values in a sequence of values.
  • Interpolation is used to ascertain missing data points within a ‘known’ data range.
  • Extrapolation: Estimating an unknown value by extending a known sequence of values or facts, i.e., inferring something not explicitly stated from existing information.
  • Extrapolation is the process of ascertaining missing data points outside the known data range.
  • For example, given the function y = 2x + 5, you can extrapolate values of y when x lies outside the range of the present data.
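Interpolation within a known range can be sketched with pandas, whose Series.interpolate fills gaps linearly by default (the toy series below is assumed):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# each NaN is replaced by the straight-line value between its neighbors
print(s.interpolate())  # 1, 2, 3, 4, 5
```

Because both neighbors of each gap are known, this is interpolation; extending the same line beyond index 4 would be extrapolation.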

6. Forward filling/ Backward filling

  • Forward filling means filling each missing value with the previous observed data point.
x = x.ffill()  # fillna(method='ffill') is deprecated in recent pandas
  • Backward filling means filling each missing value with the next observed data point.
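A tiny sketch of both fills on an assumed toy Series shows the difference, including the edge case where no earlier value exists:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])

print(s.ffill())  # NaN, 2, 2, 4  (first entry has no previous value)
print(s.bfill())  # 2, 2, 4, 4    (each gap takes the next observation)
```

Note that forward filling leaves a leading NaN untouched, so time-series pipelines often apply ffill followed by bfill to cover both ends.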

7. Hot Deck Impute (HDimpute)

  • HDI: Replace each missing value with an observed value drawn from a similar record (a “donor”). Eg: If the dataset has sex: female and male, and there are a few missing values in both groups, then fill the missing female values with values drawn from other female records, and similarly for male.
  • HDI attempts to preserve the distribution by substituting different observed values for each missing item.
  • HDI is computationally expensive.
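A minimal hot deck sketch, assuming toy column names sex and income: for each record with a missing value, a random donor value is drawn from the observed values of the same group, so the within-group distribution is preserved instead of collapsing onto a single mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sex':    ['F', 'F', 'F', 'M', 'M', 'M'],
    'income': [50.0, np.nan, 60.0, 70.0, 80.0, np.nan],
})

rng = np.random.default_rng(0)

def hot_deck(group):
    # observed values in this group act as the donor pool
    donors = group.dropna().to_numpy()
    return group.apply(lambda v: rng.choice(donors) if np.isnan(v) else v)

df['income'] = df.groupby('sex')['income'].transform(hot_deck)
print(df)  # missing F income drawn from {50, 60}; missing M from {70, 80}
```

Real hot deck schemes usually pick the donor by a richer similarity measure than group membership alone, which is where the computational cost mentioned above comes from.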


Written by Sowjanya Sadashiva

I am a computer science enthusiast with a Master's degree in Computer Science and a specialization in Data Science.
