Handling Missing values — part 2

Sowjanya Sadashiva
May 22, 2023


There are various techniques to handle missing values.

  1. Mode/Median/Mean value replacement
  2. Remove entire record
  3. Most frequent value replacement
  4. Model based imputation
  5. Interpolation/Extrapolation
  6. Forward filling/Backward filling
  7. Hot deck imputation

Null value replacement

There are multiple reasons for missing values, and they are often represented as NaN (Not a Number) or Null. These missing values need to be handled before modeling the data.

Which technique to use completely depends on the dataset.

1. Mode/Median/Mean value replacement

  • df.isnull().sum() prints the number of missing values in each column.
  • sklearn.impute.SimpleImputer is used for missing values imputation using mean, median, mode.
  • Plots such as box plots and density plots are very handy for deciding which technique to use.
  • Mean: It is often not a good measure when the data is skewed, since an outlier can change the mean significantly. Use it when the variable is approximately normal and outliers are highly unlikely. It cannot be used for categorical features.
  • Median: It is less sensitive to outliers than the mean, so it is preferred when the data is skewed.
  • Mode: It is suitable for categorical variables or numerical variables with a small number of unique values.
#for mean replacement

import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_mean.transform(X).ravel()

# ravel() is used to unravel the imputed array into a vector
#for median replacement

import numpy as np
from sklearn.impute import SimpleImputer

imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imp_median.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_median.transform(X).ravel()
  • The missing values are replaced with mean, median or mode.
  • Standard deviation and variance go down, so the variability of the data is underestimated.
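The shrinking variance is easy to demonstrate. A minimal sketch (the column values below are assumed for illustration): filling every gap with the mean places the new points exactly at the center, so the spread of the filled column is smaller than that of the observed values.

```python
import numpy as np

col = np.array([7.0, np.nan, 4.0, np.nan, 10.0, 5.0])
observed = col[~np.isnan(col)]

# replace each NaN with the mean of the observed values
filled = np.where(np.isnan(col), observed.mean(), col)

print(observed.std())  # spread of the observed values only
print(filled.std())    # smaller: imputed points add no spread
```

The means of the two arrays are identical, but the standard deviation drops, which is exactly the lower-variability effect described above.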

2. Remove entire record

  • Remove every record that has a missing value.
  • With this technique we can lose plenty of data, including important values.
  • Training on fewer records also makes the model more prone to overfitting.
  • Use the pandas drop() and dropna() functions.
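A minimal sketch of how much data dropna() can discard, on an assumed toy DataFrame where missing values are spread across rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan]})

print(len(df))           # 3 rows before
print(len(df.dropna()))  # only 1 complete row survives
```

Here each NaN sits in a different row, so dropping records removes two thirds of the dataset even though only two cells were missing.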

3. Most frequent value replacement

  • Simple Imputer can be used to replace the missing values with the most frequently occurring values.
  • It works with categorical features.
  • It does not account for correlations between features.
  • It can introduce bias into the model by favoring most repeating values in the dataset.
#for most frequent replacement

import numpy as np
from sklearn.impute import SimpleImputer

imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mf.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_mf.transform(X).ravel()

4. Model based imputation

  • Uses models such as K-Nearest Neighbors, regression, or deep learning.
  • KNN: A simple yet effective method that uses ‘feature similarity’ to predict missing values.
  • Each sample’s missing values are imputed using the mean value from the n_neighbors nearest neighbors found in the training set.
# missingpy offers a KNNImputer whose API mirrors scikit-learn's
from missingpy import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights='uniform')
imputer.fit_transform(attribute).ravel()
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)
# array([[1. , 2. , 4. ],
#        [3. , 4. , 3. ],
#        [5.5, 6. , 5. ],
#        [8. , 8. , 7. ]])
  • Regression: Predictors of the variable with missing values are identified via a correlation matrix. The best predictors are selected and used as independent variables in a regression equation, with the variable that has missing data as the target variable.
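One way to sketch regression-based imputation is scikit-learn's IterativeImputer (still flagged experimental, hence the enable import), which regresses each feature with missing values on the other features; the toy data below is assumed for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# second column is roughly twice the first, so the missing
# first-column entry can be predicted from that relation
X = [[1, 2], [3, 6], [4, 8], [np.nan, 3]]

imp = IterativeImputer(max_iter=10, random_state=0)
print(imp.fit_transform(X))
```

The imputer fills the NaN with a regression prediction rather than a column-wide constant, so the filled value respects the correlation between the two features.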
  • Deep Learning: works well with categorical and non-numeric variables.

5. Interpolation/Extrapolation

  • Interpolation: Estimating a value that lies between two known values in a sequence of values.
  • Interpolation is used to ascertain missing data points within a ‘known’ data range.
  • Extrapolation: Estimating an unknown value by extending a known sequence of values or facts, i.e., inferring something not explicitly stated from existing information.
  • Extrapolation is the process of ascertaining missing data points outside the known data range.
  • For example, given the function y = 2x + 5, you can extrapolate values of y when x lies outside the range of the present data.
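Interpolation within a known range can be sketched with pandas, whose Series.interpolate fills gaps linearly by default (the toy series below is assumed):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# each NaN is replaced by the straight-line value between its neighbors
print(s.interpolate())  # 1, 2, 3, 4, 5
```

Because both neighbors of each gap are known, this is interpolation; extending the same line beyond index 4 would be extrapolation.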

6. Forward filling/ Backward filling

  • Forward filling means filling each missing value with the previous observed data point.
x = x.ffill()  # fillna(method='ffill') is deprecated in recent pandas
  • Backward filling means filling each missing value with the next observed data point.
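A tiny sketch of both fills on an assumed toy Series shows the difference, including the edge case where no earlier value exists:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])

print(s.ffill())  # NaN, 2, 2, 4  (first entry has no previous value)
print(s.bfill())  # 2, 2, 4, 4    (each gap takes the next observation)
```

Note that forward filling leaves a leading NaN untouched, so time-series pipelines often apply ffill followed by bfill to cover both ends.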

7. Hot Deck Impute (HDimpute)

  • HDI: Replace each missing value with an observed value drawn from a similar record (a “donor”). Eg: If the dataset has sex: female and male, and there are a few missing values in both groups, then fill the missing female values with values drawn from other female records, and similarly for male.
  • HDI attempts to preserve the distribution by substituting different observed values for each missing item.
  • HDI is computationally expensive.
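A minimal hot deck sketch, assuming toy column names sex and income: for each record with a missing value, a random donor value is drawn from the observed values of the same group, so the within-group distribution is preserved instead of collapsing onto a single mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sex':    ['F', 'F', 'F', 'M', 'M', 'M'],
    'income': [50.0, np.nan, 60.0, 70.0, 80.0, np.nan],
})

rng = np.random.default_rng(0)

def hot_deck(group):
    # observed values in this group act as the donor pool
    donors = group.dropna().to_numpy()
    return group.apply(lambda v: rng.choice(donors) if np.isnan(v) else v)

df['income'] = df.groupby('sex')['income'].transform(hot_deck)
print(df)  # missing F income drawn from {50, 60}; missing M from {70, 80}
```

Real hot deck schemes usually pick the donor by a richer similarity measure than group membership alone, which is where the computational cost mentioned above comes from.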


Written by Sowjanya Sadashiva

I am a computer science enthusiast with a Master's degree in Computer Science and a specialization in Data Science.
