Python Simple Imputer module


In this tutorial, we are going to learn about the SimpleImputer class of the scikit-learn (sklearn) library. It lives in the sklearn.impute module and replaces the older Imputer class that earlier versions of the library provided. We will discuss the SimpleImputer class and how we can use it in a Python program to handle missing data by replacing the missing values in a dataset.

What is SimpleImputer in Python?

SimpleImputer is a scikit-learn class that helps handle missing data in the dataset used for a predictive model. It replaces missing values (NaN by default) with a specified placeholder.

SimpleImputer class

SimpleImputer computes one replacement value per column according to a chosen strategy, such as the column mean, and writes that value wherever an entry is missing. To use it, we create an instance by calling the SimpleImputer() constructor and then call its fit() and transform() methods on the data.

Syntax for the SimpleImputer() constructor:

To use the SimpleImputer class in a Python program, we create an instance with the following syntax:

SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None)

Parameters: The following are the main parameters of the SimpleImputer() constructor:

missing_values: The placeholder that marks a missing value. Every occurrence of it is imputed, and the default is NaN (np.nan).
strategy: The method used to compute the replacement value. The default is 'mean'. It can be 'mean', 'median', 'most_frequent' (the mode) or 'constant'. The first three are measures of central tendency computed for each column.
fill_value: Used only when strategy is 'constant'. It is the constant value that replaces the missing values in the dataset.
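As a quick check of the constructor's parameters, the following minimal sketch creates an imputer without arguments and reads its settings back through get_params(), a method that every scikit-learn estimator provides. The variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Create an imputer without arguments so that every parameter keeps its default
default_imputer = SimpleImputer()
params = default_imputer.get_params()

print(params["missing_values"])  # nan
print(params["strategy"])        # mean
print(params["fill_value"])      # None

# The same parameters passed explicitly as keyword arguments
constant_imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
```

Passing the parameters by keyword, as in the last line, keeps the call readable and independent of argument order.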

The SimpleImputer class is part of the scikit-learn library, so to use it we first have to install scikit-learn on our system if it is not already present.

Installation of Sklearn library:

We can install scikit-learn by running the following command in the terminal or command prompt of our system:

pip install scikit-learn

After pressing the Enter key, pip downloads and installs the library. The package is named scikit-learn on PyPI, although it is imported in code as sklearn.

Now that scikit-learn is installed, we can move ahead with the SimpleImputer class.

Handling NaN values in the dataset with SimpleImputer class

Now, we will use the SimpleImputer class in a Python program to handle the missing values in a dataset. We will define a dataset that contains some missing values and then use a SimpleImputer instance, configured through its parameters, to replace them. Let’s understand the implementation through an example Python program.

Example 1: Look at the following Python program with a dataset having NaN values defined in it:

# Import numpy module as nmp  
import numpy as nmp  
# Importing SimpleImputer class from sklearn impute module  
from sklearn.impute import SimpleImputer  
# Setting up imputer function variable  
imputerFunc = SimpleImputer(missing_values = nmp.nan, strategy ='mean')  
# Defining a dataset  
dataSet = [[32, nmp.nan, 34, 47], [17, nmp.nan, 71, 53], [19, 29, nmp.nan, 79], [nmp.nan, 31, 23, 37], [19, nmp.nan, 79, 53]]  
# Print original dataset  
print("The Original Dataset we defined in the program: \n", dataSet)  
# Imputing dataset by replacing missing values  
imputerFunc = imputerFunc.fit(dataSet)  
dataSet2 = imputerFunc.transform(dataSet)  
# Printing imputed dataset  
print("The imputed dataset after replacing missing values from it: \n", dataSet2)

Output:

The Original Dataset we defined in the program: 
 [[32, nan, 34, 47], [17, nan, 71, 53], [19, 29, nan, 79], [nan, 31, 23, 37], [19, nan, 79, 53]]
The imputed dataset after replacing missing values from it: 
 [[32.   30.   34.   47.  ]
 [17.   30.   71.   53.  ]
 [19.   29.   51.75 79.  ]
 [21.75 31.   23.   37.  ]
 [19.   30.   79.   53.  ]]

Explanation:

We first imported the numpy module (to mark the missing values) and the SimpleImputer class from the sklearn.impute module. Then we created the imputer with the ‘mean’ strategy, so that each missing value is replaced by the mean of its column. After that, we defined the dataset as a nested list, using nmp.nan for the missing values, and printed it. Finally, we fitted the imputer on the dataset, which computes the column means, transformed the dataset to fill in the missing values, and printed the result.

As we can see in the output, the imputed dataset has the column mean in place of each missing value, and that’s how we can use the SimpleImputer class to handle NaN values in a dataset.
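To see which replacement values the imputer learned in Example 1, one option is to inspect its statistics_ attribute after fitting. The sketch below repeats the dataset from the example; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = [[32, np.nan, 34, 47], [17, np.nan, 71, 53], [19, 29, np.nan, 79],
        [np.nan, 31, 23, 37], [19, np.nan, 79, 53]]

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(data)

# statistics_ holds one fill value per column: the mean of the observed entries
column_means = imputer.statistics_
print(column_means)
```

The first three entries (21.75, 30 and 51.75) are the values that appear in the imputed output above. The fourth column has no missing values, so its mean is computed but never used.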

What is Missing Data

As the name suggests, when the value of an attribute is missing in the dataset, it is called a missing value. Handling missing values is tricky for data scientists because any wrong treatment of them can compromise the accuracy of the machine learning model.

Types of Missing Data

There are various characteristics of missing data that you should first understand before addressing it. The missing data falls in one of the following categories –

1. Missing at Random (MAR)

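In this scenario, the missingness is related to other observed variables in the dataset. E.g. in a survey, many women may leave the phone number field blank because of privacy concerns, so whether the value is missing depends on another recorded variable (gender) rather than on the phone number itself.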

2. Missing Completely at Random (MCAR)

In this scenario, the data is missing just randomly and there is no relationship with other variables in the dataset. E.g. some data might be missing randomly due to some technical issue or due to human error.

3. Missing Not at Random (MNAR)

In this scenario, the data is not missing randomly and the missingness depends on the very value that was supposed to be captured. MNAR is quite tricky to spot and deal with. E.g. in a survey form, wealthy respondents may leave the Income field blank because they would not like to disclose it.
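The three categories can be made concrete with a small simulation. This is only a sketch on synthetic data: the age and income variables, the 10% rate and the thresholds are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 80, size=n)           # hypothetical fully observed variable
income = rng.normal(50_000, 15_000, size=n)  # hypothetical variable that will get gaps

# MCAR: every income value has the same 10% chance of being missing
mcar_mask = rng.random(n) < 0.10

# MAR: missingness depends on another observed variable (age), not on income itself
mar_mask = age > 60

# MNAR: missingness depends on the hidden value itself (here, the top 10% of incomes)
mnar_mask = income > np.quantile(income, 0.90)

# Under MNAR the observed mean is biased downwards because the top incomes are hidden
income_mnar = np.where(mnar_mask, np.nan, income)
print(income.mean(), np.nanmean(income_mnar))
```

The last line shows why MNAR matters for imputation: a mean computed from the observed values no longer represents the full column.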

How to Deal with Missing Data

There are various strategies available to address the issue of missing data; however, which one works best depends on your dataset. There is no rule of thumb, so you will have to assess your dataset and experiment with various strategies.

1. Dropping the Variables with Missing Data

In this strategy, the row or column containing the missing data is deleted completely. This should be used cautiously as you may end up losing important information about the data. Domain knowledge is quite useful to decide whether dropping the columns is the ideal solution for your dataset.
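Dropping is not part of SimpleImputer. A minimal sketch with pandas, assuming a small made-up DataFrame, shows both variants using DataFrame.dropna():

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in "age" and one in "city"
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Goa"],
    "score": [88.0, 92.5, 79.0, 85.0],
})

rows_dropped = df.dropna()        # drop every row that has at least one missing value
cols_dropped = df.dropna(axis=1)  # drop every column that has at least one missing value

print(rows_dropped.shape)  # (2, 3)
print(cols_dropped.shape)  # (4, 1)
```

In this sketch, dropping rows discards half of the records and dropping columns discards two of the three variables, which illustrates how quickly information is lost.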

2. Imputation of Data

In this technique, the missing data is filled up or imputed by a suitable substitute and there are multiple strategies behind it.

i) Replace with Mean

Here all the missing data is replaced by the mean of the corresponding column. It works only with numeric fields. However, we have to be cautious here, because if the column contains outliers, its mean will be misleading.

ii) Replace with Median

Here the missing data is replaced with the median values of that column and again it is applicable only with numerical columns.

iii) Replace with Most Frequently Occurring Value

In this technique, the missing values are filled with the value which occurs the highest number of times in a particular column. This approach is applicable for both numeric and categorical columns.

iv) Replace with Constant

In this approach, the missing data is replaced by a constant value throughout. This can be used with both numeric and categorical columns.
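The caution about outliers can be checked with a short sketch. The single-column data below is made up, with 100 acting as the outlier:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column: three small values, one outlier and one missing entry
X = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_fill = SimpleImputer(strategy="mean").fit(X).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(X).statistics_[0]

print(mean_fill)    # 26.5
print(median_fill)  # 2.5
```

The mean is pulled far away from the typical values by the single outlier, while the median stays close to them, so the median is often the safer choice for skewed columns.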

Sklearn Simple Imputer

Sklearn provides a class, SimpleImputer, that can be used to apply all four of the imputation strategies for missing data that we discussed above.

Sklearn Imputer vs SimpleImputer

Older versions of sklearn had a class named Imputer, in the sklearn.preprocessing module, for doing all the imputation transformations. However, Imputer was deprecated in version 0.20 and removed in version 0.22, and it has been replaced by the SimpleImputer class in the sklearn.impute module. So for all imputation purposes, you should now use SimpleImputer in Sklearn.
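For code that still uses the old class, a migration might look like the sketch below. The legacy call is shown only in comments because it no longer runs on current releases; the tiny input array is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Legacy code (no longer runs on current scikit-learn releases):
#   from sklearn.preprocessing import Imputer
#   imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)

# Current equivalent: SimpleImputer always works column-wise and has no axis parameter
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform([[1.0, np.nan], [3.0, 4.0]])
print(filled)
```

Note that the missing-value marker is now the actual np.nan value rather than the string "NaN".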

Getting started with the SimpleImputer

To start using the SimpleImputer class, you must install the Scikit-Learn library in your machine alongside Python. You can run the following command from your command line/terminal to install scikit-learn using Python’s Package Manager (pip):

pip install scikit-learn

Once you’ve installed the library, you can import it in Python by running the following line of code in your Python IDE or Python Shell.

import sklearn

If running this line of code doesn’t give you an error, you’ve successfully installed Scikit-Learn and imported it in Python. Now, you can use the SimpleImputer to fill missing values.
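A slightly stronger check, sketched below, also prints the installed version and imports the class itself. SimpleImputer is available from scikit-learn 0.20 onwards.

```python
import sklearn
from sklearn.impute import SimpleImputer

# Print the installed version; SimpleImputer needs scikit-learn 0.20 or newer
print(sklearn.__version__)
```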

Performing imputation using the ‘mean’ strategy in SimpleImputer

The ‘mean’ strategy of SimpleImputer replaces missing values using the mean along each column and this can only be used with numeric data.

Here’s an example of how a ‘mean’ strategy can be used to fill missing values using the SimpleImputer:

# Importing the NumPy library to create nan values
import numpy as np
# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer
# Initializing the SimpleImputer object with missing_values and strategy defined
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fitting the SimpleImputer using a sample dataset
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
# Initializing a dataset that isn't fitted to the SimpleImputer
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
# Filling in the missing values in X using the fitted SimpleImputer
print(imp_mean.transform(X))

[[ 7. 2. 3. ] 
 [ 4. 3.5 6. ] 
 [10. 3.5 9. ]]

In the example above, you can see that we fitted the SimpleImputer using a sample dataset which in itself contained missing values. Then, the dataset X is transformed to fill in the missing values using the fitted SimpleImputer. This kind of imputation where you fill in the missing values with the mean is also known as ‘mean imputation’.
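The difference between fitting on one dataset and transforming another can be seen by comparing it with fit_transform(), which learns the statistics from the same data it fills. The sketch reuses the two datasets from the example; the names train, from_train and from_self are introduced here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = [[7, 2, 3], [4, np.nan, 6], [10, 5, 9]]
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")

# transform() reuses the column means learned from `train`
from_train = imp_mean.fit(train).transform(X)

# fit_transform() learns the means from X itself and fills X in one step
from_self = SimpleImputer(strategy="mean").fit_transform(X)

print(from_train[1, 1])  # 3.5, the mean of 2 and 5 in `train`
print(from_self[1, 1])   # 2.0, the only observed value in that column of X
```

In a machine learning workflow, the usual pattern is to fit on the training data and reuse the fitted imputer on validation and test data, so that no statistics leak from the data being evaluated.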

Performing imputation using the ‘median’ strategy in SimpleImputer

The ‘median’ strategy of SimpleImputer replaces missing values using the median along each column and this can only be used with numeric data.

Here’s an example of how a ‘median’ strategy can be used to fill missing values using the SimpleImputer:

# Importing the NumPy library to create nan values
import numpy as np
# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer
# Initializing the SimpleImputer object with missing_values and strategy defined
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
# Fitting the SimpleImputer using the given dataset
imp_median.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
# Initializing a sample dataset that isn't fitted
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
# Filling in the missing values in the sample dataset using the fitted SimpleImputer
print(imp_median.transform(X))

[[ 7. 2. 3. ]
 [ 4. 3.5 6. ]
 [10. 3.5 9. ]]

Performing imputation using the ‘most_frequent’ strategy in SimpleImputer

The ‘most_frequent’ strategy of SimpleImputer replaces missing values using the most frequent value along each column and it can be used with strings or numeric data. If there is more than one such value, only the smallest value is returned.

Here’s an example of how the ‘most_frequent’ strategy can be used to fill missing values using the SimpleImputer:

# Importing the NumPy library to create nan values
import numpy as np
# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer
# Initializing the SimpleImputer object with missing_values and strategy defined
imp_most_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Fitting the SimpleImputer using the given dataset
imp_most_freq.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
# Initializing a sample dataset that isn't fitted
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
# Filling in the missing values in the sample dataset using the fitted SimpleImputer
print(imp_most_freq.transform(X))

[[ 4. 2. 3.]
 [ 4. 2. 6.]
 [10. 2. 9.]]
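Since the ‘most_frequent’ strategy also works with strings, here is a minimal sketch on categorical data. The colour and size values are made up, and the array uses dtype=object so that text and np.nan can sit side by side.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical data stored as Python objects, with np.nan marking the gaps
items = np.array([["red", "small"],
                  [np.nan, "large"],
                  ["red", np.nan],
                  ["blue", "large"]], dtype=object)

imp_most_freq = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imp_most_freq.fit_transform(items)
print(filled)
```

The gap in the first column is filled with "red" and the gap in the second with "large", the most common value in each column.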

Performing imputation using the ‘constant’ strategy in SimpleImputer

The ‘constant’ strategy of SimpleImputer replaces missing values using a provided fill_value and it can be used with strings or numeric data.

Here’s an example of how the ‘constant’ strategy can be used to fill missing values using the SimpleImputer:

# Importing the NumPy library to create nan values
import numpy as np
# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer
# Initializing the SimpleImputer object with missing_values, strategy and fill_value defined
imp_constant = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=20)
# Fitting the SimpleImputer using the given dataset
imp_constant.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
# Initializing a sample dataset that isn't fitted
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
# Filling in the missing values in the sample dataset using the fitted SimpleImputer
print(imp_constant.transform(X))

[[20. 2. 3.]
 [ 4. 20. 6.]
 [10. 20. 9.]]
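The ‘constant’ strategy can be sketched for string data in the same way. The city names and the "unknown" placeholder below are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical text column with one missing entry
cities = np.array([["Pune"], [np.nan], ["Goa"]], dtype=object)

imp_constant = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="unknown")
filled = imp_constant.fit_transform(cities)
print(filled)
```

An explicit placeholder such as "unknown" keeps the fact that a value was missing visible to later processing steps.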

Examples of Simple Imputer in Sklearn

Create Toy Dataset

We will create a toy dataset of random numbers and then randomly set some values to null. To make this dataset more suitable for our examples, we copy one cell of the DataFrame into another so that one value occurs twice in column B.

In [1]:

import numpy as np
import pandas as pd

# Create a random dataset of 10 rows and 4 columns
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

# Randomly set some values as null
df = df.mask(np.random.random((10, 4)) < .15)

# Copy the value of row 9 into row 8 of column B so that one value occurs twice
df.loc[8, 'B'] = df.loc[9, 'B']
df

Out[1]:

	A	B	C	D
0	-0.520643	NaN	0.080238	NaN
1	1.225041	0.505089	-1.997088	NaN
2	-0.004976	-0.082857	0.376651	-0.626456
3	1.880424	0.527540	0.129820	1.384916
4	-0.476005	NaN	-1.499829	0.334039
5	2.134381	-0.365297	1.554248	-0.118477
6	0.103352	0.458311	0.424156	NaN
7	-0.759686	-0.356959	1.261324	-0.278455
8	-0.331476	-0.343604	-0.466366	-0.582135
9	-1.991735	-0.343604	-0.393095	-1.190406
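Before imputing, it helps to know how many values are missing in each column. The sketch below builds its own toy DataFrame with a fixed seed (an assumption made so that the run is reproducible), so its numbers differ from the table above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
toy_df = pd.DataFrame(rng.standard_normal((10, 4)), columns=list("ABCD"))
toy_df = toy_df.mask(rng.random((10, 4)) < 0.15)

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```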

i) Sklearn SimpleImputer with Mean

We first create an instance of SimpleImputer with strategy as ‘mean’. This is the default strategy, so the imputer uses the mean even if no strategy is passed. The dataset is then fit and transformed, and we can see that the null values of columns B and D are replaced by the mean of the respective columns.

In [2]:

from sklearn.impute import SimpleImputer

mean_imputer = SimpleImputer(strategy='mean')

result_mean_imputer = mean_imputer.fit_transform(df)

pd.DataFrame(result_mean_imputer, columns=list('ABCD'))

Out[2]:

	A	B	C	D
0	-0.520643	-0.000173	0.080238	-0.153853
1	1.225041	0.505089	-1.997088	-0.153853
2	-0.004976	-0.082857	0.376651	-0.626456
3	1.880424	0.527540	0.129820	1.384916
4	-0.476005	-0.000173	-1.499829	0.334039
5	2.134381	-0.365297	1.554248	-0.118477
6	0.103352	0.458311	0.424156	-0.153853
7	-0.759686	-0.356959	1.261324	-0.278455
8	-0.331476	-0.343604	-0.466366	-0.582135
9	-1.991735	-0.343604	-0.393095	-1.190406
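By default, fit_transform() returns a plain NumPy array, which is why the result is wrapped in pd.DataFrame again. When the original DataFrame has a meaningful index, one way to keep it is to pass both the columns and the index back in. The sketch uses a small stand-in DataFrame rather than the toy dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A small stand-in DataFrame with a non-default index
small_df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 7.0]},
                        index=["x", "y", "z"])

result = SimpleImputer(strategy="mean").fit_transform(small_df)  # plain NumPy array

# Rebuild a DataFrame that keeps the original column names and row labels
filled_df = pd.DataFrame(result, columns=small_df.columns, index=small_df.index)
print(filled_df)
```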

ii) Sklearn SimpleImputer with Median

We first create an instance of SimpleImputer with strategy as ‘median’ and then the dataset is fit and transformed. We can see that the null values of columns B and D are replaced by the median of the respective columns.

In [3]:

median_imputer = SimpleImputer(strategy='median')

result_median_imputer = median_imputer.fit_transform(df)

pd.DataFrame(result_median_imputer, columns=list('ABCD'))

Out[3]:

	A	B	C	D
0	-0.520643	-0.213231	0.080238	-0.278455
1	1.225041	0.505089	-1.997088	-0.278455
2	-0.004976	-0.082857	0.376651	-0.626456
3	1.880424	0.527540	0.129820	1.384916
4	-0.476005	-0.213231	-1.499829	0.334039
5	2.134381	-0.365297	1.554248	-0.118477
6	0.103352	0.458311	0.424156	-0.278455
7	-0.759686	-0.356959	1.261324	-0.278455
8	-0.331476	-0.343604	-0.466366	-0.582135
9	-1.991735	-0.343604	-0.393095	-1.190406

iii) Sklearn SimpleImputer with Most Frequent

We first create an instance of SimpleImputer with strategy as ‘most_frequent’ and then the dataset is fit and transformed. If several values are tied for the most frequent, SimpleImputer imputes with the smallest of them. We can see that the null values of column B are replaced with -0.343604, which is the most frequently occurring value in that column. In column D every value occurs only once, so the nulls are replaced by the smallest value, -1.190406.

In [4]:

most_frequent_imputer = SimpleImputer(strategy='most_frequent')
result_most_frequent_imputer = most_frequent_imputer.fit_transform(df)
pd.DataFrame(result_most_frequent_imputer, columns=list('ABCD'))

Out[4]:

	A	B	C	D
0	-0.520643	-0.343604	0.080238	-1.190406
1	1.225041	0.505089	-1.997088	-1.190406
2	-0.004976	-0.082857	0.376651	-0.626456
3	1.880424	0.527540	0.129820	1.384916
4	-0.476005	-0.343604	-1.499829	0.334039
5	2.134381	-0.365297	1.554248	-0.118477
6	0.103352	0.458311	0.424156	-1.190406
7	-0.759686	-0.356959	1.261324	-0.278455
8	-0.331476	-0.343604	-0.466366	-0.582135
9	-1.991735	-0.343604	-0.393095	-1.190406
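The tie-breaking behaviour is easier to verify on a tiny array. In the sketch below, which uses made-up numbers, the first column has a clear mode and the second column has only unique values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 5.0],
              [1.0, 3.0],
              [2.0, 4.0],
              [np.nan, np.nan]])

imp = SimpleImputer(strategy="most_frequent").fit(X)

# First column: 1.0 occurs twice. Second column: all values tie, so the smallest wins
fill_values = imp.statistics_
print(fill_values)
```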

iv) Sklearn SimpleImputer with Constant

We first create an instance of SimpleImputer with strategy as ‘constant’ and fill_value as 99. If we don’t supply fill_value, it defaults to 0 for numerical data. Also, for a numeric column, SimpleImputer does not accept a string as the fill_value. The dataset is fit and transformed and we can see that all nulls are replaced by 99.

In [5]:

constant_imputer = SimpleImputer(strategy='constant',fill_value=99)
result_constant_imputer = constant_imputer.fit_transform(df)
pd.DataFrame(result_constant_imputer, columns=list('ABCD'))

Out[5]:

	A	B	C	D
0	-0.520643	99.000000	0.080238	99.000000
1	1.225041	0.505089	-1.997088	99.000000
2	-0.004976	-0.082857	0.376651	-0.626456
3	1.880424	0.527540	0.129820	1.384916
4	-0.476005	99.000000	-1.499829	0.334039
5	2.134381	-0.365297	1.554248	-0.118477
6	0.103352	0.458311	0.424156	99.000000
7	-0.759686	-0.356959	1.261324	-0.278455
8	-0.331476	-0.343604	-0.466366	-0.582135
9	-1.991735	-0.343604	-0.393095	-1.190406
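The default fill value for numerical data can be confirmed with a short sketch that omits fill_value. The two-by-two array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.5, np.nan],
              [np.nan, 4.0]])

# No fill_value is given, so numeric gaps are filled with 0
filled = SimpleImputer(strategy="constant").fit_transform(X)
print(filled)
```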