Outliers are a very important aspect of data analysis. Detecting them has many applications, such as spotting fraud and identifying potential new trends in the market. Outliers are extreme values that deviate from the other observations in the data; they may indicate variability in a measurement, an experimental error, or a novelty. In other words, an outlier is an observation that diverges from the overall pattern of a sample.
To illustrate with an example, suppose we have a dataset of cars with a column that gives the weight of each car. The average weight is about 1.5 tons, and most cars weigh close to that. But then I see one or two values of around 30 tons. From this observation I conclude that these values are most likely mistakes or misleading data from the source. Anomalous values of this kind are known as outliers in data analysis.
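The car example can be sketched in a few lines of plain Python. The weights below are made up for illustration, and the 1.5 × IQR rule used to flag values is one common convention, assumed here rather than taken from any particular dataset.

```python
import statistics

# Hypothetical car weights in tons; 30.0 and 28.5 are the suspicious entries.
weights_tons = [1.4, 1.5, 1.6, 30.0, 1.3, 1.5, 1.7, 28.5, 1.2, 1.8]

# Quartiles with linear interpolation between data points.
q1, _median, q3 = statistics.quantiles(weights_tons, n=4, method="inclusive")
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
suspicious = [w for w in weights_tons if w < lower_bound or w > upper_bound]

print(suspicious)
```

Only the two values near 30 tons fall outside the bounds; the ordinary weights around 1.5 tons stay inside.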
How to detect and visualize outliers
Data visualization is a core discipline for analysts and optimizers, not just to better communicate results with executives, but to explore the data fully.
As such, outliers are often detected through graphical means, though you can also detect them with a variety of statistical methods using your favorite tool.
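As one possible example of a statistical method, here is a minimal z-score sketch. The article does not name this method; the data, the variable names, and the threshold of 3 are all assumptions made for illustration.

```python
import statistics

# Hypothetical car weights in tons: 20 ordinary cars plus one suspicious entry.
weights_tons = [round(1.2 + 0.03 * i, 2) for i in range(20)] + [30.0]

mean = statistics.mean(weights_tons)
std = statistics.stdev(weights_tons)  # sample standard deviation

# A threshold of 3 is a common convention, not a rule from this article.
Z_THRESHOLD = 3.0

# Flag values that lie more than Z_THRESHOLD standard deviations from the mean.
z_outliers = [w for w in weights_tons if abs(w - mean) / std > Z_THRESHOLD]

print(z_outliers)
```

Keep in mind that an extreme value inflates both the mean and the standard deviation, so on very small samples a z-score check can miss the very point it is meant to catch.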
Two of the most common graphical ways of detecting outliers are the boxplot and the scatterplot. A boxplot is my favorite way.
Here I show how to find outliers in the Boston housing dataset, loaded with load_boston from the sklearn Python library. Note that load_boston was removed in scikit-learn 1.2, so the code in this article needs an older version.
# Load the dataset from sklearn.
from sklearn.datasets import load_boston
# Import pandas for data manipulation.
import pandas as pd
# Import the library for visualization.
import matplotlib.pyplot as plt
Prepare your X-axis and Y-axis data from the dataset.
# Load the dataset once and reuse it.
boston = load_boston()
x_input = boston['data']
y_output = boston['target']
columns = boston['feature_names']
Discover outliers visually for the column DIS.
# Put the features in a DataFrame and draw a boxplot of DIS.
df = pd.DataFrame(data=x_input, columns=columns)
plt.boxplot(df['DIS'])
plt.show()
Output will be:
You could also use the seaborn library to display the graph.
import seaborn as sns
# Draw the same boxplot with seaborn.
sns.boxplot(x='DIS', data=df)
plt.show()
Note: Here you can clearly see the outliers in the graph: they are the dots that fall outside the whiskers, beyond the value 10 on the DIS axis. These may be mistakes or invalid data that need to be handled during preprocessing.
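To see which points a boxplot would draw as dots without rendering a figure, one option is matplotlib's `cbook.boxplot_stats` helper, which computes the same statistics the plot uses (with the default whisker length of 1.5 × IQR). The weights below are hypothetical stand-ins, not the DIS column.

```python
from matplotlib import cbook

# Hypothetical car weights in tons, standing in for a real column.
weights_tons = [1.4, 1.5, 1.6, 30.0, 1.3, 1.5, 1.7, 28.5, 1.2, 1.8]

# boxplot_stats returns one dict per dataset; we passed a single dataset.
stats = cbook.boxplot_stats(weights_tons)[0]

# 'fliers' are the points beyond the whiskers, i.e. the dots in the plot.
fliers = stats["fliers"].tolist()

print("whiskers:", stats["whislo"], stats["whishi"])
print("fliers:", fliers)
```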
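How to remove outliers? Good question!
There are different ways to handle outliers in data analysis. Here we shall discuss removing them mathematically, using the interquartile range (IQR): a value counts as an outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile.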
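Removal is the approach this article takes. As a sketch of one alternative, extreme values can be capped at the IQR bounds instead of being dropped, which keeps every row. The data and names below are hypothetical, and capping is an assumption of this example rather than something the article recommends.

```python
import statistics

# Hypothetical car weights in tons with two extreme entries.
raw_weights = [1.4, 1.5, 1.6, 30.0, 1.3, 1.5, 1.7, 28.5, 1.2, 1.8]

q1, _median, q3 = statistics.quantiles(raw_weights, n=4, method="inclusive")
iqr = q3 - q1
low_cap = q1 - 1.5 * iqr
high_cap = q3 + 1.5 * iqr

# Clamp each value into [low_cap, high_cap] instead of removing it.
capped_weights = [min(max(w, low_cap), high_cap) for w in raw_weights]

print(capped_weights)
```

Capping preserves the sample size but changes the data, so whether it is appropriate depends on why the extreme values are there.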
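# Discover outliers mathematically with the IQR rule.
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
# Mark every cell that falls outside the 1.5 * IQR bounds of its column.
outliers = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
# Keep only the rows that have no outlier in any column.
removed_outlier_df = df[~outliers.any(axis=1)]
sns.boxplot(x='DIS', data=removed_outlier_df)
plt.show()
Output is:
Here you can see that the outliers have been removed from the data, and the boxplot no longer shows them.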
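Note that this filter drops a row when any of its columns holds an outlier, not only the column being plotted. The sketch below uses a small made-up DataFrame (the column names and values are assumptions) to compare that behavior with filtering on a single column.

```python
import pandas as pd

# Hypothetical data: one weight outlier (row 3) and one price outlier (row 5).
cars = pd.DataFrame({
    "weight_tons": [1.4, 1.5, 1.6, 30.0, 1.3, 1.5, 1.7, 1.5, 1.2, 1.8],
    "price_k": [20, 21, 19, 22, 20, 95, 21, 20, 22, 19],
})

q1 = cars.quantile(0.25)
q3 = cars.quantile(0.75)
iqr = q3 - q1

# Boolean mask with the same shape as the DataFrame.
is_outlier = (cars < (q1 - 1.5 * iqr)) | (cars > (q3 + 1.5 * iqr))

# Variant 1: drop a row if ANY column holds an outlier.
any_column_clean = cars[~is_outlier.any(axis=1)]

# Variant 2: drop a row only if the weight column holds an outlier.
weight_only_clean = cars[~is_outlier["weight_tons"]]

print(any_column_clean.shape, weight_only_clean.shape)
```

The any-column variant removes both flagged rows, while the single-column variant keeps the car whose only unusual value is its price. Which one fits depends on how the cleaned data will be used.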
Complete Example with additional comments.
# Find the outliers and remove them from the data, as a data-cleaning step.
# Note: load_boston was removed in scikit-learn 1.2; this needs an older version.
from sklearn.datasets import load_boston
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

boston = load_boston()
print(boston.keys())
x_input = boston['data']
y_output = boston['target']
columns = boston['feature_names']
df = pd.DataFrame(data=x_input, columns=columns)
print(df.columns)

# Discover outliers with a visualization tool.
plt.boxplot(df['DIS'])
plt.show()

# The same plot with seaborn (alternative forms).
# sns.boxplot(x=df['DIS'])
# sns.boxplot(x='DIS', data=df)

# Discover outliers mathematically with the IQR rule.
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(df.shape)
outliers = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
# Keep only the rows that have no outlier in any column.
removed_outlier_df = df[~outliers.any(axis=1)]
print(removed_outlier_df.shape)
sns.boxplot(x='DIS', data=removed_outlier_df)
plt.show()