6 minute read

In Machine learning the variables that are measured at different scales can impact the numerical stability and precision of the estimators

Some of the Machine Learning estimators may behave in-appropriately if the features have different ranges. For example, Salary in range 50K-150K and Age in range 32-68 years

Data Standardization is the process of putting different variables on the same scale. This process allows you to compare scores between different types of variables

In this post we will see how sklearn.preprocessing module is used for scaling, centering, normalization and Transforming data

We will discuss two methods for sklearn.preprocessing i.e., Standard scaler and MinMaxScaler in this post and will briefly touch on other methods as well

Standard Scaler

Using StandardScaler function of sklearn.preprocessing we are standardizing and transforming the data in such a way that the mean of the transformed data is 0 and the Variance is 1

The fit() method is used to compute the mean and std dev which will be used for later scaling. This method computes it separately for each feature in a dataset

Next Transform method is used to perform Standardization by centering and Scaling.

So fit_transform() method can be used on Train set for computing the mean and std for each feature and then transform() can be later used on with the Test data set

Let’s look at an example on how it works

Create a 2D matrix with random integers with range [-1,3]

from sklearn import preprocessing
import numpy as np
X_train = np.random.randint(-1,3, size=(3, 3))
array([[-1, -1, -1],
       [ 0,  1,  2],
       [ 0, -1, -1]])

Create an Object of StandardScaler

scaler = preprocessing.StandardScaler()
StandardScaler(copy=True, with_mean=True, with_std=True)

We will use fit() function to compute the mean, variance and std dev of the X_train along axis = 0

print('  mean: ',scaler.mean_,'n','variance: ',scaler.var_)
mean:  [-0.33333333 -0.33333333  0. ]
variance:  [0.22222222 0.88888889 2. ]

Now transform the X-train data to remove the mean and scale Variance to 1

array([[-1.41421356, -0.70710678, -0.70710678],
       [ 0.70710678,  1.41421356,  1.41421356],
       [ 0.70710678, -0.70710678, -0.70710678]])

This is finally the transformed data using the Standardscaler with mean Zero and Standard Deviation and Variance of 1

Let’s use the preprocessing scale() function to transform the same dataset and see if it gives the same result or not

X_transformed = preprocessing.scale(X_train)
array([[-1.41421356, -0.70710678, -0.70710678],
       [ 0.70710678,  1.41421356,  1.41421356],
       [ 0.70710678, -0.70710678, -0.70710678]])

So we get the same result here as well. Let’s check the mean and Variance of the transformed data

print(' Mean:  ',X_transformed.mean(axis=0),'n','Stdev: ',X_transformed.std(axis=0))
Mean:   [-7.40148683e-17 -7.40148683e-17  0.00000000e+00]
Stdev:  [1. 1. 1.]

The mean and Stdev along axis = 0 is zero and one respectively so the data is standardized appropriately

Let’s try the same scaler on a New dataset

X_test = np.random.randint(-1,3, size=(2,3))
array([[1, 2, 2],
       [2, 1, 2]])

Now use the same scaler instance as above and transform this new or test dataset the same way as it did it for X_train

array([[-1.41421356,  2.47487373,  1.33630621],
       [ 0.70710678,  1.41421356,  1.33630621]])

MinMax Scaler

It standardize the data by transforming within a given range. Moreover the Feature scaling lies between a minimum and maximum value so that the maximum absolute value of each feature is scaled to unit size

Minmaxscaler also scales each of the feature individually for example between 0 and 1. This is good scaling options when you want to preserve the zero’s in a sparse dataset

Following formula is used for the transformation

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

Let’s see how it works for a small dataset. We will use the X_train dataset from above

We will scale X_train to a range [0,3]. The feature_range param used for setting up the desired range of transformed data

The fit_transform method fits to data and then transforms it

min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 3))
X_train_minmax = min_max_scaler.fit_transform(X_train)

We can use the same instance of min_max_Scaler on the X_test dataset created above

X_test_minmax = min_max_scaler.transform(X_test)
array([[ 3. ,  0. ,  3. ],
       [ 4.5, -3. ,  3. ]])

Use scale_ attribute to check the min_max_scaler attributes to determine the exact nature of the transformation learned on the training data

The scale_ attribute is Per feature relative scaling of the data. Equivalent to (max - min) / (X.max(axis=0) - X.min(axis=0))

Let’s check the scale_ attributes that is learnt for our example

array([1.5, 1.5, 1. ])

There is another function min_max_scale which is an equivalent function without the estimator API

Real World Dataset

Let’s work on real dataset and check how standard and min_max scaling works on features of sklearn wine dataset

from sklearn.datasets import load_wine
windata = load_wine()

Features of Wine dataset


We will consider only features alcalinity_of_ash and magnesium for scaling and standardization since there are outliers available in these features

X = windata.data[:,[3,4]]
y = windata.target
# scale the output between 0 and 1 for the colorbar
y = minmax_scale(y)

Let’s do a StandardScaler transform of the features

X_transform = StandardScaler().fit_transform(X)

We will plot the data with their target data(Y) as colormaps in a 2D axis

Additionally Standard scalar removes the mean therefore the spread of transformed data for each feature is different. Transformed alcalinity_of_ash has a range of [-2,3] whereas transformed magnesium is squeezed in [-2,4]

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

from matplotlib import cm

The scatter plot

fig, (ax1,ax2) = plt.subplots(1, 2,figsize=(20,6))
colors = cmap(y)
ax2.scatter(X_transform[:, 0], X_transform[:, 1], alpha=0.5, marker='o', s=10, lw=2, c=colors)
ax1.scatter(X[:, 0], X[:, 1], alpha=0.5, marker='o', s=10, lw=2, c=colors)
ax1.set_title("Before Scaling")
ax2.set_title("After Scaling")

You can explore the minmax Standardization on wine dataset on your own and check how far each of the features squeezed in

Other Preprocessing scalers, transformers, and normalizers

The scope of this post is only limited to evaluate minmax and standard scaler. I will briefly touch upon other methods that are available and can be explored for normalization or standardization of data.

This is not a complete list but a subset

  1. preprocessing.MaxAbsScaler: Scales each feature by its maximum absolute value. It does not shift/center the data, and therefore does not destroy any sparsity
  2. preprocessing.Normalizer: Normalize samples features individually to unit norm. it is mostly used for text classification or clustering for instance
  3. preprocessing.PowerTransformer: Apply a power transform on features to make it Gaussian like. Common methods are Box-Cox and Yeo-Johnson transform(if data contains negatives)
  4. preprocessing.QuantileTransformer: It transform features using Quantiles to follow a uniform and normal distribution


Here are the key takeways and summary of this entire post:

  • Sklearn preprocessing module is used for Scaling, Normalization and Standardization of the data
  • StandardScaler removes the mean and scales the variance to unit value
  • Minmax scaler scales the features to a specific range often between zero and one so that the maximum absolute value of each feature is scaled to unit size
  • Maxabs scaler scales in a way that the training data lies within the range [-1,1] by dividing each data point with the max value. Assumption is that the data is already centered to Zero
  • Power transformer is used to transform the data into Normal distribution like and it is a type of Non linear transform method
  • Moreover Box cox is used for only positive dataset and if there are negatves in the data then preferred power transform method is Yeo-Johnson transform
  • Normalization is often used in text classification and clustering contexts. It is the process of scaling the data samples to unit norm
  • Quantile Transform is a type of Non-linear transform and transforms data to follow a unform or normal distribution