Standardization with scikit-learn in Python

Scaling variables is important, especially for regression analysis. Roughly speaking, this is because an analysis performed without scaling is an analysis performed without considering the units of the variables.

In this blog post, we will look at standardization, one of the basic scaling methods, which is essential for regression analysis.

Why Standardization Is Needed

For example, the regression coefficients change when the scale of the variables changes. This means that the magnitudes of the regression coefficients are not comparable when the variables have different scales.

Here, as an example, we consider a regression model in which the length can be expressed either in “m” or in “cm”. This is an example of what happens when the unit is not taken into account.

More specifically, suppose you have the following formula in meters.

$$y = a_{m}x_{m}.$$

When the above formula is converted from meters to centimeters, it becomes as follows.

$$\begin{eqnarray*}
y
&&= \left( \frac{a_{m}}{100}\right) \times (100x_{m}),\\
&&=:a_{cm}x_{cm},
\end{eqnarray*}$$

where $a_{cm}=a_{m}/100$ and $x_{cm}=100x_{m}$.

From the above, we can see that the scales of variables and coefficients change even between equivalent expressions.
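
For instance, the following minimal sketch (with made-up data and a hypothetical true coefficient of 3.0) confirms that fitting the same linear model on lengths in meters and on the same lengths in centimeters gives coefficients that differ by a factor of 100.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_m = rng.uniform(1.0, 2.0, size=(100, 1))        # lengths in meters
x_cm = 100 * x_m                                  # the same lengths in centimeters
y = 3.0 * x_m[:, 0] + rng.normal(0, 0.01, 100)    # y = a_m * x_m with a_m = 3

coef_m = LinearRegression().fit(x_m, y).coef_[0]
coef_cm = LinearRegression().fit(x_cm, y).coef_[0]
print(coef_m)     # roughly 3.0
print(coef_cm)    # roughly 0.03 = 3.0 / 100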

In fact, the problem in this case is that the values of the gradients are no longer equivalent. Specifically, consider the following example.

$$\begin{eqnarray*}
\frac{\partial y}{\partial x_{m}} &&= a_{m}.
\end{eqnarray*}$$

And,

$$\begin{eqnarray*}
\frac{\partial y}{\partial x_{cm}} &&= a_{cm}.
\end{eqnarray*}$$

Since $a_{m} \neq a_{cm}$ (in fact, $a_{m}=100a_{cm}$),

$$\begin{eqnarray*}
\frac{\partial y}{\partial x_{m}}
\neq
\frac{\partial y}{\partial x_{cm}}.
\end{eqnarray*}$$

Namely, one step based on the gradient is not equivalent between the two representations.
This fact may cause the optimization to fail when the model is fitted with a gradient-based approach (e.g., the steepest descent method).

Mathematically speaking, this means that the gradient space is distorted.
Here, I will simply say that one gradient step should always be equivalent.
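
As a rough illustration of this point (a sketch with synthetic data, not a definitive implementation), the gradient of a squared-error loss with respect to the coefficient scales with the feature, so a single learning rate cannot suit both parameterizations.

import numpy as np

rng = np.random.default_rng(0)
x_m = rng.uniform(1.0, 2.0, size=100)    # lengths in meters
x_cm = 100 * x_m                         # the same lengths in centimeters
y = 3.0 * x_m                            # true relation y = a_m * x_m with a_m = 3

def grad(a, x, y):
    # Gradient of the loss 0.5 * mean((a * x - y) ** 2) with respect to a
    return np.mean((a * x - y) * x)

# Starting from a = 0 in both parameterizations
print(grad(0.0, x_m, y))     # moderate gradient
print(grad(0.0, x_cm, y))    # about 100 times larger gradient

A fixed step size that works well for the “m” parameterization is therefore far too large (or too small) for the “cm” one.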

Therefore, standardization is needed.

Definition of Standardization

Standardization is the operation of converting the mean and standard deviation of data into 0 and 1, respectively.

The mathematical definition is as follows.

$$\begin{eqnarray*}
\tilde{x}=
\frac{x-\mu}{\sigma}
,
\end{eqnarray*}$$

where $\mu$ and $\sigma$ are the mean and the standard deviation, respectively.
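
The formula translates directly into NumPy. Here is a minimal sketch with made-up data; note that, like scikit-learn's StandardScaler, it uses the population standard deviation (ddof=0).

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mu = x.mean()                 # mean
sigma = x.std()               # standard deviation (ddof=0)
x_tilde = (x - mu) / sigma    # standardized data

print(x_tilde.mean())         # approximately 0
print(x_tilde.std())          # approximately 1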

How to Standardize in Python?

It’s easy. With scikit-learn, standardization can be done in just 4 lines!

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()            # create the scaler instance
scaler.fit(data)                     # learn the mean and standard deviation of "data"
data_std = scaler.transform(data)    # apply the standardization to "data"

The code description is as follows.

1. Import “StandardScaler” from “sklearn.preprocessing” in scikit-learn
2. Create the instance for standardization with the “StandardScaler()” constructor
3. Calculate the mean and the standard deviation of “data” with “fit()”
4. Convert “data” into the standardized data “data_std” with “transform()”
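
As a quick self-contained check (with made-up data), the same result can also be obtained with “fit_transform()”, which combines steps 3 and 4; the learned statistics are then available as “scaler.mean_” and “scaler.scale_”.

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])    # two features on very different scales

scaler = StandardScaler()
data_std = scaler.fit_transform(data)    # fit() and transform() in one call

print(scaler.mean_)              # per-feature means: [2., 200.]
print(data_std.mean(axis=0))     # approximately [0., 0.]
print(data_std.std(axis=0))      # approximately [1., 1.]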

Summary

The above 4-line snippet comes up very often, so this post can serve as a cheat sheet. However, the important thing is to understand why standardization is needed.

Sometimes people compare the values of unscaled regression coefficients to judge the magnitude of each variable's contribution. However, anyone who understands the meaning of standardization can tell that this is wrong!

Standardization is so widely used that the author hopes this post helps readers a little.