Feature Scaling

Feature scaling is one of the most important feature engineering steps for many machine learning algorithms (tree-based models such as decision trees don’t strictly need it). Most algorithms perform best when the numerical features share a similar scale. For example, when predicting housing prices, the features might include the number of rooms and the square footage. The number of rooms might range from 1 to 6, while the square footage might range from 500 to 4,000. Without feature scaling, the square-footage feature can dominate the prediction simply because its values are larger, not because it is more informative. Scaling the target values is generally not required.
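To make that concrete, here is a minimal sketch (with made-up feature values) showing how the squared Euclidean distance between two houses is dominated by the square-footage feature when no scaling is applied:

import numpy as np

# two houses: [number of rooms, square feet]
a = np.array([3.0, 1500.0])
b = np.array([4.0, 2100.0])

# per-feature contribution to the squared Euclidean distance
contrib = (a - b) ** 2
print(contrib)                   # [1.0, 360000.0]
print(contrib / contrib.sum())   # rooms contribute less than 0.001%

A distance-based model such as k-nearest neighbors would effectively ignore the number of rooms here.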

Let’s take a look at two common scaling approaches in Python: min-max scaling and standardization.

Min-Max Scaling

Min-max scaling is often also called normalization. It rescales numerical values into the range [0, 1].

$latex x_i^{new} = \frac{x_i - \min(x)}{\max(x) - \min(x)}$

The above equation shows how a value is scaled given a feature vector x: it shifts the value by subtracting the minimum and rescales the result by dividing it by the difference between the maximum and the minimum. The following code shows an example in scikit-learn:

import numpy as np
from sklearn import preprocessing

# each row is one instance with three numerical features on very different scales
X_train = np.array([[  1.,  100.,  1300.],
                    [ 10.,  120.,  1200.],
                    [ -5.,   80., -1300.],
                    [  1.,   70.,  2000.]])

# fit the scaler to the training data and transform it in one step
min_max_scaler = preprocessing.MinMaxScaler()
X_new = min_max_scaler.fit_transform(X_train)
X_new

Each row is the feature vector of one instance with three features. After scaling, every value lies between 0 and 1; within each column, the minimum maps to 0 and the maximum maps to 1.
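As a quick sanity check, the same result can be computed directly from the equation with NumPy (reusing X_train and X_new from the snippet above):

# column-wise min-max scaling, exactly as in the equation
X_manual = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
np.allclose(X_new, X_manual)   # True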

Standardization

Standardization shifts each feature to have a mean of zero by subtracting the mean, and rescales it to unit variance (a standard deviation of one) by dividing by the standard deviation.

$latex x_i^{new} = \frac{x_i - mean(x)}{stdev(x)}$

This doesn’t rescale values into a specific range. However, standardization is useful for models that make assumptions about the distribution of the inputs, such as Gaussian processes. It is also much less affected by outliers, whereas min-max scaling is sensitive to them because it relies on the maximum and minimum values, which may themselves be outliers (see the sketch at the end of this section). In scikit-learn:

# standardize each feature to zero mean and unit variance
std_scaler = preprocessing.StandardScaler()
X_new = std_scaler.fit_transform(X_train)
X_new
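Again, the result matches a direct computation of the equation (note that StandardScaler uses the population standard deviation, which is also NumPy’s default):

np.allclose(X_new, (X_train - X_train.mean(axis=0)) / X_train.std(axis=0))   # True

Finally, here is a rough sketch of the outlier point above, with made-up data. One extreme value squashes the min-max scaled values of all the typical points into a tiny sliver of the [0, 1] range, while standardization has no hard bounds (although the outlier still pulls the mean and inflates the standard deviation):

x = np.array([[1.], [2.], [3.], [4.], [1000.]])   # the last value is an outlier

preprocessing.MinMaxScaler().fit_transform(x).ravel()
# array([0.      , 0.001001, 0.002002, 0.003003, 1.      ])
# the four typical values end up within 0.003 of each other

preprocessing.StandardScaler().fit_transform(x).ravel()
# roughly array([-0.5, -0.5, -0.5, -0.5, 2.0])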