Handling Categorical Values

In this post, we will look at a common way to deal with categorical values (e.g., small, medium, large). Most machine learning algorithms work with numerical values, so we need to convert text values to numbers.

Given a feature (e.g., t-shirt size) with categorical values (e.g., [‘extra small’, ‘small’, ‘medium’, ‘large’, ‘extra large’]), a simple conversion is to map these values to numbers, such as [‘extra small’, ‘small’, ‘medium’, ‘large’, ‘extra large’] -> [0, 1, 2, 3, 4]. The mapping could just as easily be [1, 3, 2, 0, 4], depending on how it is generated (e.g., the order in which the values appear during data processing). The issue is that machine learning algorithms assume two nearby values are more similar than two distant values. Even though ‘large’ is semantically close to ‘extra large’, the converted numbers could end up far apart, such as 0 (for ‘large’) and 4 (for ‘extra large’).
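
As a minimal sketch, this kind of number mapping could be done with scikit-learn's OrdinalEncoder (assuming scikit-learn is installed; the t-shirt data below is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# One categorical feature: t-shirt size (illustrative data).
sizes = np.array([['small'], ['large'], ['extra small'], ['medium'], ['extra large']])

# Passing an explicit category order keeps the mapping deterministic;
# with the default categories='auto', the order comes from sorting the
# observed values, which may not match the natural size order.
encoder = OrdinalEncoder(
    categories=[['extra small', 'small', 'medium', 'large', 'extra large']]
)
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())  # [1. 3. 0. 2. 4.]
```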

A common approach to solve this issue is one-hot encoding. This is a vector representation in which one attribute is 1 (hot) while the others are 0 (cold). For example, ‘small’ would be converted to [0, 1, 0, 0, 0], and ‘large’ would be [0, 0, 0, 1, 0]. There is also an issue with this approach: what if a categorical feature has many categories (e.g., thousands)? The one-hot vectors would be huge, and most of their values would be 0, wasting a lot of memory. A sparse matrix (e.g., a SciPy sparse matrix) can be used to fix this issue, since it stores only the locations of the nonzero values.
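
As a rough sketch, one-hot encoding could look like this with scikit-learn's OneHotEncoder, which returns a SciPy sparse matrix by default (again assuming scikit-learn is available and using made-up data):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two samples of the t-shirt size feature (illustrative data).
sizes = np.array([['small'], ['large']])

encoder = OneHotEncoder(
    categories=[['extra small', 'small', 'medium', 'large', 'extra large']]
)
# fit_transform returns a SciPy sparse matrix, which stores only the
# locations and values of the nonzero entries.
sparse_encoded = encoder.fit_transform(sizes)

print(sparse_encoded.toarray())  # [[0. 1. 0. 0. 0.]
                                 #  [0. 0. 0. 1. 0.]]
```

Converting to a dense array with toarray() is only for inspection here; for a feature with thousands of categories, keeping the sparse representation is what saves the memory.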