Handling Categorical Values

In this post, we will look at a common way to deal with categorical values (e.g., small, medium, large). Most machine learning algorithms work with numerical values, so we need to convert text values to numbers.

Given a feature (e.g., t-shirt size) with categorical values (e.g., [‘extra small’, ‘small’, ‘medium’, ‘large’, ‘extra large’]), a simple conversion is to map these values to numbers, such as [‘extra small’, ‘small’, ‘medium’, ‘large’, ‘extra large’] -> [0, 1, 2, 3, 4]. The mapping could just as easily be [1, 3, 2, 0, 4], depending on how it is generated (e.g., the order in which the values appear during data processing). The issue is that machine learning algorithms assume two nearby values are more similar than two distant values. Even though ‘large’ is semantically close to ‘extra large’, the converted numbers could end up far apart, such as 0 (for ‘large’) and 4 (for ‘extra large’).
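
As a minimal sketch, this kind of number mapping could be done with scikit-learn's OrdinalEncoder (assuming scikit-learn is installed; the t-shirt data below is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# One categorical feature: t-shirt size (illustrative data).
sizes = np.array([['small'], ['large'], ['extra small'], ['medium'], ['extra large']])

# Passing an explicit category order keeps the mapping deterministic;
# with the default categories='auto', the order comes from sorting the
# observed values, which may not match the natural size order.
encoder = OrdinalEncoder(
    categories=[['extra small', 'small', 'medium', 'large', 'extra large']]
)
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())  # [1. 3. 0. 2. 4.]
```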

A common approach to solve this issue is one-hot encoding. This is a vector representation in which one attribute is 1 (hot) while the others are 0 (cold). For example, ‘small’ would be converted to [0, 1, 0, 0, 0], and ‘large’ would be [0, 0, 0, 1, 0]. There is also an issue with this approach: what if a categorical feature has many categories (e.g., thousands)? The one-hot vectors would be huge, and most of their values would be 0, wasting a lot of memory. A sparse matrix (e.g., a SciPy sparse matrix) can be used to fix this issue, since it stores only the locations of the nonzero values.
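
As a rough sketch, one-hot encoding could look like this with scikit-learn's OneHotEncoder, which returns a SciPy sparse matrix by default (again assuming scikit-learn is available and using made-up data):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two samples of the t-shirt size feature (illustrative data).
sizes = np.array([['small'], ['large']])

encoder = OneHotEncoder(
    categories=[['extra small', 'small', 'medium', 'large', 'extra large']]
)
# fit_transform returns a SciPy sparse matrix, which stores only the
# locations and values of the nonzero entries.
sparse_encoded = encoder.fit_transform(sizes)

print(sparse_encoded.toarray())  # [[0. 1. 0. 0. 0.]
                                 #  [0. 0. 0. 1. 0.]]
```

Converting to a dense array with toarray() is only for inspection here; for a feature with thousands of categories, keeping the sparse representation is what saves the memory.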