How to Deal with Missing Values

One step in data cleaning is dealing with missing values, which are very common in real datasets. To train your model well, you need to handle them first. There are three common approaches:

Get rid of the rows (i.e., instances) that have missing values
Get rid of the columns (i.e., features) that have missing values
Fill in the missing values with some value (e.g., zero, the mean, the median, … for numerical features, and the most frequent value for categorical features)
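
As a rough illustration of the three approaches, here is a minimal sketch in plain Python. The toy rows, the feature names (rooms, age, price), and the use of None to mark a missing value are all assumptions made for this example; in practice a data-frame library would typically do this work.

```python
from statistics import median

# Hypothetical toy data: each dict is a row (instance), None marks a missing value.
rows = [
    {"rooms": 3, "age": 10, "price": 100_000},
    {"rooms": None, "age": 25, "price": 120_000},
    {"rooms": 4, "age": 7, "price": 95_000},
    {"rooms": 2, "age": 40, "price": 90_000},
]
columns = list(rows[0])

# Approach 1: get rid of the rows that have a missing value.
rows_dropped = [r for r in rows if all(r[c] is not None for c in columns)]

# Approach 2: get rid of the columns that have a missing value.
complete_columns = [c for c in columns if all(r[c] is not None for r in rows)]
columns_dropped = [{c: r[c] for c in complete_columns} for r in rows]

# Approach 3: fill in the missing values, here with the median of the observed values.
rooms_median = median(r["rooms"] for r in rows if r["rooms"] is not None)
filled = [
    {**r, "rooms": rooms_median if r["rooms"] is None else r["rooms"]}
    for r in rows
]
```

Dropping rows or columns is simple but discards data, while filling keeps every row and column at the cost of introducing an estimated value.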

If you fill in missing values with a computed value (say, the median), compute that value on the training set only and use it to fill the missing values in the training set. Then use the same training-set value to fill in missing values in the validation and test sets when you evaluate your model, rather than recomputing it on those sets.
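
The fit-on-training, apply-everywhere idea can be sketched as follows. The helper names fit_median and fill_missing, and the sample values, are hypothetical and chosen only for this illustration.

```python
from statistics import median

def fit_median(values):
    """Compute the fill value from the observed (non-missing) values."""
    return median(v for v in values if v is not None)

def fill_missing(values, fill_value):
    """Replace every missing entry (None) with the given fill value."""
    return [fill_value if v is None else v for v in values]

# Hypothetical feature values for one numerical feature, split three ways.
train = [100, None, 120, 95, 110]
validation = [None, 105]
test = [130, None]

# The fill value is computed on the training set only...
fill_value = fit_median(train)

# ...and then reused unchanged on all three sets.
train_filled = fill_missing(train, fill_value)
validation_filled = fill_missing(validation, fill_value)
test_filled = fill_missing(test, fill_value)
```

Keeping the computation and the filling in separate functions makes it hard to accidentally recompute the median on the validation or test data.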

Generally, the median is preferred to the mean because the mean is sensitive to outliers. For example, suppose a district contains one very expensive house (e.g., $20,000,000) while most of its houses cost around $100,000. The median price would be around $100,000, but the mean price would be much higher. When there are outliers, the median is usually the more reasonable choice to represent the feature.
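
To make the house-price example concrete, here is a small sketch. The district of ten houses (nine at $100,000 and one at $20,000,000) is a made-up dataset used only to show the arithmetic.

```python
from statistics import mean, median

# Hypothetical district: nine ordinary houses and one very expensive outlier.
prices = [100_000] * 9 + [20_000_000]

# The single outlier pulls the mean far above a typical price.
mean_price = mean(prices)

# The median is the middle of the sorted prices, so the outlier barely matters.
median_price = median(prices)

print(f"mean:   {mean_price:,.0f}")
print(f"median: {median_price:,.0f}")
```

Here the mean is (9 × 100,000 + 20,000,000) / 10 = 2,090,000, more than twenty times the median of 100,000, even though only one house in ten is expensive.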
