How to Deal with Missing Values

One part of data cleaning is dealing with missing values. It is very common to find missing values in your datasets, and to train your model well you need to deal with them. There are three common approaches:

  • Get rid of the rows (i.e., instances) that have missing values
  • Get rid of the columns (i.e., features) that have missing values
  • Fill in the missing values with some value (e.g., zero, mean, median, … on numerical features, and most frequent value on categorical features)
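
As a rough illustration, the three approaches can be sketched in plain Python on a tiny made-up dataset. The column names (rooms, price, heating) and the use of None to mark a missing value are assumptions made for this example only; in practice a library such as pandas or scikit-learn would usually do this work.

```python
from statistics import median, mode

# Hypothetical toy dataset: each dict is one row (instance); None marks a missing value.
rows = [
    {"rooms": 3, "price": 100, "heating": "gas"},
    {"rooms": None, "price": 120, "heating": None},
    {"rooms": 4, "price": 110, "heating": "gas"},
    {"rooms": 5, "price": 130, "heating": "electric"},
]
columns = ["rooms", "price", "heating"]

# Approach 1: drop every row that has at least one missing value.
rows_dropped = [r for r in rows if all(r[c] is not None for c in columns)]

# Approach 2: drop every column that has at least one missing value.
complete_cols = [c for c in columns if all(r[c] is not None for r in rows)]
cols_dropped = [{c: r[c] for c in complete_cols} for r in rows]


def observed(col):
    """Return the non-missing values of one column."""
    return [r[col] for r in rows if r[col] is not None]


# Approach 3: fill missing values, using the median for the numerical
# feature and the most frequent value for the categorical feature.
fill_values = {
    "rooms": median(observed("rooms")),
    "heating": mode(observed("heating")),
}
filled = [
    {c: (fill_values[c] if r[c] is None else r[c]) for c in columns}
    for r in rows
]

print(len(rows_dropped), complete_cols, filled[1])
```

Dropping rows keeps every feature but loses instances, dropping columns keeps every instance but loses features, and filling keeps both at the cost of introducing estimated values.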

If you fill in missing values with a computed value (say, the median), you have to compute that value on the training set and use it to fill the missing values in the training set. The same training-set value (i.e., the training median) is then also used to fill in missing values in the validation and test sets when you evaluate your model.
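
A minimal sketch of that workflow, assuming a single numerical feature stored as Python lists with None for missing entries. The values and the helper name fill_missing are made up for illustration.

```python
from statistics import median

# Hypothetical values of one numerical feature in each split.
train = [3.0, None, 4.0, 5.0, 100.0]
valid = [None, 2.0]
test = [6.0, None]

# Compute the fill value from the observed training values only.
train_median = median(v for v in train if v is not None)


def fill_missing(values, fill_value):
    """Replace None entries with fill_value, leaving other entries unchanged."""
    return [fill_value if v is None else v for v in values]


# The same training-set median is reused for every split.
train_filled = fill_missing(train, train_median)
valid_filled = fill_missing(valid, train_median)
test_filled = fill_missing(test, train_median)

print(train_median, valid_filled, test_filled)
```

The validation and test medians are never computed here, so the fill value depends only on data the model was trained on.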

Generally, the median is preferred over the mean because the mean is sensitive to outliers. For example, suppose a district contains one very expensive house (e.g., $20,000,000) while most of the other houses in the district cost around $100,000. The median price would still be around $100,000, but the mean price would be much higher than the median. When there are outliers, the median is usually the more reasonable choice to represent the corresponding feature.
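
To make the arithmetic concrete, here is a small sketch using the prices from the example. The district size (nine ordinary houses plus one outlier) is an assumption chosen for illustration.

```python
from statistics import mean, median

# Hypothetical district: nine houses at $100,000 and one at $20,000,000.
prices = [100_000] * 9 + [20_000_000]

mean_price = mean(prices)      # pulled far upward by the single outlier
median_price = median(prices)  # unaffected by the single outlier

print(f"mean:   ${mean_price:,.0f}")
print(f"median: ${median_price:,.0f}")
```

With these numbers the mean is $2,090,000, more than twenty times the price of a typical house in the district, while the median stays at $100,000.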