How to handle data cleaning to build accurate machine learning models

Nowadays, companies consider data an active asset, since machine learning algorithms can use it to optimize business processes, reduce costs, and keep track of market and customer trends.

But applying machine learning takes more than the scikit-learn library and some knowledge of statistics. In real life, data comes from a variety of sources and in many flavors, forms, and colors, so it needs to be normalized before it enters a real machine learning workflow. According to an IBM article[1], data scientists spend about 80 percent of their time simply finding, cleaning, and organizing data, leaving only 20 percent to actually perform analysis.

But why do we need data cleaning? Well, inaccurate data can be dangerous. It can lead to misguided decisions that produce unexpected costs, loss of customers, and misunderstandings inside the team. On the technical side, missing data can cause a server error (500 error code) when calling the machine learning model, because many models do not accept null values when making predictions. (This last issue happened to me.) Besides, cleaning your data removes outdated, outlying, or noisy information and leaves you with higher-quality data, which is the kind of data our models need to be more accurate.

Now that we know why data cleaning matters, let’s take a look at ways to deal with missing data.

Drop it.

If missing values are rare and occur at random, drop the rows that contain them. Do the same with a column, or a set of columns, if most of its values are missing.
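As a minimal sketch of this step, the snippet below uses only the Python standard library. The toy dataset, the function names, and the 50 percent threshold are all assumptions made for illustration; in a pandas workflow, `DataFrame.dropna` covers the same ground.

```python
# Hypothetical toy dataset: each row is a dict and None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "fax": None},
    {"age": 29, "income": 48000, "fax": None},
    {"age": None, "income": 61000, "fax": None},
    {"age": 41, "income": 58000, "fax": "555-0101"},
    {"age": 38, "income": 45000, "fax": None},
]


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Remove columns whose share of missing values exceeds the threshold."""
    keep = []
    for column in rows[0]:
        missing = sum(1 for row in rows if row[column] is None)
        if missing / len(rows) <= max_missing_ratio:
            keep.append(column)
    return [{column: row[column] for column in keep} for row in rows]


def drop_incomplete_rows(rows):
    """Remove every row that still contains at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Drop mostly-empty columns first, so that they do not force us
# to throw away rows that are otherwise complete.
without_sparse_columns = drop_sparse_columns(rows)
cleaned = drop_incomplete_rows(without_sparse_columns)
```

The order matters here: dropping rows first would discard four of the five rows because of the nearly empty `fax` column.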

Impute it.

We can fill missing values based on other observations in our dataset. There are many methods to do this.

One option is to use statistics such as the mean or the median. Use the mean when the data is not skewed[2]; otherwise, I recommend the median, since it is not sensitive to outliers and is more robust when the data is skewed. Keep in mind that these methods do not guarantee unbiased data. Another option is a linear regression based on two existing data points; however, values filled in this way are sensitive to outliers.
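To make the difference between the two statistics concrete, here is a small sketch using the standard library's `statistics` module. The `impute` helper and the sample values are hypothetical; scikit-learn's `SimpleImputer` offers comparable strategies for real projects.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical skewed sample: one extreme value (400) among small ones.
incomes = [30, 32, 35, None, 31, 400]

filled_with_mean = impute(incomes, strategy="mean")
filled_with_median = impute(incomes, strategy="median")
```

With this sample, the single extreme value pulls the mean far above the typical values, while the median stays close to them, which is why the median tends to be the safer choice for skewed data.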

And finally, one of my favorites: k-nearest neighbors imputation. Here, a missing value is filled by finding the k entries nearest to the target entry. The replacement value is then either computed from the values of those nearest entries or taken from one of them.
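The idea can be sketched in plain Python as follows. This is only one possible variant, under several assumptions: distances are Euclidean and computed over the features both rows have observed, the replacement is the mean of the neighbors' values, and the `knn_impute` name and sample data are hypothetical. scikit-learn ships a `KNNImputer` class for production use.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean value of the k nearest rows."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(rows):
                # A neighbor must be a different row with column j observed.
                if other_index == i or other[j] is None:
                    continue
                # Compare only the features that both rows have observed.
                shared = [
                    c for c in range(len(row))
                    if c != j and row[c] is not None and other[c] is not None
                ]
                if not shared:
                    continue
                distance = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared)
                )
                candidates.append((distance, other[j]))
            if candidates:
                candidates.sort(key=lambda pair: pair[0])
                nearest = [v for _, v in candidates[:k]]
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical data: the last row is close to the first two rows.
data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [8.0, 9.0, 50.0],
    [1.05, 2.05, None],
]
filled = knn_impute(data, k=2)
```

Because the method relies on distances, features measured on very different scales usually need to be scaled first, or the largest one will dominate the choice of neighbors.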

Flag it.

The two previous methods share a disadvantage: they lose information, because missing data can be informative in itself. This happens when data is not missing at random. In that case, it is better to flag missing values, so that we keep track of the fact that those values are missing rather than merely “unknown” (the two concepts mean different things).
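One common way to flag values is to add an indicator column next to the original one, so the model can learn from the fact that a value was absent. The sketch below assumes a hypothetical `flag_missing` helper, made-up column names, and a placeholder of 0; the right placeholder depends on the data and the model.

```python
def flag_missing(rows, column, placeholder=0):
    """Add a '<column>_missing' flag and fill missing values with a placeholder."""
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows stay untouched
        is_missing = row[column] is None
        new_row[f"{column}_missing"] = is_missing
        if is_missing:
            new_row[column] = placeholder
        flagged.append(new_row)
    return flagged


# Hypothetical example: a customer with no recorded purchase amount.
customers = [
    {"id": 1, "last_purchase_amount": 120.0},
    {"id": 2, "last_purchase_amount": None},
]
flagged = flag_missing(customers, "last_purchase_amount")
```

The placeholder keeps null values away from the model, while the flag preserves the information that the value was never observed.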

As we’ve seen, solid domain knowledge and an understanding of the nature of our missing data are required for an appropriate data cleaning process. That process lets us build accurate machine learning models that lead to informed decisions and make us more competitive in a changing and challenging environment.

[1] Armand Ruiz Gabernet, Lead Offering Manager, IBM DSX & WML, Jay Limburn. (23 August 2017). Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity. https://www.ibm.com/cloud/blog/

[2] Omar Elgabry. (24 August 2019). Statistics & Probability — Exploratory Data Analysis. https://medium.com/omarelgabrys-blog/

Written by: Laura Angélica Cárdenas Vargas, Data Engineer at Uptime Analytics.