Data Preparation for Machine Learning: A Step-by-Step Guide

Spotify avoided a crucial mistake many companies make when preparing data for machine learning: investing too little effort in the stage or skipping it altogether.

Many businesses assume that feeding large volumes of data into an ML engine is enough to generate accurate predictions. In reality, this approach can cause a number of problems, such as algorithmic bias or limited scalability.

The success of machine learning depends heavily on data.

And the sad truth is: all data sets are flawed. That is why data preparation is crucial for machine learning. It helps rule out inaccuracies and bias inherent in raw data, so that the resulting ML model generates more reliable and accurate predictions.

In this blog post, we highlight the importance of preparing data for machine learning and share our approach to collecting, cleaning, and transforming data. So, if you’re new to ML and want to ensure your initiative turns out a success, keep reading.

How to prepare data for machine learning

The first step towards successfully adopting ML is clearly formulating your business problem. Not only does it ensure that the ML model you’re building is aligned with your business needs, but it also allows you to save time and money on preparing data that might not be relevant.

Additionally, a clear problem statement helps make the ML model explainable (meaning users understand how it makes decisions). This is especially important in sectors like healthcare and finance, where machine learning has a major impact on people’s lives.

With the business problem nailed down, it’s time to kick off the data work.

Overall, the process of preparing data for machine learning can be broken down into the following stages:

Data collection
Data cleaning
Data transformation
Data splitting

Let’s have a closer look at each.

Data collection

Data preparation for machine learning starts with data collection. During this stage, you gather data for training and tuning the future ML model. As you do so, keep in mind the type, volume, and quality of data: these factors will determine the best data preparation strategy.

Machine learning uses three types of data: structured, unstructured, and semi-structured.

Structured data is organized in a specific way, typically in a table or spreadsheet format. Examples include records from databases or transactional systems.
Unstructured data includes images, videos, audio recordings, and other information that does not follow conventional data models.
Semi-structured data doesn’t follow the format of a tabular data model. Still, it’s not completely disorganized, as it contains some structural elements, like tags or metadata, that make it easier to interpret. Examples include data in XML or JSON formats.
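
To make the idea of semi-structured data more concrete, here is a minimal Python sketch that flattens a nested JSON record into a single table-like row. The record and its field names are invented for illustration, and the flatten helper is our own, not part of any library.

```python
import json


def flatten(record, prefix=""):
    """Turn a nested dict into a flat dict with dot-separated column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical JSON record, e.g. one order exported from a web shop.
raw = '{"id": 7, "customer": {"name": "Ada", "country": "UK"}, "total": 19.5}'
row = flatten(json.loads(raw))
```

Each flattened row can then be treated like a record of structured data, with one column per key.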

The structure of the data determines the optimal approach to preparing data for machine learning. Structured data, for example, can be easily organized into tables and cleaned via deduplication, filling in missing values, or standardizing data formats.

In contrast, extracting relevant features from unstructured data requires more complex techniques, such as natural language processing or computer vision.

The optimal approach to data preparation for machine learning is also affected by the volume of training data. A large dataset may require sampling, which involves selecting a subset of the data to train the model due to computational limitations. A smaller one, in turn, may require data scientists to take additional steps to generate more data based on the existing data points (more on that below).
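
One way to sample a dataset that is too large to load at once is reservoir sampling, which keeps a fixed-size random subset while reading the data in a single pass. The sketch below only illustrates that idea: the reservoir_sample helper, the sample size, and the stand-in data stream are all assumptions, not a prescribed method.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stand-in for a large dataset that is read row by row.
rows = range(100_000)
subset = reservoir_sample(rows, k=1000)
```

If the whole dataset fits in memory, a plain random sample of rows is simpler and usually sufficient.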

The quality of collected data is crucial as well. Using inaccurate or biased data can affect ML output, which can have significant consequences, especially in such areas as finance, healthcare, and criminal justice. There are techniques that allow data to be corrected for error and bias. However, they may not work on a dataset that is inherently skewed.

Once you know what makes “good” data, you must decide how to collect it and where to find it. There are several strategies for that:

Collecting data from internal sources: If you have information stored in your enterprise data warehouse, you can use it for training ML algorithms. This data could include sales transactions, customer interactions, data from social media platforms, and other sources.
Collecting data from external sources: You can turn to publicly available data sources, including government data portals, academic data repositories, and data-sharing communities, such as Kaggle, the UCI Machine Learning Repository, or Google Dataset Search.
Web scraping: This technique involves extracting data from websites using automated tools. This approach may be useful for collecting data from sources that are not accessible through other means, such as product reviews, news articles, and social media.
Surveys: This approach can be used to collect specific data points from a specific target audience. It is especially useful for collecting information on user preferences or behavior.

Sometimes, though, these strategies don’t yield enough data. You can compensate for the lack of data points with these techniques:

Data augmentation, which generates more data from existing samples by transforming them in a variety of ways, for example, rotating, translating, or scaling.
Active learning, which selects the most informative data samples for labeling by a human expert.
Transfer learning, which takes an ML model pre-trained on a related task as a starting point and fine-tunes it on new data.
Collaborative data sharing, which involves working with other researchers and organizations to collect and share data for a common goal.
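
As a rough illustration of data augmentation, the sketch below treats a tiny grayscale image as a list of pixel rows and derives extra samples by rotating and translating (shifting) it. The function names and the toy image are made up for this example; real projects typically rely on an image-processing library instead.

```python
def rotate_90(image):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def translate_right(image, shift, fill=0):
    """Shift every row `shift` pixels to the right, padding the left with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[: width - shift] for row in image]


def augment(image):
    """Return the original image plus a rotated and a shifted copy."""
    return [image, rotate_90(image), translate_right(image, 1)]


# Toy 2x3 "image" (hypothetical data).
image = [
    [1, 2, 3],
    [4, 5, 6],
]
samples = augment(image)
```

Here one labeled image becomes three training samples that share the same label.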

Data cleaning

The next step in preparing data for machine learning is cleaning it. Cleaning data involves finding and correcting errors, inconsistencies, and missing values. There are several approaches to doing that:

Handling missing data
Missing values are a common issue in machine learning. They can be handled by imputation (think: filling in missing values with predicted or estimated data), interpolation (deriving missing values from the surrounding data points), or deletion (simply removing rows or columns with missing values from a dataset).
Handling outliers
Outliers are data points that significantly differ from the rest of the dataset. Outliers can occur due to measurement errors, data entry errors, or simply because they represent unusual or extreme observations. In a dataset of employee salaries, for example, an outlier may be an employee who earns significantly more or less than others. Outliers can be handled by removing them, transforming them to reduce their impact, winsorizing them (think: replacing extreme values with the nearest values that are within the normal range of distribution), or treating them as a separate class of data.
Removing duplicates
Another step in the process of preparing data for machine learning is removing duplicates. Duplicates not only skew ML predictions but also waste storage space and increase processing time, especially in large datasets. To remove duplicates, data scientists resort to a variety of duplicate identification techniques (like exact matching, fuzzy matching, hashing, or record linkage). Once identified, duplicates can be either dropped or merged. However, in imbalanced datasets, duplicating samples of an underrepresented class can in fact be welcome, as it helps achieve a more balanced class distribution.
Handling irrelevant data
Irrelevant data is data that is not useful or applicable to solving the problem. Handling irrelevant data can help reduce noise and improve prediction accuracy. To identify irrelevant data, data teams use techniques such as principal component analysis and correlation analysis, or simply rely on their domain knowledge. Once identified, such data points are removed from the dataset.
Handling incorrect data
Data preparation for machine learning must also include handling incorrect and erroneous data. Common techniques of dealing with such data include data transformation (changing the data, so that it meets the set criteria) or removing incorrect data points altogether.
Handling imbalanced data
An imbalanced dataset is a dataset in which the number of data points in one class is significantly lower than the number of data points in another class. This can result in a biased model that prioritizes the majority class while ignoring the minority class. To deal with the issue, data teams may resort to such techniques as resampling (either oversampling the minority class or undersampling the majority class to balance the distribution of data), synthetic data generation (generating additional data points for the minority class synthetically), cost-sensitive learning (assigning higher weight to the minority class during training), ensemble learning (combining multiple models trained on different data subsets using different algorithms), and others.
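
Several of the cleaning steps above can be sketched in a few lines of plain Python. The example below shows mean imputation, winsorizing, exact-match deduplication, and random oversampling of the minority class. All helper names and sample values are hypothetical, and this is only one simple way to implement each idea; data libraries such as pandas offer ready-made equivalents.

```python
import random
from collections import Counter
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def winsorize(values, k=1):
    """Clamp the k smallest and k largest values to their nearest neighbours.

    Assumes len(values) > 2 * k.
    """
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]


def drop_exact_duplicates(rows):
    """Remove rows (dicts) that exactly match an earlier row, keeping order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def oversample_minority(rows, label_key, seed=0):
    """Randomly repeat rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical toy data.
ages = impute_mean([30, None, 40])                   # the gap becomes 35
salaries = winsorize([40, 42, 45, 47, 50, 400])      # 40 -> 42, 400 -> 50
cities = drop_exact_duplicates(
    [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}, {"id": 1, "city": "Oslo"}]
)
labeled = [{"label": "ok"}] * 4 + [{"label": "fraud"}]
class_counts = Counter(r["label"] for r in oversample_minority(labeled, "label"))
```

Note that oversampling deliberately reintroduces duplicates, so it makes sense to run it after deduplication, and only on the training data.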

These activities help ensure that the training data is accurate, complete, and consistent. Though this is a big achievement, it is not yet enough to produce a reliable ML model. So, the next step on the journey of preparing data for machine learning involves making sure the data points in the training data set conform to specific rules and standards. That stage in the data management process is referred to as data transformation.

Data transformation

During the data transformation stage, you convert raw data into a format suitable for machine learning algorithms. That, in turn, ensures higher algorithmic performance and accuracy.

Our experts in preparing data for machine learning name the following common data transformation techniques:

Scaling
In a dataset, different features may use different units of measurement. For example, a real estate dataset may include the information about the number of rooms in each property (ranging from one to ten) and the price (ranging from $50,000 to $1,000,000). Without scaling, it is challenging to balance the importance of both features. The algorithm might give too much importance to the feature with larger values — in this case, the price — and not enough to the feature with seemingly smaller values. Scaling helps solve this problem by transforming all data points in a way that makes them fit a specified range, typically, between 0 and 1. Now you can compare different variables on equal footing.
Normalization
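Another technique used in data preparation for machine learning is normalization. It is similar to scaling, and the two terms are sometimes used interchangeably. However, while scaling changes the range of a dataset, normalization changes the shape of its distribution, typically so that the values more closely resemble a normal (bell-shaped) distribution.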
Encoding
Categorical data has a limited number of values, for example, colors, car models, or animal species. Because machine learning algorithms typically work with numerical data, categorical data must be encoded in order to be used as an input. Encoding, then, means converting categorical data into a numerical format. There are several encoding techniques to choose from, including one-hot encoding, ordinal encoding, and label encoding.
Discretization
Discretization is an approach to preparing data for machine learning that transforms continuous variables, such as time, temperature, or weight, into discrete ones. Consider a dataset that contains information about people’s height. The height of each person can be measured as a continuous variable in feet or centimeters. However, for certain ML algorithms, it might be necessary to discretize this data into categories, say, “short”, “medium”, and “tall”. This is exactly what discretization does. It helps simplify the training dataset and reduce the complexity of the problem. Common approaches to discretization include clustering-based and decision-tree-based discretization.
Dimensionality reduction
Dimensionality reduction means limiting the number of features or variables in a dataset and only preserving the information relevant for solving the problem. Consider a dataset containing information on customers’ purchase history. It features the date of purchase, the item bought, the price of the item, and the location where the purchase took place. To reduce the dimensionality of this dataset, we could omit all but the most important features, say, the item purchased and its price. Dimensionality reduction can be done with a variety of techniques, including principal component analysis, linear discriminant analysis, and t-distributed stochastic neighbor embedding.
Log transformation
Another way of preparing data for machine learning, log transformation, refers to applying a logarithmic function to the values of a variable in a dataset. It is often used when the training data is highly skewed or has a large range of values. Applying a logarithmic function can help make the distribution of data more symmetric.
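
The sketch below illustrates four of the transformations described above in plain Python: min-max scaling to the 0 to 1 range, one-hot encoding, discretization into labeled bins, and a log transformation. The helper names, the height thresholds, and the sample values are assumptions chosen for illustration only; in practice, these steps are usually done with a data library.

```python
import math
from bisect import bisect_right


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode each category as a 0/1 vector, one position per category (sorted)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def discretize(values, edges, labels):
    """Map continuous values to labels; needs len(labels) == len(edges) + 1."""
    return [labels[bisect_right(edges, v)] for v in values]


def log_transform(values):
    """Apply log(1 + x), which compresses large values and is defined at 0."""
    return [math.log1p(v) for v in values]


# Hypothetical real estate and height data.
prices = min_max_scale([50_000, 525_000, 1_000_000])
colors = one_hot(["red", "green", "red"])
heights = discretize([155, 172, 190], edges=[165, 180], labels=["short", "medium", "tall"])
log_prices = log_transform([50_000, 525_000, 1_000_000])
```

After scaling, the price feature spans the same 0 to 1 range as any other scaled feature, so neither dominates simply because of its units.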

Speaking of data transformation, we should mention feature engineering, too. While it is a form of data transformation, it is more than a technique or a step in the process of preparing data for machine learning. It stands for selecting, transforming, and creating features in a dataset. Feature engineering involves a combination of statistical, mathematical, and computational techniques, including the use of ML models, to create features that capture the most relevant information in the data.

It is usually an iterative process that requires testing and evaluating different techniques and feature combinations in order to come up with the best approach to solving a problem.

Data splitting

The next step in the process of preparing data for machine learning involves dividing all gathered data into subsets, a process known as data splitting. Typically, the data is broken down into training, validation, and testing datasets.

A training dataset is used to actually teach a machine learning model to recognize patterns and relationships between input and target variables. This dataset is typically the largest.
A validation dataset is a subset of data that is used to evaluate the performance of the model during training. It helps fine-tune the model by adjusting hyperparameters (think: parameters of the training process that are set manually before training, like the learning rate, regularization strength, or the number of hidden layers). The validation dataset also helps prevent overfitting to the training data.
A testing dataset is a subset of data that is used to evaluate the performance of the trained model. Its goal is to assess the accuracy of the model on new, unseen data. The testing dataset is only used once — after the model has been trained and fine-tuned on the training and validation datasets.

By splitting the data, we can assess how well a machine learning model performs on data it hasn’t seen before. With no splitting, chances are the model would perform poorly on new data. This can happen because the model may have just memorized the data points instead of learning patterns and generalizing them to new data.

There are several approaches to data splitting, and the choice of the optimal one depends on the problem being solved and the properties of the dataset. Our experts in preparing data for machine learning say that it often requires some experimentation from the data team to determine the most effective splitting strategy. The following are the most common ones:

Random sampling, where, as the name suggests, the data is split randomly. This approach is often applied to large datasets representative of the population being modeled. Alternatively, it is used when there are no known relationships in the data that call for a more specialized approach.
Stratified sampling, where the data is divided into subsets based on class labels or other characteristics, followed by randomly sampling these subsets. This strategy is applied to imbalanced datasets with the number of values in one class significantly exceeding the number of values in others. In that case, stratified sampling helps to make sure that the training and testing datasets have a similar distribution of values from each class.
Time-based sampling, where the data collected up to a certain point in time makes up the training dataset, while the data collected after that point forms the testing dataset. This approach is used when the data has been collected over a long period of time, for instance, in financial or medical datasets, as it helps ensure that the model can make accurate predictions on future data.
Cross-validation, where the data is divided into multiple subsets, or folds. Some folds are used to train the model, while the remaining are used for performance evaluation. The process is repeated multiple times, with each fold serving as testing data at least once. There are several cross-validation techniques, for example, k-fold cross-validation and leave-one-out cross-validation. Cross-validation usually provides a more accurate estimate of the model’s performance than the evaluation on a single testing dataset.
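
The four splitting strategies above can be sketched in plain Python as follows. The helper names, split fractions, and toy records are assumptions made for this example rather than recommended settings, and ML libraries such as scikit-learn provide tested implementations of the same ideas.

```python
import random


def random_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the rows and cut them into training, validation, and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def stratified_split(rows, label_key, test_frac=0.2, seed=0):
    """Split each class separately so both sets keep similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for group in by_class.values():
        group = list(group)
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def time_based_split(rows, time_key, cutoff):
    """Train on records up to the cutoff, test on records after it."""
    train = [r for r in rows if r[time_key] <= cutoff]
    test = [r for r in rows if r[time_key] > cutoff]
    return train, test


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs; each fold is the test set once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx


# Hypothetical records: 10 "rare" and 90 "common" rows with ISO dates,
# which compare correctly as plain strings.
rows = [
    {"id": i, "label": "rare" if i < 10 else "common", "date": f"2023-{1 + i % 12:02d}-01"}
    for i in range(100)
]
train, val, test = random_split(rows)
strat_train, strat_test = stratified_split(rows, "label")
past, future = time_based_split(rows, "date", cutoff="2023-09-01")
folds = list(k_fold_indices(n_samples=10, k=5))
```

With the stratified split, the rare class makes up 10% of both the training and the testing set, mirroring its share in the full dataset.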

On a final note

Properly preparing data for machine learning is essential to developing accurate and reliable machine learning solutions. At ITRex, we understand the challenges of data preparation and the importance of having a quality dataset for a successful machine learning process.
