Data Cleaning in Data Science: Procedures, Benefits & Tools
If your results are unsatisfactory when developing predictive models, then either your data or your models are at fault. Any data science application begins with the selection of appropriate data, followed by getting that data into the right format. You cannot expect your analysis to be accurate unless you are confident that the data it is based on is free of errors.
Data cleaning in data science plays a pivotal role in your analysis. It is a fundamental component of the data preparation phases of the machine learning lifecycle. Real-world data is jumbled. It contains misspelt words, incorrect values, and irrelevant or missing data. This information cannot be analysed directly.
In data science, one must complete various data-cleansing steps to ensure that the data are accurate and ready for analysis.
In this article, we will learn various data cleaning techniques in data science, such as removing duplicate and irrelevant data, standardizing data types, correcting data formats, and dealing with missing values, among others. To gain practical experience, you can experiment with publicly available online datasets.
To learn more about data cleaning techniques on real-world datasets, along with the other data pre-processing and model-building phases of the data science lifecycle, you can enrol in online data science courses. These award a Data Science Certificate after you learn to wrangle massive datasets and discover trends in data, with more than 650 expert trainers and 14 courses.
What does data cleaning entail in data science?
The process of identifying and correcting inaccurate data is known as “data cleansing.” Such data may be in the wrong format, duplicated, tainted, inaccurate, incomplete, or irrelevant. The erroneous data values can be corrected in a number of ways, and a data pipeline is used to carry out the data cleansing and validation steps of any data science project.
Each stage of a data pipeline ingests input and generates output. The primary benefit of the data pipeline is that each step is small, self-contained, and simpler to inspect. Some data pipeline systems also allow you to resume the pipeline from the middle, thus saving time. In this article, we will examine eight of the most common steps in the data cleansing process.
-Remove duplicates
-Remove irrelevant data
-Standardize capitalization
-Convert data types
-Handle outliers
-Fix errors
-Translate language
-Handle missing values
Why is data cleaning important?
As a data scientist with extensive experience, I have rarely encountered flawless data. Real-world data is noisy, rife with errors, and rarely presented in the best format. Consequently, these data points must be corrected.
According to estimates, between 80 and 90 percent of a data scientist’s time is spent on data cleaning. The first step in your workflow should be data cleansing. While working with large datasets and combining multiple data sources, you may duplicate or incorrectly classify data. Inaccurate or insufficient data will diminish the precision of your algorithms and outcomes.
As an illustration, consider data that contains a gender column. If the data is entered manually, there is a chance that the column will contain records such as “Male” and “Female,” “M” and “F,” “male” and “female,” and so on. All of these values will be treated as distinct when performing analysis on the column, even though Male, M, male, and MALE all refer to the same information. Such inconsistent formats are identified and corrected during the data cleansing process.
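As a quick illustration, here is a minimal sketch of how such a column could be standardized with pandas; the DataFrame and the gender values are hypothetical, not taken from any dataset mentioned in this article.

```python
import pandas as pd

# Hypothetical survey data with inconsistent gender labels
df = pd.DataFrame({"gender": ["Male", "male", "MALE", "M", "Female", "F"]})

# Map every variant onto one canonical label before analysis
canonical = {"male": "male", "m": "male", "female": "female", "f": "female"}
df["gender"] = df["gender"].str.strip().str.lower().map(canonical)

print(df["gender"].value_counts())
```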
Consider a second scenario in which you are conducting a promotional campaign and have gathered data from various individuals. The information you gathered includes the individual’s name, phone number, email address, age, gender, and other details. If you intend to contact these individuals via cell phone or email, you must ensure that their contact information is accurate.
The contact number should be a 10-digit numeric field, and the email should adhere to a predetermined format. There may also be entries that include no contact information or physical address at all. These entries are irrelevant and serve no purpose, so you would want to eliminate them. Data cleaning finds and fixes such problems before the data can be used for analysis or other purposes.
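One way to apply these checks is sketched below with pandas; the column names (phone, email), the sample values, and the deliberately simple email pattern are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical campaign data
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera"],
    "phone": ["9876543210", "98765", "9123456789"],
    "email": ["asha@example.com", "not-an-email", "meera@example.com"],
})

# A contact number must be exactly 10 digits; an email must match a basic pattern
valid_phone = df["phone"].str.fullmatch(r"\d{10}")
valid_email = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Keep only rows whose contact details can actually be used
df = df[valid_phone & valid_email]
print(df)
```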
Step 1: Get Rid of Duplicates
Duplicate values are likely to be in your data if you are working with large datasets, multiple data sources, or if you don’t do quality checks before adding an entry.
These duplicate values add redundancy to your data and may make your calculations inaccurate. If a dataset contains duplicate serial numbers, for example, the estimated number of products will be higher than the actual number.
Duplicate email addresses or mobile phone numbers may make your message appear more like spam. We deal with these duplicate records by keeping only one copy of each unique observation in our database.
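A minimal sketch of deduplication with pandas follows; the DataFrame and the serial_no column are hypothetical examples used to illustrate the idea.

```python
import pandas as pd

# Hypothetical sales records with a repeated serial number
df = pd.DataFrame({
    "serial_no": ["A101", "A102", "A101", "A103"],
    "product":   ["mixer", "kettle", "mixer", "toaster"],
})

print(df.duplicated().sum())           # how many exact duplicate rows exist

# Keep only one copy of each unique observation
df = df.drop_duplicates()

# Or deduplicate on a key column such as the serial number
df = df.drop_duplicates(subset="serial_no", keep="first")
print(df)
```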
Step 2: Remove Irrelevant Data
Imagine that you are evaluating the after-sales service of a product. You get data that contains various fields like service request date, unique service request number, product serial number, product type, product purchase date, etc.
While these fields appear to be pertinent, the data may also include fields such as attended by (the name of the person who handled the service request), the location of the service centre, customer contact information, etc., which may not serve our purpose if we were to analyse the expected servicing duration for a product. In such situations, we remove fields that are irrelevant to our work. This is the initial, column-level validation.
The row-level checks are the next step. Assume the customer visited the service centre and was instructed to return in three days to pick up the repaired product. In this instance, we will also assume that there are two separate records in the data that represent the same service number.
The service type for the first record is “first visit,” while the second record’s service type is “pickup.” Since both records represent the same service request number, it is likely that we will eliminate one of them. For our problem statement, we need the first occurrence of the record or those that match the “first visit” service type.
To get good results from data science, we need to understand the data and the problem statement so we can get rid of useless data.
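Both the column-level and row-level checks can be expressed concisely in pandas. The sketch below uses hypothetical column names (service_request_no, service_type, attended_by, service_centre) to mirror the scenario above.

```python
import pandas as pd

# Hypothetical after-sales service data
df = pd.DataFrame({
    "service_request_no": [501, 501, 502],
    "product_serial_no":  ["A101", "A101", "B202"],
    "service_type":       ["first visit", "pickup", "first visit"],
    "attended_by":        ["R. Nair", "R. Nair", "S. Rao"],
    "service_centre":     ["Pune", "Pune", "Delhi"],
})

# Column-level check: drop fields that do not help estimate servicing duration
df = df.drop(columns=["attended_by", "service_centre"])

# Row-level check: keep only the "first visit" record for each service request
df = df[df["service_type"] == "first visit"]
print(df)
```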
Step 3: Standardize Capitalization
You must ensure that the text in your data is consistent. If your capitalization is inconsistent, it could result in the creation of many false categories.
For example, the column names “Total Sales” and “total sales” are different because most programming languages pay attention to case.
To avoid confusion and maintain uniformity among the column names, we should follow a standard way of naming columns. The most commonly preferred conventions are snake case and cobra case.
Cobra case is a writing style in which the first letter of each word is written in uppercase and each space is replaced by an underscore (_) character. In snake case, by contrast, the letters are written in lowercase and each space is replaced with an underscore.
The column name “Total Sales” can therefore be written as “Total_Sales” in cobra case and “total_sales” in snake case. Not only do the column titles need to be fixed; so does the capitalization of the data points themselves.
For instance, when collecting names and email addresses via online forms, surveys, or other means, we may receive responses in various formats.
To prevent entries that differ only in capitalization from being treated as distinct, we can standardize them. The email IDs “MyEmail@hostname.com” and “myemail@hostname.com” can be interpreted as separate values; therefore, it is preferable to make all email ID values in the field lowercase.
Likewise, for fields such as names, we can use title case, with the first letter of each word capitalized.
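A minimal pandas sketch of both fixes is shown below; the column names and sample values are assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Total Sales": [120, 80],
    "Customer Email": ["Asha@Example.COM", "RAVI@example.com"],
})

# Rename columns to snake case: lowercase, with spaces replaced by underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Lowercase email IDs so values differing only in case are not treated as distinct
df["customer_email"] = df["customer_email"].str.lower()

print(df.columns.tolist())   # ['total_sales', 'customer_email']
print(df["customer_email"])
```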
Step 4: Convert Data Types
When we work with CSV data in Python, Pandas will try to figure out the types for us. Most of the time, it works, but sometimes we’ll need to help it out.
Text, numeric, and date data types are the most prevalent types of data found in the data. Text data types can accommodate any mixed value, including letters, numbers, and special characters. Text data types include a person’s name, product type, store location, email address, and password, among others.
Numeric data types contain integer or float values, also known as numbers with a decimal point. Having a column with a numeric data type allows you to perform mathematical calculations, such as calculating the minimum, maximum, average, and median, or analysing the distribution using a histogram, box plot, q-q plot, etc.
You cannot perform this numerical analysis if a column of numbers is stored as a text column. So, if the data types are not already in the right formats, they need to be converted.
The monthly sales of a store, the price of a product, the amount of electricity consumed, etc. are examples of numeric columns. It is important to note, however, that columns such as a numeric ID or phone number should not be represented as numeric columns but as text columns.
Despite the fact that they represent numeric values, operations such as minimum and average on these columns provide no useful information. Therefore, text columns should be used to represent these columns.
If the date format is not correctly identified, the column will be read as a string or text column. In such situations, the data type of the column and the date format used in the data must be specified explicitly (a conversion sketch follows this list). The same date may be represented in any of the following ways:
-October 2, 2023
-2023/10/02
-2-Oct-2023
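Here is a minimal sketch of explicit type conversion with pandas; the column names, sample values, and the %d-%b-%Y format string are assumptions chosen to match one of the date styles listed above.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id":   ["0012", "0013"],          # keep IDs as text, not numbers
    "price":      ["199.5", "250"],
    "order_date": ["2-Oct-2023", "5-Oct-2023"],
})

# Convert the price column from text to a numeric type
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Parse dates by stating the format explicitly instead of relying on inference
df["order_date"] = pd.to_datetime(df["order_date"], format="%d-%b-%Y")

print(df.dtypes)
```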
Step 5: Handling Outliers
In statistics, an outlier is a data point that significantly deviates from the norm. An outlier may reflect measurement variability or indicate an experimental error; the latter is sometimes removed from the data set.
For example, let us consider pizza prices in a region. After surveying 500 restaurants in the region, it was determined that pizza prices range from INR 100 to INR 7,500. However, analysis revealed that there is only one record in the dataset with a pizza price of INR 7,500, while the rest range from INR 100 to INR 1,500.
Therefore, the observation of a pizza price of INR 7,500 is an outlier because it deviates significantly from the population. Typically, a box plot or scatter plot is used to identify these outliers.
These outliers skew the data. Some models assume that the data follows a normal distribution, and outliers can hurt model performance if the data is skewed; therefore, outliers must be dealt with before the data is used for model training. There are two common approaches, illustrated in the sketch after this list:
-Remove the observations that contain outlier values.
-Apply a transformation such as log, square root, or Box-Cox so that the values follow a normal or nearly normal distribution.
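The sketch below flags outliers with the common 1.5 × IQR rule and shows both options; the pizza-price values are hypothetical, and the IQR rule is one of several possible detection methods rather than something this article prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical pizza prices, with one extreme value
prices = pd.Series([150, 220, 300, 450, 800, 1200, 1500, 7500])

# Flag outliers with the 1.5 * IQR rule
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
print(prices[outliers])            # the INR 7,500 observation stands out

# Option 1: drop the outlying observations
cleaned = prices[~outliers]

# Option 2: apply a log transform to pull in the long tail
log_prices = np.log1p(prices)
print(cleaned.describe(), log_prices.head(), sep="\n")
```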
The Bootcamp for Data Science teaches these and additional data cleaning and manipulation techniques. Six capstone projects and over 280 hours of on-demand, self-paced learning will help you develop your programming and analytic skills as you gain confidence as a data scientist under the guidance of expert professionals.
Step 6: Fix Errors
Inaccuracies in your data may cause you to overlook the most important findings, so potential errors in your data must be corrected. Systems that rely solely on manual data entry without data validation will almost always contain errors. To fix them, we must first gain a thorough understanding of the data; after that, we can define logic or examine the data and fix any errors accordingly. Consider the following cases as illustrations (a code sketch follows the list):
-Removing the country code from the mobile field so that all values are exactly 10 digits long.
-Removing any units mentioned in columns such as weight or height so they can be converted into numeric fields.
-Identifying values in the wrong format, such as malformed email addresses, and either fixing or removing them.
-Performing validation checks, such as ensuring that the purchase date is later than the manufacturing date, that the total amount equals the sum of its component amounts, and that fields contain no disallowed punctuation or special characters.
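A minimal pandas sketch of a few of these fixes follows; the column names (mobile, weight, purchase_date, mfg_date) and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical order data with common entry errors
df = pd.DataFrame({
    "mobile":        ["+919876543210", "9123456789"],
    "weight":        ["12 kg", "7kg"],
    "purchase_date": pd.to_datetime(["2023-05-10", "2023-01-02"]),
    "mfg_date":      pd.to_datetime(["2023-04-01", "2023-02-15"]),
})

# Keep only the last 10 digits of the mobile number (drops country codes)
df["mobile"] = df["mobile"].str.replace(r"\D", "", regex=True).str[-10:]

# Strip units so weight becomes a numeric field
df["weight"] = pd.to_numeric(df["weight"].str.replace(r"[^\d.]", "", regex=True))

# Validation check: the purchase date must not be earlier than the manufacturing date
invalid = df["purchase_date"] < df["mfg_date"]
print(df[invalid])
```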
Step 7: Language Translation
Datasets for machine translation are frequently compiled from a variety of sources, which can lead to linguistic inconsistencies. Typically, data evaluation software employs Natural Language Processing (NLP) models that are monolingual and unable to process multiple languages. You must therefore translate everything into one language. There are a few AI models for language translation that can be used for the task.
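As one possible approach (the article does not prescribe a specific tool), the sketch below uses the Hugging Face transformers library with the open-source Helsinki-NLP/opus-mt-fr-en model to translate French records into English; the library choice, model choice, and sample sentences are all assumptions.

```python
# A sketch under the assumption that the transformers library and the
# Helsinki-NLP/opus-mt-fr-en checkpoint are available.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

# Hypothetical customer reviews collected in French
reviews = ["Le produit est arrivé cassé.", "Livraison rapide, très satisfait."]
translated = [translator(text)[0]["translation_text"] for text in reviews]
print(translated)
```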
Step 8: Handle Missing Values
During cleaning and munging in data science, handling missing values is one of the most common tasks. The real-life data might contain missing values that need to be fixed before they can be used for analysis. We can handle missing values by
either removing the records that contain missing values or filling in the missing values with a suitable statistical technique, after first gathering and understanding the data.
A rule of thumb is that you can drop the missing values if they make up less than five percent of the total number of records, but it depends on the analysis, the importance of the missing values, the size of the data, and the use case we are working on.
Consider a dataset that contains certain health parameters like glucose, blood pressure, insulin, BMI, age, diabetes, etc. The goal is to create a supervised classification model that predicts if the person is likely to have diabetes or not based on the health parameters.
If the data has missing values for glucose and blood pressure columns for a few individuals, there is no way we can fill these values through any technique. And if these two columns are a good indicator of whether or not a person has diabetes, we should try to get rid of these observations from our records.
Consider another dataset in which we have information about the labourers working on a construction site. Suppose the gender column in this dataset is missing around 30 percent of its values. We cannot drop 30 percent of the observations, but on further digging we find that, among the remaining 70 percent of observations, around 90 percent of the records are male. Therefore, we can choose to fill these missing values with the male gender.
By doing this, we have made an assumption, but it can be a safe one, because the labourers working on the construction site are predominantly male and the data suggests the same. In this case, we have used a measure of central tendency called the mode. Missing values in a numerical field can also be filled using the mean or the median, depending on whether the field values follow a Gaussian distribution.
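Both strategies can be sketched in pandas as follows; the two DataFrames and their column names are hypothetical stand-ins for the diabetes and construction-site examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical health and labour datasets with missing values
health = pd.DataFrame({"glucose": [148, np.nan, 120], "blood_pressure": [72, 64, np.nan]})
labour = pd.DataFrame({"gender": ["male", "male", np.nan, "female", "male", np.nan]})

# Drop records where a critical predictor such as glucose or blood pressure is missing
health = health.dropna(subset=["glucose", "blood_pressure"])

# Fill a mostly-male gender column with its mode instead of dropping 30% of the rows
labour["gender"] = labour["gender"].fillna(labour["gender"].mode()[0])

# For numeric fields, mean or median imputation could be used instead, e.g.:
# health["glucose"] = health["glucose"].fillna(health["glucose"].median())
print(health, labour, sep="\n")
```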
Data Cleaning Tools
-Microsoft Excel (a common data cleaning tool)
-Programming languages (Python, Ruby, SQL)
-Data visualisations (to spot errors in your dataset)
-Dedicated data-cleaning software (OpenRefine, Trifacta, etc.)
Benefits of Data Cleaning in Data Science
Your analysis will be reliable and free of bias if the data you collect is clean and correct. We have looked at eight steps for data cleansing in data science. Let us now discuss some of the benefits of data cleaning in data science.
-Avoiding mistakes: If your data cleaning techniques work, your analysis results will be accurate and consistent.
-Maintaining data quality: Cleaning the data keeps its quality high and enables more accurate analytics that support the decision-making process as a whole.
-Avoiding unnecessary costs and errors: Keeping track of errors and improving reporting to identify where errors originate makes it easier to correct inaccurate or incorrect data in the future.
-Staying organised
-Improved mapping
Conclusion
Data cleaning in data science is a crucial stage in every data analysis task, regardless of whether it is a basic, arithmetic-based quantitative analysis or you are using machine learning for big data applications. This article should help you get started with data cleaning in data science so that you avoid working with inaccurate data. Although cleaning your data can occasionally take a while, skipping this step will cost you more than just time.
You want the data to be clean before you start your research, since unclean data can cause a whole variety of problems and biases in your results. To know more about how to perform data processing tasks like data cleaning, data acquisition, data munging, etc.,
you can check out Knowledge Hut’s Data Science Certificate online. It offers the chance to learn from experts with real-world experience in data analytics, engineering, and science. With more than 18 courses curated by more than 650 trained experts, you can build the skills to design data science and machine learning models.