The Part Nobody Talks About in Data Science
Everyone wants to build machine learning models. Everyone wants to predict outcomes, run algorithms, and impress recruiters with accuracy scores. But here is what most data science courses skip over quietly: before any of that happens, someone has to clean the data.
And that someone, most of the time, is you.
Professional data scientists spend anywhere between 60 to 80 percent of their working time cleaning and preparing data. Not modeling. Not visualizing. Cleaning. A study from IBM estimated that poor data quality costs the United States economy around 3.1 trillion dollars every year. That number exists because dirty data produces wrong answers, and wrong answers in business mean wrong decisions.
If you are a student preparing for data science interviews or your first job, learning how to clean data properly is not just useful. It is essential. Interviewers at companies like Amazon, Flipkart, and mid-size Indian startups regularly ask candidates to clean a dataset on the spot during technical rounds. Most candidates struggle not because they lack algorithm knowledge, but because they have never actually dealt with real-world messy data before.
This guide changes that.
What Is Messy Data and Where Does It Come From
Messy data is any dataset that cannot be used directly for analysis or modeling. It contains errors, inconsistencies, missing values, or formatting problems that would produce misleading results if left uncleaned.
Real-world data comes from many sources: web forms that users fill carelessly, sensors that sometimes fail, databases that were merged without proper validation, or spreadsheets that different people edited differently over time. None of these sources produce clean data automatically.
A dataset about customer orders might have some entries where the phone number field contains an email address. A hospital record might have a patient’s age listed as 350. A sales file might have the same city spelled five different ways: “Mumbai”, “mumbai”, “MUMBAI”, “Bombay”, and “Mum bai”. All of these are real problems that appear in industry datasets regularly.
Understanding where messy data comes from helps you anticipate problems before you even look at the file.
Setting Up Your Environment
Before writing a single line of cleaning code, you need the right tools. For most data science work in Python, two libraries handle almost everything:
import pandas as pd
import numpy as npPandas is your primary tool for data manipulation. NumPy handles numerical operations that Pandas relies on internally. Install them with:
pip install pandas numpyThroughout this guide, every example uses these two libraries. If you have not worked with Pandas before, treat this guide as your first real introduction to it in a practical context.
Step 1: Load the Data and Take a First Look
The first thing a professional does when receiving a new dataset is spend time understanding what they are working with before trying to fix anything.
df = pd.read_csv("sales_data.csv")
# Check the shape
print(df.shape)
# First five rows
print(df.head())
# Column names and data types
print(df.info())
# Basic statistics
print(df.describe())These four commands tell you almost everything you need to know at the start. The shape tells you how many rows and columns exist. The head shows you what the data actually looks like. The info shows you which columns have missing values and what data type each column is stored as. The describe gives you the statistical range of numerical columns, which immediately reveals outliers.
A professional data scientist never skips this step. It saves hours of debugging later.
Step 2: Handle Missing Values the Right Way
Missing values are the most common problem in any real dataset. Pandas represents them as NaN (Not a Number). The wrong approach is to drop every row with a missing value and move on. That approach destroys valuable data and introduces bias into your analysis.
First, find out how many values are missing in each column:
print(df.isnull().sum())Now you have a clear picture. The next decision depends on the context of each column.
Dropping rows: Only do this when very few rows have missing values and those rows are not representative of anything important.
df.dropna(subset=["critical_column"], inplace=True)Filling with mean or median: Use this for numerical columns when the missing values are random. Median is more reliable than mean when your data has outliers.
df["age"].fillna(df["age"].median(), inplace=True)Filling with mode: Use this for categorical columns like city names or product categories.
df["city"].fillna(df["city"].mode()[0], inplace=True)Forward fill: Useful for time series data where the previous value is a reasonable estimate for a missing one.
df["temperature"].fillna(method="ffill", inplace=True)The key insight here is that there is no single correct method. A professional chooses the technique based on what the data represents and why the values are missing. This reasoning is exactly what interviewers want to hear when they ask you about handling missing data.
Step 3: Fix Incorrect Data Types
This is a problem that silently breaks everything downstream if you do not catch it early. A column that stores dates as plain text will not sort correctly. A column that stores prices as strings cannot be used in calculations.
# Check current types
print(df.dtypes)
# Convert a date column stored as text
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
# Convert a price column stored as string
df["price"] = pd.to_numeric(df["price"], errors="coerce")The errors="coerce" parameter is important. It tells Pandas that if a value cannot be converted, replace it with NaN instead of throwing an error. This prevents your entire cleaning process from crashing on a single bad value.
Step 4: Remove Duplicate Records
Duplicate rows appear in datasets more often than you would expect. When two databases are merged, when a form is submitted twice, or when data is imported multiple times without deduplication checks, you end up with records that count the same transaction or event more than once.
python
# Check how many duplicates exist
print(df.duplicated().sum())
# Remove them
df.drop_duplicates(inplace=True)Sometimes duplicates are not exact. The same customer might appear twice with slightly different spellings of their name but the same phone number. Handling near-duplicates requires more careful judgment, often involving grouping by a reliable unique identifier like phone number or email rather than name.
Step 5: Standardize Text and Categories
This is one of the most overlooked steps among beginners and one of the most important in real projects.
When humans enter data manually, they are inconsistent. “Male”, “male”, “M”, “MALE” should all represent the same value, but a program treating them as text will see four completely different categories.
# Lowercase everything
df["gender"] = df["gender"].str.lower().str.strip()
# Replace known variations
df["gender"] = df["gender"].replace({
"m": "male",
"f": "female",
"fem": "female"
})For city names or product names, you often need to build a mapping dictionary that accounts for all the variations present in your specific dataset. This cannot be fully automated. A professional looks at the unique values in a column and decides which should map to which.
print(df["city"].value_counts())This one line shows you every unique value in a column and how often it appears. It is your best friend when standardizing text data.
Step 6: Detect and Handle Outliers
An outlier is a data point that is significantly different from the rest. Some outliers are genuine: a customer who placed an order worth ten lakh rupees is unusual but real. Other outliers are errors: an age field showing 999 is clearly a data entry mistake.
The simplest method to detect outliers in numerical columns is to look at the range from your describe output. If the maximum value of an age column is 150 and the minimum is 0, you already know something is wrong.
A more systematic approach uses the IQR method:
Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df["salary"] < lower_bound) | (df["salary"] > upper_bound)]
print(outliers)Once you have identified outliers, you have three options. Remove them if they are clearly errors. Cap them at the boundary values if they are extreme but possibly real. Keep them if they represent genuine rare events that your analysis needs to account for.
Again, the right choice depends on the problem you are solving, not on a fixed rule.
Step 7: Rename Columns for Clarity
This step takes two minutes and makes every subsequent line of code easier to write and read.
df.rename(columns={
"Cust_Nm": "customer_name",
"Ord_Dt": "order_date",
"Sal_Amt": "sale_amount"
}, inplace=True)Professional data scientists use lowercase column names with underscores. No spaces, no special characters, no abbreviations that only the original data creator would understand. Clean column names make collaboration easier and reduce bugs.
Step 8: Validate Your Cleaned Data
After all the cleaning steps, take one more look at your dataset before moving forward.
# Check missing values again
print(df.isnull().sum())
# Check shape to see how many rows remain
print(df.shape)
# Check data types again
print(df.dtypes)
# Check value counts for categorical columns
print(df["gender"].value_counts())This final validation step catches anything you missed and confirms that your transformations worked as intended. Professional data scientists treat this as a non-negotiable checkpoint.
A Complete Data Cleaning Function You Can Reuse
Here is a basic template that covers the most common cleaning tasks in a single reusable function. Adapt it for your projects:
def clean_dataset(df):
# Remove duplicates
df.drop_duplicates(inplace=True)
# Strip whitespace from string columns
for col in df.select_dtypes(include="object").columns:
df[col] = df[col].str.strip().str.lower()
# Fill missing numerical values with median
for col in df.select_dtypes(include="number").columns:
df[col].fillna(df[col].median(), inplace=True)
# Fill missing categorical values with mode
for col in df.select_dtypes(include="object").columns:
df[col].fillna(df[col].mode()[0], inplace=True)
return df
df = clean_dataset(df)
print("Cleaning complete. Shape:", df.shape)This is not a solution you would use blindly in every project. It is a starting point that handles the basics automatically, which you then customize based on what the specific dataset needs.
What Interviewers Actually Ask About Data Cleaning
Technical interviews for data science roles at Indian companies regularly include questions on this exact topic. Here are the ones you should be prepared for:
“How do you handle missing values?” – Do not just say “I drop them.” Explain that your approach depends on how much data is missing, why it is missing, and what the column represents. Show that you think before acting.
“What is the difference between mean imputation and median imputation?” – Mean is sensitive to outliers. If your column has extreme values, mean imputation will pull the filled values in a misleading direction. Median is more robust for skewed data.
“How would you detect outliers in a dataset?” – Mention both the IQR method and visual techniques like box plots. Show that you know more than one approach.
“Have you ever worked with real messy data? What challenges did you face?” – This is where your projects matter. If you have cleaned actual datasets from Kaggle, government open data portals, or your own scraped data, talk about a specific problem you encountered and how you solved it.
“What is the risk of removing too many rows during cleaning?” – The answer is data loss and potential bias. If the rows you remove are not random, the cleaned dataset may no longer represent the original population accurately.
Prepare specific answers to all of these before walking into any data science interview.
Where to Find Real Messy Datasets to Practice On
Reading about data cleaning is one thing. Actually doing it on real data is something else. Here are the best places to find datasets that will challenge you:
Kaggle (kaggle.com) has hundreds of real-world datasets across different industries. Many of them are deliberately messy because they come from actual company data.
data.gov.in is India’s open government data portal. Datasets on agriculture, health, transport, and education are available here and most of them require significant cleaning before use.
UCI Machine Learning Repository has classic datasets that are widely used in academic research and interview preparation.
Start with a dataset that interests you. The more relevant it feels to your life, the more motivated you will stay through the frustrating parts of cleaning it.
The Mindset That Separates Good Data Scientists from Average Ones
Technical skills are learnable. Everyone can memorize a fillna command. What separates strong data scientists from average ones is the habit of asking why before asking how.
Before filling a missing value, ask why it is missing. Before removing an outlier, ask whether it reveals something important about your data. Before standardizing a category, ask whether the variation itself is meaningful.
Messy data is not just a technical problem. It is an information problem. Every inconsistency in a dataset has a reason, and understanding that reason helps you clean the data in a way that preserves its truth rather than distorting it.
That kind of thinking is what companies are really hiring for when they say they want a “data scientist.” They want someone who cleans data with judgment, not just with code.
Conclusion
Data cleaning is not the glamorous part of data science. It is the part that makes everything else possible. Students who invest time in learning it properly arrive at interviews with something most candidates lack: real experience with the messiness of actual data.
Start with a dataset that feels relevant to you. Use the steps in this guide. Make mistakes, debug them, and build the instinct that comes only from doing the work.
By the time you walk into your first data science interview, data cleaning should feel like second nature. And when an interviewer slides a messy CSV across the table and asks you to make sense of it, you will know exactly where to begin.
Call To Action
Take the next step toward a successful career in data science.
Enroll now in the Data Science course near Noida Sector 62.
Contact Details
Website https://www.tuxacademy.org
Phone +91 7982029314
Email info@tuxacademy.org
Visit the nearest center or book a free counseling session.
Our Location:
Data Science Course
Geetanjali Mehra Expert AI and Data Science Mentor at TuxAcademy
Data Science Course Training in Chennai
Data Science Course Training in Mumbai
Data Science Course in New Delhi
Data Science Course in Noida
Data Science Training Course in Delhi
Data Science Training Course in Greater Noida
Data Science Training Course in Noida
Data Science Course Training in Bengaluru
Data Science Training Course in Delhi NCR
Data Science Course Near Me
Data Science Course in Greater Noida West
Data Science Course in Noida Sector 62
Data Science Course in Delhi Laxmi Nagar

