Cleaning Data
Turning Raw Data into Reliable Insights
Cleaning data is a critical step in the analytics process. Raw data often contains errors, missing values, duplicates, or inconsistencies that can lead to misleading results if not addressed. The goal of data cleaning is to prepare a dataset that is accurate, consistent, and ready for meaningful analysis.
Cleaning removes duplicates and errors, fills gaps, and standardizes formats before any analysis begins.
Common Data Issues
- Missing Values: Incomplete records due to system errors, human input mistakes, or unavailable data.
- Duplicates: Multiple identical entries that can skew counts and averages.
- Inconsistent Formats: Different date formats, units of measure, or capitalization.
- Outliers: Extremely high or low values that may distort results if not evaluated carefully.
- Incorrect Data Types: Numeric fields stored as text or categorical values coded inconsistently.
Techniques for Data Cleaning
- Handling Missing Data: Options include imputation (mean, median, mode), predictive modelling, or removing incomplete records.
- Removing Duplicates: Identify and delete redundant rows to avoid double counting.
- Standardizing Formats: Convert values into consistent formats (e.g., YYYY-MM-DD for dates, metric units for measurements).
- Correcting Errors: Fix typos, incorrect spellings, and anomalies through validation rules and cross-checking.
- Dealing with Outliers: Use statistical thresholds (e.g., z-scores, IQR) to detect and decide whether to keep or remove outliers.
In Python — a typical cleaning pass
Python · pandas
import pandas as pd
# 1. Remove exact duplicate orders
orders = orders.drop_duplicates(subset="order_id")
# 2. Fix data types
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders["quantity"] = pd.to_numeric(orders["quantity"], errors="coerce")
# 3. Standardize categorical text (trim + title case)
for col in ["region", "category", "segment"]:
orders[col] = orders[col].str.strip().str.title()
# 4. Handle missing values: drop rows missing a key, impute the rest
orders = orders.dropna(subset=["order_id", "order_date"])
orders["discount"] = orders["discount"].fillna(0)
print(orders.isna().sum()) # confirm no critical nulls remain
In SQL — cleaning in the warehouse
SQL
-- Deduplicate, standardize text, and drop invalid rows in one view
CREATE OR REPLACE VIEW staging.orders_clean AS
SELECT
order_id,
CAST(order_date AS DATE) AS order_date,
INITCAP(TRIM(region)) AS region,
INITCAP(TRIM(category)) AS category,
COALESCE(discount, 0) AS discount,
quantity
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id
ORDER BY order_date DESC) AS rn
FROM staging.orders
) d
WHERE rn = 1 -- keep one row per order_id
AND order_date IS NOT NULL
AND quantity > 0;
Tools Commonly Used
- Python: Pandas, NumPy, Scikit-learn for imputation, cleaning, and validation.
- SQL: Data profiling, filtering, and cleaning using queries and window functions.
- Power BI & Tableau: Built-in cleaning tools in Power Query and data preparation workflows.
- SPSS: Handling missing values, recoding variables, and outlier analysis.
- ETL Platforms: Talend, Informatica, or Apache Nifi for large-scale automated cleaning.
Best Practices
- Document each cleaning step for data lineage and reproducibility.
- Automate repetitive cleaning tasks where possible.
- Always validate cleaned data against original sources.
- Balance between removing anomalies and preserving valuable information.
- Ensure compliance with data governance and privacy standards.
Example: Cleaning the Retail Orders
For the retail dataset used throughout this series, cleaning may involve:
- Removing duplicate order rows that the e-commerce export emitted twice.
- Coercing
order_dateandquantityto proper date and numeric types. - Standardizing region and category text (e.g., “west”, “West ”, “WEST” → “West”).
- Filling a missing
discountwith 0 and dropping rows with no order date.