← Back to Descriptive Analytics

Cleaning Data

Turning Raw Data into Reliable Insights

Cleaning data is a critical step in the analytics process. Raw data often contains errors, missing values, duplicates, or inconsistencies that can lead to misleading results if not addressed. The goal of data cleaning is to prepare a dataset that is accurate, consistent, and ready for meaningful analysis.

Messy in, trustworthy out Raw data dupes · nulls bad types · typos Cleaning dedupe · impute standardize Reliable dataset consistent · accurate

Cleaning removes duplicates and errors, fills gaps, and standardizes formats before any analysis begins.

Common Data Issues

Techniques for Data Cleaning

  1. Handling Missing Data: Options include imputation (mean, median, mode), predictive modelling, or removing incomplete records.
  2. Removing Duplicates: Identify and delete redundant rows to avoid double counting.
  3. Standardizing Formats: Convert values into consistent formats (e.g., YYYY-MM-DD for dates, metric units for measurements).
  4. Correcting Errors: Fix typos, incorrect spellings, and anomalies through validation rules and cross-checking.
  5. Dealing with Outliers: Use statistical thresholds (e.g., z-scores, IQR) to detect and decide whether to keep or remove outliers.

In Python — a typical cleaning pass

Python · pandas

import pandas as pd

# 1. Remove exact duplicate orders
orders = orders.drop_duplicates(subset="order_id")

# 2. Fix data types
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders["quantity"]   = pd.to_numeric(orders["quantity"], errors="coerce")

# 3. Standardize categorical text (trim + title case)
for col in ["region", "category", "segment"]:
    orders[col] = orders[col].str.strip().str.title()

# 4. Handle missing values: drop rows missing a key, impute the rest
orders = orders.dropna(subset=["order_id", "order_date"])
orders["discount"] = orders["discount"].fillna(0)

print(orders.isna().sum())   # confirm no critical nulls remain

In SQL — cleaning in the warehouse

SQL

-- Deduplicate, standardize text, and drop invalid rows in one view
CREATE OR REPLACE VIEW staging.orders_clean AS
SELECT
  order_id,
  CAST(order_date AS DATE)              AS order_date,
  INITCAP(TRIM(region))                 AS region,
  INITCAP(TRIM(category))               AS category,
  COALESCE(discount, 0)                 AS discount,
  quantity
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id
                               ORDER BY order_date DESC) AS rn
  FROM staging.orders
) d
WHERE rn = 1                 -- keep one row per order_id
  AND order_date IS NOT NULL
  AND quantity > 0;

Tools Commonly Used

Best Practices

Example: Cleaning the Retail Orders

For the retail dataset used throughout this series, cleaning may involve:

← Back to Descriptive Analytics