← Back to Descriptive Analytics

Filtering & Reducing Noise

Keeping Only the Signal That Matters

Once data is clean, it is rarely focused. A raw table usually carries columns the analysis will never touch, rows outside the question's scope, and extreme values that pull averages and trend lines away from what is actually happening. Filtering narrows the dataset to the rows and columns relevant to the business question; reducing noise removes the random fluctuations and outliers that obscure the underlying pattern. The goal is a smaller, sharper dataset where the signal is easy to see.

Many rows in, focused signal out All rows every column · every period Filter relevance · scope outlier trim Clean signal ready to analyze

Filtering removes irrelevant rows and columns; noise reduction trims the outliers that distort the pattern.

Two Different Jobs

Detecting Outliers

  1. IQR rule: flag values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. Robust and distribution-free.
  2. Z-score: flag values more than 3 standard deviations from the mean. Works when data is roughly normal.
  3. Domain thresholds: rules the business already knows (e.g., a single order over $50,000 needs review).

Outliers are flagged, not blindly deleted — a spike can be a data error or the single most important event in the dataset. Smoothing (rolling means, see Visualization & Trending) is the gentler alternative when you want to keep every row but calm the volatility.

In Python (pandas)

Python · pandas

import pandas as pd

# Keep only the columns this analysis needs
cols = ["order_date", "region", "category", "segment",
        "sales", "quantity", "discount", "profit"]
df = orders[cols].copy()

# 1. Relevance filter — scope the question (2023 onward, real sales only)
df = df[(df["order_date"] >= "2023-01-01") & (df["sales"] > 0)]

# 2. Noise reduction — trim sales outliers with the IQR rule
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
focused = df[df["sales"].between(low, high)]

print(f"{len(orders):,} rows -> {len(df):,} relevant -> {len(focused):,} after outlier trim")

In SQL

SQL

-- Relevance filter + IQR outlier bounds in a single statement.
-- PERCENTILE_CONT / QUALIFY syntax varies slightly by engine
-- (BigQuery, Snowflake, Postgres) -- adjust to your dialect.
WITH bounds AS (
  SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sales) AS q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY sales) AS q3
  FROM orders
  WHERE order_date >= DATE '2023-01-01' AND sales > 0
)
SELECT o.*
FROM orders o, bounds b
WHERE o.order_date >= DATE '2023-01-01'
  AND o.sales > 0
  AND o.sales BETWEEN b.q1 - 1.5 * (b.q3 - b.q1)
                  AND b.q3 + 1.5 * (b.q3 - b.q1);

Tools Commonly Used

Best Practices

Example: Focusing the Retail Dataset

For the running retail example, filtering and noise reduction might involve:

← Back to Descriptive Analytics