← Back to Descriptive Analytics

Collecting Data

The Foundation of Descriptive Analytics

Collecting data is the first and most critical step in any data analytics process. Without reliable, comprehensive, and well-structured data, even the most advanced models and analyses cannot produce meaningful insights. The goal of this stage is to gather raw information that truly represents the phenomena we want to analyze.

Many sources, one analysis-ready table Databases & warehouses APIs & flat files Logs & sensors ETL extract · transform · load Raw dataset unified

Collection extracts from many sources and loads them into one unified, analysis-ready dataset.

Typical Sources of Data

ETL (Extract, Transform, Load) in Data Collection

In professional environments, data collection is often part of an ETL process:

  1. Extract: Retrieve raw data from multiple sources (databases, APIs, flat files).
  2. Transform: Clean, normalize, and format data into a consistent structure.
  3. Load: Store the processed data in a target system, such as a data warehouse or cloud platform, ready for analysis.

In Python — gathering from multiple sources

Python · pandas

import pandas as pd

# Typical collection: read each source, then concatenate into one frame
orders_db   = pd.read_sql("SELECT * FROM orders", conn)      # database
orders_csv  = pd.read_csv("data/legacy_orders.csv")          # flat file
orders_api  = pd.json_normalize(requests.get(API_URL).json())# API

orders = pd.concat([orders_db, orders_csv, orders_api], ignore_index=True)
orders["order_date"] = pd.to_datetime(orders["order_date"])
print(f"Collected {len(orders):,} rows from {orders['region'].nunique()} regions")

The companion notebook at the end of this series generates a synthetic retail dataset instead, so the whole pipeline runs anywhere with no external files:

Python · synthetic source (notebook)

import numpy as np, pandas as pd
rng = np.random.default_rng(42)

n = 20_000
orders = pd.DataFrame({
    "order_id":   np.arange(1, n + 1),
    "order_date": pd.to_datetime("2022-01-01") +
                  pd.to_timedelta(rng.integers(0, 730, n), unit="D"),
    "region":     rng.choice(["West", "East", "Central", "South"], n),
    "category":   rng.choice(["Furniture", "Office", "Technology"], n),
    "segment":    rng.choice(["Consumer", "Corporate", "Home Office"], n),
    "customer_id":rng.integers(1, 2_000, n),
    "quantity":   rng.integers(1, 12, n),
    "discount":   rng.choice([0, 0.1, 0.2, 0.3], n, p=[.5, .25, .15, .1]),
})

In SQL — extracting and unioning sources

SQL

-- Collection in the warehouse: union staged sources into one table
CREATE TABLE staging.orders AS
SELECT order_id, order_date, region, category, segment,
       customer_id, quantity, discount
FROM raw.orders_pos
UNION ALL
SELECT order_id, order_date, region, category, segment,
       customer_id, quantity, discount
FROM raw.orders_ecommerce;

Best Practices

Example: Collecting Retail Sales Data

For the retail example used throughout this series, data may come from multiple sources:

Integrating these sources through an ETL pipeline gives analysts a single, consistent dataset to clean and analyze.

← Back to Descriptive Analytics