Collecting Data
The Foundation of Descriptive Analytics
Collecting data is the first and most critical step in any data analytics process. Without reliable, comprehensive, and well-structured data, even the most advanced models and analyses cannot produce meaningful insights. The goal of this stage is to gather raw information that truly represents the phenomena we want to analyze.
Collection extracts from many sources and loads them into one unified, analysis-ready dataset.
Typical Sources of Data
- Databases: SQL databases such as MySQL, PostgreSQL, Oracle, and enterprise data warehouses.
- APIs: REST or GraphQL APIs providing data from platforms like Twitter, Google Analytics, or financial services.
- Spreadsheets: Excel or CSV files often used in business reporting and small-scale analytics.
- Surveys & Questionnaires: Data collected directly from users or customers.
- System Logs: Information captured from web servers, applications, or IoT devices.
- Open Data Portals: Government datasets (e.g., data.gov, WHO data) and academic repositories.
- Third-Party Data Providers: Commercial data providers for financial, demographic, or market data.
ETL (Extract, Transform, Load) in Data Collection
In professional environments, data collection is often part of an ETL process:
- Extract: Retrieve raw data from multiple sources (databases, APIs, flat files).
- Transform: Clean, normalize, and format data into a consistent structure.
- Load: Store the processed data in a target system, such as a data warehouse or cloud platform, ready for analysis.
In Python — gathering from multiple sources
Python · pandas
import pandas as pd
# Typical collection: read each source, then concatenate into one frame
orders_db = pd.read_sql("SELECT * FROM orders", conn) # database
orders_csv = pd.read_csv("data/legacy_orders.csv") # flat file
orders_api = pd.json_normalize(requests.get(API_URL).json())# API
orders = pd.concat([orders_db, orders_csv, orders_api], ignore_index=True)
orders["order_date"] = pd.to_datetime(orders["order_date"])
print(f"Collected {len(orders):,} rows from {orders['region'].nunique()} regions")
The companion notebook at the end of this series generates a synthetic retail dataset instead, so the whole pipeline runs anywhere with no external files:
Python · synthetic source (notebook)
import numpy as np, pandas as pd
rng = np.random.default_rng(42)
n = 20_000
orders = pd.DataFrame({
"order_id": np.arange(1, n + 1),
"order_date": pd.to_datetime("2022-01-01") +
pd.to_timedelta(rng.integers(0, 730, n), unit="D"),
"region": rng.choice(["West", "East", "Central", "South"], n),
"category": rng.choice(["Furniture", "Office", "Technology"], n),
"segment": rng.choice(["Consumer", "Corporate", "Home Office"], n),
"customer_id":rng.integers(1, 2_000, n),
"quantity": rng.integers(1, 12, n),
"discount": rng.choice([0, 0.1, 0.2, 0.3], n, p=[.5, .25, .15, .1]),
})
In SQL — extracting and unioning sources
SQL
-- Collection in the warehouse: union staged sources into one table
CREATE TABLE staging.orders AS
SELECT order_id, order_date, region, category, segment,
customer_id, quantity, discount
FROM raw.orders_pos
UNION ALL
SELECT order_id, order_date, region, category, segment,
customer_id, quantity, discount
FROM raw.orders_ecommerce;
Best Practices
- Define clear data requirements aligned with the business question.
- Ensure data quality by validating formats, completeness, and accuracy.
- Document data lineage: the origin, transformations, and flow of the data.
- Respect data privacy and compliance (e.g., GDPR, HIPAA).
- Automate data collection pipelines to reduce human error and improve efficiency.
Example: Collecting Retail Sales Data
For the retail example used throughout this series, data may come from multiple sources:
- A point-of-sale database (in-store transactions).
- An e-commerce platform export (online orders).
- A product catalogue providing category and pricing details.
- A customer table providing segment and region.
Integrating these sources through an ETL pipeline gives analysts a single, consistent dataset to clean and analyze.