Data Analytics Library — A Reference Toolkit for the Analytics Lifecycle
Python Library · pandas · scikit-learn · Tested & CI · June 2026
A small, heavily-documented Python library that implements the standard stages of a
data-analytics project as composable modules. Every stage takes and/or returns a pandas
DataFrame, so the modules chain cleanly into a single pipeline. It is written as a
reference base and a practical starting point — each module and every non-obvious
line is commented in English — and it is the code companion to the
Data Analytics: Tackling the Data series.
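Because every stage takes and returns a DataFrame, stages can be chained with pandas' own DataFrame.pipe. The sketch below illustrates that idea only: strip_whitespace, add_margin, and the column names are hypothetical stand-ins, not functions or fields from this library.

```python
import pandas as pd


def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stage: trim whitespace in the 'region' column, on a copy."""
    out = df.copy()
    out["region"] = out["region"].str.strip()
    return out


def add_margin(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stage: derive a profit-margin column, on a copy."""
    out = df.copy()
    out["margin"] = out["profit"] / out["sales"]
    return out


# Illustrative input; these column names are assumptions for the sketch.
raw = pd.DataFrame(
    {
        "region": [" North", "South "],
        "sales": [200.0, 100.0],
        "profit": [50.0, 10.0],
    }
)

# DataFrame.pipe passes the frame as the first argument of each stage,
# so DataFrame-in / DataFrame-out functions compose left to right.
result = raw.pipe(strip_whitespace).pipe(add_margin)
print(result)
```

Since each stage works on a copy, raw is left unchanged after the chain runs, which is what makes a pipeline like this safe to re-run.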
The Analytics Lifecycle It Implements
Ten modules covering ingestion through machine learning — each answering one question:
- Data Loading (dataloading): CSV, text, Excel, JSON, Parquet, and SQL into a DataFrame.
- Streaming Ingestion (datastreaming): Kafka / RabbitMQ / simulated streams consumed as micro-batch DataFrames behind one common interface.
- Data Cleansing (datacleansing): column names, whitespace, missing values, duplicates, types, outliers.
- Data Exploration (dataexploration): numeric/categorical summaries and correlations (EDA).
- Data Visualization (datavisualization): basic and advanced charts (matplotlib / seaborn).
- Descriptive (descriptiveanalysis): aggregation, KPIs, and trends (what happened?).
- Diagnostic (diagnosticanalysis): correlation-to-target, t-test, chi-square, and drill-down (why did it happen?).
- Predictive (predictiveanalysis): regression, classification, and clustering (what will happen?).
- Prescriptive (prescriptiveanalysis): scenarios, rules, and budget optimization (what should we do?).
- Machine Learning (machinelearning): reusable split / preprocess / pipeline / cross-validate / persist.
Design Principles
- One standard currency: every loader returns a pandas DataFrame; every transformer takes and returns one.
- Non-mutating transformers: cleansing functions copy() their input, so pipelines are reproducible and side-effect free.
- Fail loudly: invalid input raises a clear exception instead of silently returning None.
- Optional dependencies stay optional: plotting, SQL, and stats imports are lazy and raise a helpful message if a package is missing.
- Separation of concerns: numbers (dataexploration) and charts (datavisualization) live in different modules.
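Two of these principles, lazy optional imports and failing loudly, can be sketched with the standard library alone. This is one possible shape, not the library's actual code: require, check_strategy, and the ALLOWED_STRATEGIES set are hypothetical names chosen for illustration.

```python
import importlib
from types import ModuleType

# Hypothetical set of accepted values; the library's real options may differ.
ALLOWED_STRATEGIES = {"mean", "median", "drop"}


def require(package: str, purpose: str) -> ModuleType:
    """Import an optional dependency on first use, or explain how to get it."""
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        raise ImportError(
            f"'{package}' is required for {purpose}. "
            f"Install it with: pip install {package}"
        ) from exc


def check_strategy(strategy: str) -> str:
    """Fail loudly: reject an unknown strategy instead of returning None."""
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(
            f"Unknown strategy {strategy!r}; expected one of "
            f"{sorted(ALLOWED_STRATEGIES)}"
        )
    return strategy


# The import happens only when the feature that needs it is called.
json_module = require("json", "JSON loading")
print(check_strategy("median"))
```

Deferring the import to call time means a user who never plots or queries SQL never needs those packages installed, while a user who does gets an actionable message instead of a bare traceback.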
Engineering & Quality
- Tested: a pytest suite of 81 tests, one file per module; plotting tests run headless via the Agg backend.
- CI: GitHub Actions runs the suite on Python 3.10–3.12 and executes the demo notebook on every push.
- Packaged: pinned requirements.txt and a pyproject.toml; isolated venv setup documented in the README.
- Reproducible data: data/generate_sample_data.py creates a synthetic sample_sales.csv so the demo runs with no external files.
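A seeded generator is one common way to make synthetic data reproducible. The sketch below is an assumption about what such a script could look like, not the contents of data/generate_sample_data.py: make_sample_sales, the distributions, and every column except "Order Date" are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def make_sample_sales(n_rows: int = 500, seed: int = 42) -> pd.DataFrame:
    """Build a synthetic sales table; a fixed seed gives identical output."""
    rng = np.random.default_rng(seed)
    sales = rng.gamma(shape=2.0, scale=150.0, size=n_rows).round(2)
    margin = rng.normal(loc=0.15, scale=0.05, size=n_rows)
    day_offsets = pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit="D")
    return pd.DataFrame(
        {
            "Order Date": pd.Timestamp("2025-01-01") + day_offsets,
            "Region": rng.choice(["North", "South", "East", "West"], size=n_rows),
            "Sales": sales,
            "Profit": (sales * margin).round(2),
        }
    )


sample = make_sample_sales()

# Round-trip through CSV in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_sales.csv")
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path, parse_dates=["Order Date"])

print(reloaded.head())
```

Fixing the seed is what lets a test suite and a demo notebook rely on the same rows on every run.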
Quick Start
Python
from analytics import dataloading, datacleansing, dataexploration
# 1. Load any supported source into a DataFrame
df = dataloading.load_csv("data/sample_sales.csv", parse_dates=["Order Date"])
# 2. Clean it (each step returns a new DataFrame; nothing is mutated in place)
df = datacleansing.standardize_column_names(df)
df = datacleansing.handle_missing(df, strategy="median")
# 3. Explore it
print(dataexploration.overview(df))
print(dataexploration.numeric_summary(df))
Python · predictive + persistence
from analytics import predictiveanalysis, machinelearning
result = predictiveanalysis.train_regressor(df, target="profit", model="random_forest")
print(result["metrics"]) # {'r2': ..., 'mae': ..., 'rmse': ...}
machinelearning.save_model(result["pipeline"], "profit_model.joblib")
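For readers who want to see the underlying steps, the sketch below performs a comparable train, evaluate, and persist cycle with scikit-learn and joblib directly. It assumes, without confirmation from the source, that the library's pipeline is a scikit-learn Pipeline and that save_model wraps joblib.dump; the data here is random and purely illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: a noisy linear target over three random features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing and model travel together, so the saved artifact is complete.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
    ]
)
pipeline.fit(X_train, y_train)

pred = pipeline.predict(X_test)
metrics = {
    "r2": r2_score(y_test, pred),
    "mae": mean_absolute_error(y_test, pred),
    "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
}
print(metrics)

# Persist, reload, and confirm the restored pipeline predicts identically.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profit_model.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

restored_pred = restored.predict(X_test)
```

Saving the whole pipeline rather than the bare estimator means the scaler fitted on the training data is reapplied automatically at prediction time.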
Streaming Without a Broker
The streaming module exposes one StreamSource interface, so a simulated stream and a real
Kafka/RabbitMQ source are interchangeable — your pipeline code does not change.
Python · streaming micro-batches
from analytics import datastreaming as ds, datacleansing
# A simulated event stream (finite here: 250 messages) -- swap for
# KafkaStreamSource / RabbitMQStreamSource in production (same interface).
source = ds.SimulatedStreamSource(n_messages=250)
for batch in ds.micro_batch(source, batch_size=100):
    clean = datacleansing.standardize_column_names(batch)
    print("batch rows:", len(clean))
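The interchangeable-source idea can be sketched with an abstract base class and a buffering generator. The names below (StreamSourceSketch, SimulatedSourceSketch, micro_batch_sketch) and the message fields are hypothetical; they show one way such an interface could work and are not the library's own classes.

```python
from abc import ABC, abstractmethod
from typing import Iterator

import pandas as pd


class StreamSourceSketch(ABC):
    """Hypothetical common interface: any source yields dict messages."""

    @abstractmethod
    def messages(self) -> Iterator[dict]:
        ...


class SimulatedSourceSketch(StreamSourceSketch):
    """Emits a fixed number of fake events; no broker required."""

    def __init__(self, n_messages: int) -> None:
        self.n_messages = n_messages

    def messages(self) -> Iterator[dict]:
        for i in range(self.n_messages):
            yield {"event_id": i, "amount": float(i % 7)}


def micro_batch_sketch(
    source: StreamSourceSketch, batch_size: int
) -> Iterator[pd.DataFrame]:
    """Group messages into DataFrames of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    buffer: list[dict] = []
    for message in source.messages():
        buffer.append(message)
        if len(buffer) == batch_size:
            yield pd.DataFrame(buffer)
            buffer = []
    if buffer:  # flush the final, possibly smaller, batch
        yield pd.DataFrame(buffer)


sizes = [len(b) for b in micro_batch_sketch(SimulatedSourceSketch(250), 100)]
print(sizes)  # [100, 100, 50]
```

Because the batching function depends only on the messages() method, a broker-backed source that implements the same method could be dropped in without touching the loop that consumes the batches.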
Honest Scope
The bundled dataset is synthetic (NumPy-generated) and exists only to demonstrate the pipeline — it is not real business data. The library is built for learning and as a practical starting point, not as a drop-in replacement for a managed platform.