
Data Analytics Library — A Reference Toolkit for the Analytics Lifecycle

Python Library · pandas · scikit-learn · Tested & CI · June 2026

Data Analytics Library — the analytics lifecycle as a Python pipeline

A small, heavily documented Python library that implements the standard stages of a data-analytics project as composable modules. Every stage takes a pandas DataFrame, returns one, or both, so the modules chain cleanly into a single pipeline. It is written as a reference base and a practical starting point: each module and every non-obvious line is commented in English. It is the code companion to the Data Analytics: Tackling the Data series.
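To make the "DataFrame in, DataFrame out" idea concrete, here is a minimal sketch of how such stages can be chained with pandas' built-in `DataFrame.pipe`. The two stage functions below are hypothetical stand-ins written for this illustration; they are not the library's own implementations, and the column names are made up.

```python
import pandas as pd


def standardize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stage: lower-case names and replace spaces with underscores."""
    out = df.copy()  # work on a copy so the caller's frame is untouched
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    return out


def fill_missing_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stage: fill gaps in numeric columns with the column median."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


# Illustrative input only; not the bundled dataset.
raw = pd.DataFrame({"Order ID": [1, 2, 3], "Unit Price": [10.0, None, 30.0]})

# Each stage is DataFrame -> DataFrame, so DataFrame.pipe chains them in order.
clean = raw.pipe(standardize_column_names).pipe(fill_missing_with_median)
print(clean)
```

Because every stage copies before modifying, `raw` still holds its original column names and its missing value after the chain runs, which is what makes the stages safe to reorder or reuse.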


The Analytics Lifecycle It Implements

Ten modules covering ingestion through machine learning — each answering one question:

Design Principles

Engineering & Quality

Quick Start

Python

from analytics import dataloading, datacleansing, dataexploration

# 1. Load any supported source into a DataFrame
df = dataloading.load_csv("data/sample_sales.csv", parse_dates=["Order Date"])

# 2. Clean it (each step returns a new DataFrame; nothing is mutated in place)
df = datacleansing.standardize_column_names(df)
df = datacleansing.handle_missing(df, strategy="median")

# 3. Explore it
print(dataexploration.overview(df))
print(dataexploration.numeric_summary(df))
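The exploration step above can be approximated with plain pandas. The sketch below shows one plausible shape for a column overview, followed by pandas' own `describe()` for the numeric summary. `overview_sketch` is a hypothetical helper for illustration; what `dataexploration.overview` and `dataexploration.numeric_summary` actually return is not specified here.

```python
import pandas as pd


def overview_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with dtype, missing and unique counts."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "unique": df.nunique(),  # NaN is not counted as a distinct value
        }
    )


# Illustrative input only; not the bundled dataset.
sample = pd.DataFrame(
    {"region": ["east", "west", "east"], "profit": [12.5, None, 7.0]}
)

summary = overview_sketch(sample)
print(summary)

# describe() is pandas' built-in numeric summary (count, mean, std, quartiles).
print(sample.describe())
```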

Python · predictive + persistence

from analytics import predictiveanalysis, machinelearning

result = predictiveanalysis.train_regressor(df, target="profit", model="random_forest")
print(result["metrics"])                       # {'r2': ..., 'mae': ..., 'rmse': ...}

machinelearning.save_model(result["pipeline"], "profit_model.joblib")
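For readers who want to see what a train-then-persist step can look like underneath, here is a hedged sketch built directly on scikit-learn and joblib. `train_regressor_sketch` is a hypothetical stand-in, not the library's `train_regressor`; the train/test split, the scaler, the forest size, and the synthetic columns are all assumptions chosen for the example. Only the result keys (`pipeline`, `metrics`) and the metric names mirror the snippet above.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def train_regressor_sketch(df: pd.DataFrame, target: str) -> dict:
    """Hypothetical stand-in: fit a pipeline and report held-out metrics."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    pipeline = Pipeline(
        [
            ("scale", StandardScaler()),
            ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
        ]
    )
    pipeline.fit(X_train, y_train)
    pred = pipeline.predict(X_test)
    metrics = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }
    return {"pipeline": pipeline, "metrics": metrics}


# Synthetic numeric data, generated only so the sketch runs end to end.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"sales": rng.uniform(100, 1000, 200), "discount": rng.uniform(0, 0.3, 200)}
)
demo["profit"] = 0.2 * demo["sales"] - 50 * demo["discount"] + rng.normal(0, 5, 200)

result = train_regressor_sketch(demo, target="profit")
print(result["metrics"])

# Persist the fitted pipeline, reload it, and check the predictions match.
features = demo[["sales", "discount"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profit_model.joblib")
    joblib.dump(result["pipeline"], path)
    restored = joblib.load(path)

same = bool(np.allclose(restored.predict(features), result["pipeline"].predict(features)))
print("round trip ok:", same)
```

Saving the whole pipeline rather than the bare estimator keeps the preprocessing and the model together, so a reloaded artifact can score raw feature columns without any extra setup.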

Streaming Without a Broker

The streaming module exposes one StreamSource interface, so a simulated stream and a real Kafka/RabbitMQ source are interchangeable — your pipeline code does not change.

Python · streaming micro-batches

from analytics import datastreaming as ds, datacleansing

# A simulated event stream, capped here at 250 messages -- swap for
# KafkaStreamSource / RabbitMQStreamSource in production (same interface).
source = ds.SimulatedStreamSource(n_messages=250)

for batch in ds.micro_batch(source, batch_size=100):
    clean = datacleansing.standardize_column_names(batch)
    print("batch rows:", len(clean))
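One way to get that interchangeability is an abstract base class that every source implements, plus a batching generator that only talks to the base class. The sketch below is an assumption about the general pattern, not the library's code: the class names, the `messages()` method, and the message fields are all hypothetical.

```python
from abc import ABC, abstractmethod
from itertools import islice
from typing import Iterator

import pandas as pd


class StreamSourceSketch(ABC):
    """Hypothetical interface: anything that can yield messages as dicts."""

    @abstractmethod
    def messages(self) -> Iterator[dict]:
        ...


class SimulatedSourceSketch(StreamSourceSketch):
    """Hypothetical simulated source that emits a fixed number of fake events."""

    def __init__(self, n_messages: int) -> None:
        self.n_messages = n_messages

    def messages(self) -> Iterator[dict]:
        for i in range(self.n_messages):
            yield {"Event ID": i, "Amount": float(i % 7)}


def micro_batch_sketch(
    source: StreamSourceSketch, batch_size: int
) -> Iterator[pd.DataFrame]:
    """Group messages into DataFrames of at most batch_size rows."""
    stream = source.messages()
    while True:
        rows = list(islice(stream, batch_size))
        if not rows:  # source exhausted
            return
        yield pd.DataFrame(rows)


sizes = [len(b) for b in micro_batch_sketch(SimulatedSourceSketch(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```

A broker-backed source would subclass the same base and yield decoded messages from its consumer loop; the batching generator and everything downstream of it would stay unchanged.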

Honest Scope

The bundled dataset is synthetic (NumPy-generated) and exists only to demonstrate the pipeline — it is not real business data. The library is built for learning and as a practical starting point, not as a drop-in replacement for a managed platform.
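As a rough idea of what "NumPy-generated" can mean in practice, the sketch below builds a small synthetic sales table from a seeded random generator. It is not the script that produced the bundled dataset: the distributions, row count, and the `Region` and `Sales` columns are assumptions for illustration, while `Order Date` and `Profit` echo the names used in the Quick Start.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # seeded so the table is reproducible
n_rows = 500

# Skewed positive sales amounts and a noisy profit margin around 15%.
sales = rng.gamma(shape=2.0, scale=150.0, size=n_rows).round(2)
margin = rng.normal(loc=0.15, scale=0.05, size=n_rows)

# Random order dates spread across one year.
day_offsets = pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit="D")

synthetic = pd.DataFrame(
    {
        "Order Date": pd.Timestamp("2025-01-01") + day_offsets,
        "Region": rng.choice(["North", "South", "East", "West"], size=n_rows),
        "Sales": sales,
        "Profit": (sales * margin).round(2),
    }
)

print(synthetic.head())
```

Data built this way has the right shape for exercising a pipeline, but any pattern a model finds in it is one the generator put there, which is why results on it say nothing about a real business.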

Resources
