Data Analytics Library – Mario Zamudio

Data Analytics Library — A Reference Toolkit for the Analytics Lifecycle

Python Library · pandas · scikit-learn · Tested & CI · June 2026

Data Analytics Library — the analytics lifecycle as a Python pipeline

A small, heavily-documented Python library that implements the standard stages of a data-analytics project as composable modules. Every stage takes and/or returns a pandas DataFrame, so the modules chain cleanly into a single pipeline. It is written as a reference base and a practical starting point — each module and every non-obvious line is commented in English — and it is the code companion to the Data Analytics: Tackling the Data series.


The Analytics Lifecycle It Implements

Ten modules cover the lifecycle from ingestion through machine learning; the four analysis modules each answer one question:

Data Loading (dataloading) — CSV, text, Excel, JSON, Parquet, and SQL into a DataFrame.
Streaming Ingestion (datastreaming) — Kafka / RabbitMQ / simulated streams consumed as micro-batch DataFrames behind one common interface.
Data Cleansing (datacleansing) — column names, whitespace, missing values, duplicates, types, outliers.
Data Exploration (dataexploration) — numeric/categorical summaries and correlations (EDA).
Data Visualization (datavisualization) — basic and advanced charts (matplotlib / seaborn).
Descriptive (descriptiveanalysis) — aggregation, KPIs, trends: what happened?
Diagnostic (diagnosticanalysis) — correlation-to-target, t-test, chi-square, drill-down: why did it happen?
Predictive (predictiveanalysis) — regression, classification, clustering: what will happen?
Prescriptive (prescriptiveanalysis) — scenarios, rules, budget optimization: what should we do?
Machine Learning (machinelearning) — reusable split / preprocess / pipeline / cross-validate / persist.

Design Principles

One standard currency: every loader returns a pandas DataFrame; every transformer takes and returns one.
Non-mutating transformers: cleansing functions copy() their input, so pipelines are reproducible and side-effect free.
Fail loudly: invalid input raises a clear exception instead of silently returning None.
Optional dependencies stay optional: plotting, SQL, and stats imports are lazy and raise a helpful message if a package is missing.
Separation of concerns: numbers (dataexploration) and charts (datavisualization) live in different modules.
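
The principles above can be sketched in a few lines. This is a minimal illustration, not the library's actual code: strip_whitespace and require are hypothetical names invented here, and the error messages are assumptions about what a "clear exception" and a "helpful message" might look like.

```python
import importlib

import pandas as pd


def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformer: trim string cells without touching the input."""
    # Fail loudly: reject anything that is not a DataFrame up front.
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"expected a pandas DataFrame, got {type(df).__name__}")
    out = df.copy()  # non-mutating: all changes happen on a copy
    for col in out.columns:
        if out[col].dtype == object or pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out


def require(package: str, purpose: str):
    """Hypothetical lazy-import helper: import only when the feature is used."""
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        raise ImportError(
            f"{package!r} is required for {purpose}; "
            f"install it with `pip install {package}`"
        ) from exc


raw = pd.DataFrame({"city": ["  Lima ", "Quito"], "sales": [10, 20]})
clean = strip_whitespace(raw)
print(raw["city"].tolist())    # the input still holds the padded strings
print(clean["city"].tolist())  # the copy holds the trimmed strings
```

A plotting function written this way would call something like require("matplotlib", "plotting") inside its body, so importing the package never fails just because an optional extra is absent.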

Engineering & Quality

Tested: a pytest suite of 81 tests, one file per module; plotting tests run headless via the Agg backend.
CI: GitHub Actions runs the suite on Python 3.10–3.12 and executes the demo notebook on every push.
Packaged: pinned requirements.txt and a pyproject.toml; isolated venv setup documented in the README.
Reproducible data: data/generate_sample_data.py creates a synthetic sample_sales.csv so the demo runs with no external files.
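
For a sense of how a reproducible synthetic dataset can be produced, here is a rough sketch using a seeded NumPy generator. It is not the contents of data/generate_sample_data.py: the function name, the distributions, and every column except "Order Date" are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def make_sample_sales(n_rows: int = 500, seed: int = 42) -> pd.DataFrame:
    """Build a synthetic sales table; column names are illustrative assumptions."""
    rng = np.random.default_rng(seed)  # fixed seed: identical data on every run
    sales = rng.gamma(shape=2.0, scale=150.0, size=n_rows).round(2)
    margin = rng.normal(loc=0.15, scale=0.10, size=n_rows)
    day_offsets = pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit="D")
    return pd.DataFrame(
        {
            "Order Date": pd.Timestamp("2024-01-01") + day_offsets,
            "Region": rng.choice(["North", "South", "East", "West"], size=n_rows),
            "Sales": sales,
            "Profit": (sales * margin).round(2),
        }
    )


df = make_sample_sales()
print(df.head())
# df.to_csv("data/sample_sales.csv", index=False)  # uncomment to write the file
```

Seeding the generator is what makes the demo repeatable: two calls with the same seed return equal DataFrames.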

Quick Start

Python

from analytics import dataloading, datacleansing, dataexploration

# 1. Load any supported source into a DataFrame
df = dataloading.load_csv("data/sample_sales.csv", parse_dates=["Order Date"])

# 2. Clean it (each step returns a new DataFrame; nothing is mutated in place)
df = datacleansing.standardize_column_names(df)
df = datacleansing.handle_missing(df, strategy="median")

# 3. Explore it
print(dataexploration.overview(df))
print(dataexploration.numeric_summary(df))

Python · predictive + persistence

from analytics import predictiveanalysis, machinelearning

result = predictiveanalysis.train_regressor(df, target="profit", model="random_forest")
print(result["metrics"])                       # {'r2': ..., 'mae': ..., 'rmse': ...}

machinelearning.save_model(result["pipeline"], "profit_model.joblib")
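
To make the returned dictionary less of a black box, the following sketch shows one way such a function could be assembled from scikit-learn parts. It is an assumption-laden stand-in, not the library's implementation: train_regressor_sketch is a hypothetical name, it uses numeric features only, and the 80/20 split and median imputation are choices made here for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def train_regressor_sketch(df: pd.DataFrame, target: str, seed: int = 0) -> dict:
    """Hypothetical stand-in: fit a random forest and report hold-out metrics."""
    if target not in df.columns:
        raise KeyError(f"target column {target!r} not found")
    X = df.drop(columns=[target]).select_dtypes(include="number")
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Bundling preprocessing with the model means one object can be persisted.
    pipeline = Pipeline(
        [
            ("impute", SimpleImputer(strategy="median")),
            ("model", RandomForestRegressor(n_estimators=50, random_state=seed)),
        ]
    )
    pipeline.fit(X_train, y_train)
    pred = pipeline.predict(X_test)
    metrics = {
        "r2": float(r2_score(y_test, pred)),
        "mae": float(mean_absolute_error(y_test, pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }
    return {"pipeline": pipeline, "metrics": metrics}


# Tiny synthetic frame so the sketch runs on its own.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"sales": rng.uniform(100, 1000, 200)})
demo["profit"] = 0.2 * demo["sales"] + rng.normal(0, 5, 200)

result = train_regressor_sketch(demo, target="profit")
print(sorted(result["metrics"]))

# Persistence round trip with joblib, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profit_model.joblib")
    joblib.dump(result["pipeline"], path)
    restored = joblib.load(path)
    restored_pred = restored.predict(demo[["sales"]])
```

Because the imputer and the model travel together in one Pipeline, the reloaded object can score raw feature columns without repeating any preprocessing by hand.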

Streaming Without a Broker

The streaming module exposes one StreamSource interface, so a simulated stream and a real Kafka/RabbitMQ source are interchangeable — your pipeline code does not change.

Python · streaming micro-batches

from analytics import datastreaming as ds, datacleansing

# A simulated event stream, capped here at 250 messages; swap in
# KafkaStreamSource / RabbitMQStreamSource in production (same interface).
source = ds.SimulatedStreamSource(n_messages=250)

for batch in ds.micro_batch(source, batch_size=100):
    clean = datacleansing.standardize_column_names(batch)
    print("batch rows:", len(clean))
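
The interchangeability described above comes down to every source exposing the same method. The sketch below shows one possible shape for that idea; it is not the module's real code. The messages() method, the CountingSource class, and the message fields are all hypothetical, and only the names StreamSource and micro_batch are borrowed from the text.

```python
import itertools
from typing import Iterator, Protocol

import pandas as pd


class StreamSource(Protocol):
    """Assumed interface: anything that yields dict-like messages."""

    def messages(self) -> Iterator[dict]: ...


class CountingSource:
    """Toy finite source standing in for a simulated stream."""

    def __init__(self, n_messages: int) -> None:
        self.n_messages = n_messages

    def messages(self) -> Iterator[dict]:
        for i in range(self.n_messages):
            yield {"event_id": i, "amount": float(i % 7)}


def micro_batch(source: StreamSource, batch_size: int) -> Iterator[pd.DataFrame]:
    """Group messages into DataFrames of at most batch_size rows."""
    stream = source.messages()
    while True:
        chunk = list(itertools.islice(stream, batch_size))
        if not chunk:  # source exhausted
            return
        yield pd.DataFrame(chunk)


sizes = [len(batch) for batch in micro_batch(CountingSource(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Since micro_batch depends only on messages(), a broker-backed class with the same method could replace CountingSource without any change to the loop that consumes the batches.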

Honest Scope

The bundled dataset is synthetic (NumPy-generated) and exists only to demonstrate the pipeline — it is not real business data. The library is built for learning and as a practical starting point, not as a drop-in replacement for a managed platform.
