Assignment 4: Creating Reports and Dashboards for Predictive and Prescriptive Analysis

Data Exploration and Preparation

Part 2: Prescriptive Analysis with Bike Sharing Dataset

Prediction

Mario Zamudio (NF1002499)

University of Niagara Falls Canada

Master of Data Analytics

Data Warehousing/Visualization (CPSC-510-7)

Winter 2025

Ahmed Eltahawi

March 25, 2025

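Loading Libraries

This step imports the libraries used throughout the analysis: pandas and NumPy for data handling, Matplotlib and Seaborn for plotting, and scikit-learn for model training and evaluation.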

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

Loading the Dataset and Basic Information

This section loads the cleaned dataset and previews the first five rows, so that the contents of each column can be reviewed before the next steps.

In [2]:
# Load the dataset
df = pd.read_csv("Bike_Sharing_Cleaned.csv")
df.head()
Out[2]:
       dteday  season  yr  mnth  holiday  weekday  workingday  weathersit      temp     atemp       hum  windspeed   cnt
0  2011-01-01       1   0     1        0        6           0           2  0.344167  0.363625  0.805833   0.160446   985
1  2011-01-02       1   0     1        0        0           0           2  0.363478  0.353739  0.696087   0.248539   801
2  2011-01-03       1   0     1        0        1           1           1  0.196364  0.189405  0.437273   0.248309  1349
3  2011-01-04       1   0     1        0        2           1           1  0.200000  0.212122  0.590435   0.160296  1562
4  2011-01-05       1   0     1        0        3           1           1  0.226957  0.229270  0.436957   0.186900  1600

Defining the dataset

This code prepares the data for training the models. It first drops the dteday date column, which is not used as a model input. It then separates the dataset into two parts: the features (all the input variables used to make predictions) and the target variable (the value to be predicted, here the daily bike-sharing count cnt). Finally, it splits the data into a training set, used to fit the models, and a testing set, reserved to evaluate their performance: 80% of the rows are used for training and 20% for testing. The random seed is fixed (random_state=42) so that the split is reproducible.

In [5]:
# Drop unneeded columns
df = df.drop(columns=['dteday'])

# Separate features (X) and target variable (y)
X = df.drop('cnt', axis=1)
y = df['cnt']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
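
As a quick sanity check on the split described above, the following sketch applies the same train_test_split arguments to a small synthetic frame. The frame, its 100 rows, and the names demo_df, a_train, b_train, and so on are assumptions made for illustration; this is not the bike sharing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: 100 rows, two features).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "temp": rng.random(100),
    "hum": rng.random(100),
    "cnt": rng.integers(500, 8000, size=100),
})

demo_X = demo_df.drop("cnt", axis=1)
demo_y = demo_df["cnt"]

# Same arguments as in the notebook: 20% held out, fixed seed.
a_train, a_test, ya_train, ya_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)
# Repeating the call with the same seed selects the same rows.
b_train, b_test, yb_train, yb_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

print(a_train.shape, a_test.shape)
print(a_train.index.equals(b_train.index))
```

With 100 rows, the split yields 80 training rows and 20 testing rows, and the second call reproduces the first because both use random_state=42.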

Initializing and Training Regression Models on the Full Dataset

This section creates and trains four different machine learning models using all available features in the dataset:

  • Linear Regression: A simple model that assumes a linear relationship between the features and the target variable (cnt).
  • Decision Tree Regressor: A non-linear model that splits the data into branches based on feature thresholds to make predictions.
  • Random Forest Regressor: An ensemble method that builds multiple decision trees and averages their predictions to improve accuracy and reduce overfitting.
  • Gradient Boosting Regressor: Another ensemble technique that builds trees sequentially, each one correcting the errors of the previous ones, often yielding high accuracy.

All models are trained on the training portion of the dataset (X_train, y_train). Setting random_state=42 on the three tree-based models makes their results reproducible; LinearRegression is deterministic and takes no random_state.

In [6]:
# Initialize and Train Models
lr_model = LinearRegression().fit(X_train, y_train)
dt_model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
rf_model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
gb_model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
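
To make the "averages their predictions" description of the Random Forest concrete, here is a minimal sketch on synthetic data. The data from make_regression and the names X_demo, y_demo, and forest are assumptions for illustration and stand in for X_train, y_train, and rf_model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption: stands in for X_train, y_train).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Predict with every individual tree, then average across trees.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction matches the manual average of its trees.
print(np.allclose(forest.predict(X_demo), manual_average))
```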

Predictions and Evaluation

This section uses the trained models to make predictions on the testing data. It then evaluates how well each model performed by calculating two key metrics:

  • Mean Squared Error (MSE): measures the average squared difference between the actual and predicted values. Lower values indicate better performance.
  • R² Score (coefficient of determination): indicates how much of the variability of the target variable the model explains. A value closer to 1 means the predictions track the actual values more closely.

The results for all four models are then organized into a table for easy comparison.

In [7]:
# Make Predictions
y_pred_lr = lr_model.predict(X_test)
y_pred_dt = dt_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)

# Evaluate Models
results_df = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest', 'Gradient Boosting'],
    'MSE': [
        mean_squared_error(y_test, y_pred_lr),
        mean_squared_error(y_test, y_pred_dt),
        mean_squared_error(y_test, y_pred_rf),
        mean_squared_error(y_test, y_pred_gb)
    ],
    'R2 Score': [
        r2_score(y_test, y_pred_lr),
        r2_score(y_test, y_pred_dt),
        r2_score(y_test, y_pred_rf),
        r2_score(y_test, y_pred_gb)
    ]
})
results_df
Out[7]:
               Model           MSE  R2 Score
0  Linear Regression  7.125135e+05  0.822311
1      Decision Tree  1.017923e+06  0.746146
2      Random Forest  4.618915e+05  0.884812
3  Gradient Boosting  4.289112e+05  0.893036
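
The two metrics can also be computed by hand, which makes their definitions explicit. In the sketch below, the actual values are the first five cnt values shown earlier, while the predicted values are hypothetical numbers chosen only for illustration; they are not outputs of any model in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# First five cnt values from the dataset preview.
actual = np.array([985.0, 801.0, 1349.0, 1562.0, 1600.0])
# Hypothetical predictions, for illustration only.
predicted = np.array([1000.0, 900.0, 1300.0, 1500.0, 1700.0])

# MSE: mean of the squared residuals.
mse_manual = np.mean((actual - predicted) ** 2)

# R²: 1 minus the residual sum of squares over the total sum of squares.
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both manual values agree with scikit-learn's implementations.
print(mse_manual, mean_squared_error(actual, predicted))
print(r2_manual, r2_score(actual, predicted))
```

Because MSE is expressed in squared units of the target, it is the ratio form of R² that makes models comparable on a 0-to-1 style scale.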

Visualizing the Most Influential Features in the Models

This section calculates and displays the importance of each feature for the three tree-based models (Random Forest, Decision Tree, and Gradient Boosting) when predicting the number of sharings. For each model, it builds a table of features ranked by importance, then plots the 10 most important features as a horizontal bar chart. This helps identify which variables are most influential in determining the number of sharings.

In [25]:
# Feature Importance from Random Forest
feature_importance_rf = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 3))
sns.barplot(data=feature_importance_rf.head(10), x='Importance', y='Feature')
plt.title('Top 10 Feature Importances from Random Forest')
plt.tight_layout()
plt.show()

# Feature Importance from Decision Tree
feature_importance_dt = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 3))
sns.barplot(data=feature_importance_dt.head(10), x='Importance', y='Feature')
plt.title('Top 10 Feature Importances from Decision Tree')
plt.tight_layout()
plt.show()

# Feature Importance from Gradient Boosting
feature_importance_gb = pd.DataFrame({
    'Feature': X.columns,
    'Importance': gb_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 3))
sns.barplot(data=feature_importance_gb.head(10), x='Importance', y='Feature')
plt.title('Top 10 Feature Importances from Gradient Boosting')
plt.tight_layout()
plt.show()
[Figure: Top 10 Feature Importances from Random Forest]
[Figure: Top 10 Feature Importances from Decision Tree]
[Figure: Top 10 Feature Importances from Gradient Boosting]
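
The three importance tables follow the same recipe, so they could be built with one helper. The sketch below is one possible way to do that on synthetic data; the helper name importance_table, the feature names f0 to f3, and the make_regression data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


def importance_table(model, feature_names):
    """Return features ranked by a fitted model's feature_importances_."""
    return (
        pd.DataFrame({
            "Feature": feature_names,
            "Importance": model.feature_importances_,
        })
        .sort_values(by="Importance", ascending=False)
        .reset_index(drop=True)
    )


# Synthetic data (assumption): 4 features, only 2 of which are informative.
X_arr, y_arr = make_regression(
    n_samples=300, n_features=4, n_informative=2, random_state=42
)
names = ["f0", "f1", "f2", "f3"]

demo_models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

tables = {}
for label, model in demo_models.items():
    model.fit(X_arr, y_arr)
    tables[label] = importance_table(model, names)
    print(label)
    print(tables[label])
```

In scikit-learn these impurity-based importances are normalized to sum to 1 within each model, so they describe the relative weight of features inside one model and are not directly comparable in magnitude across models.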

Analyzing Feature Relationships with the Sharing Count Using a Correlation Heatmap

This part of the analysis computes the correlation between all numeric variables in the dataset and keeps the column for the target variable cnt. It then generates a heatmap showing how strongly each feature is linearly related to the count of sharings, sorted from the most positive to the most negative correlation. Positive values mean the count tends to rise with the feature, while negative values indicate an inverse relationship. The heatmap highlights the variables with the strongest linear association with cnt, which aids feature selection and understanding of the data.

In [16]:
correlation = df.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(
    correlation[['cnt']].sort_values(by='cnt', ascending=False),
    annot=True,
    cmap='coolwarm',
    fmt=".2f"
)
plt.title('Correlation of Features with Count Sharings')
plt.tight_layout()
plt.show()
[Figure: Correlation of Features with Count Sharings]
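
Since df.corr computes Pearson correlation by default, the heatmap only reflects linear association. The following sketch, on synthetic data that is an assumption for illustration, shows a variable that depends entirely on x yet has a correlation of essentially zero with it.

```python
import numpy as np
import pandas as pd

# Synthetic example: both columns are exact functions of x.
x = np.linspace(-1.0, 1.0, 201)
demo = pd.DataFrame({
    "x": x,
    "linear": 2.0 * x + 1.0,   # perfectly linear in x
    "quadratic": x ** 2,       # fully determined by x, but not linearly
})

corr_with_x = demo.corr(numeric_only=True)["x"]
print(corr_with_x)
```

A feature with a low value in the heatmap can therefore still be useful to the tree-based models, which do not assume linearity.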

Exploring the Top 5 Features Most Correlated with Quantity of Sharings

This section identifies the 5 features in the dataset that have the highest positive correlation with the target variable cnt. Once these features are selected, it generates a series of scatter plots, one for each feature, showing how the count of sharings varies with respect to each of them. These visualizations help reveal patterns, trends, or potential outliers in the data, and are useful for understanding the nature of the relationships (linear or non-linear) between the most correlated features and the quantity of sharings.

In [18]:
top_corr_features = (
    df.corr(numeric_only=True)['cnt']
    .drop('cnt')
    .sort_values(ascending=False)
    .head(5)
    .index
    .tolist()
)

plt.figure(figsize=(18, 20))
for i, feature in enumerate(top_corr_features):
    plt.subplot(5, 2, i + 1)
    sns.scatterplot(data=df, x=feature, y='cnt', alpha=0.6)
    plt.title(f'{feature} vs cnt')
plt.tight_layout()
plt.show()
[Figure: scatter plots of each of the top 5 correlated features vs cnt]
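
The selection above keeps the five largest signed correlations, so a feature with a strong negative correlation would be left out. A possible variant, not used in this notebook, ranks features by absolute correlation instead. The sketch below contrasts the two rankings on a synthetic frame; the columns warm, wind, and noise and the formula for cnt are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
warm = rng.random(n)
wind = rng.random(n)
noise = rng.random(n)

# Synthetic target (assumption): rises with "warm", falls with "wind".
toy = pd.DataFrame({
    "warm": warm,
    "wind": wind,
    "noise": noise,
    "cnt": 1000 + 3000 * warm - 2500 * wind + rng.normal(0, 100, n),
})

corr_cnt = toy.corr(numeric_only=True)["cnt"].drop("cnt")

# Signed ranking, as in the notebook: negative correlations sort last.
top_signed = corr_cnt.sort_values(ascending=False).head(2).index.tolist()

# Absolute ranking: strong negative correlations are kept.
top_abs = corr_cnt.abs().sort_values(ascending=False).head(2).index.tolist()

print(top_signed)
print(top_abs)
```

Here the signed ranking drops wind even though it is strongly related to the target, while the absolute ranking keeps it.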

Model Evaluation Using Only the Top 5 Correlated Features

This section narrows the models' input to just the 5 features most strongly correlated with the count of sharings (cnt). It performs the following steps:

  • Feature Selection: Uses only the top 5 correlated features as input variables (X_top5) and keeps cnt as the target.
  • Train-Test Split: Splits this reduced dataset into training and testing sets (80/20).
  • Model Training: Trains four different regression models using the reduced feature set:
    • Linear Regression
    • Decision Tree
    • Random Forest
    • Gradient Boosting
  • Prediction & Evaluation: Each model makes predictions on the test set. Their performance is measured using:
    • Mean Squared Error (MSE): Measures the average squared difference between actual and predicted counts.
    • R² Score: Indicates how much of the variability in the count of sharings is explained by the model.
  • Comparison: The new results are combined with the previous full-feature model results to directly compare performance when using all features vs. only the top 5.

This helps assess whether reducing feature dimensionality significantly impacts model accuracy and helps balance performance vs. simplicity.

In [22]:
X_top5 = df[top_corr_features]
y_top5 = df['cnt']

X_train_5, X_test_5, y_train_5, y_test_5 = train_test_split(
    X_top5, y_top5, test_size=0.2, random_state=42
)

lr_5 = LinearRegression().fit(X_train_5, y_train_5)
dt_5 = DecisionTreeRegressor(random_state=42).fit(X_train_5, y_train_5)
rf_5 = RandomForestRegressor(random_state=42).fit(X_train_5, y_train_5)
gb_5 = GradientBoostingRegressor(random_state=42).fit(X_train_5, y_train_5)

y_pred_lr_5 = lr_5.predict(X_test_5)
y_pred_dt_5 = dt_5.predict(X_test_5)
y_pred_rf_5 = rf_5.predict(X_test_5)
y_pred_gb_5 = gb_5.predict(X_test_5)

results_top5 = {
    'Model': ['Linear Regression (Top 5)', 'Decision Tree (Top 5)',
              'Random Forest (Top 5)', 'Gradient Boosting (Top 5)'],
    'MSE': [
        mean_squared_error(y_test_5, y_pred_lr_5),
        mean_squared_error(y_test_5, y_pred_dt_5),
        mean_squared_error(y_test_5, y_pred_rf_5),
        mean_squared_error(y_test_5, y_pred_gb_5)
    ],
    'R2 Score': [
        r2_score(y_test_5, y_pred_lr_5),
        r2_score(y_test_5, y_pred_dt_5),
        r2_score(y_test_5, y_pred_rf_5),
        r2_score(y_test_5, y_pred_gb_5)
    ]
}

combined_results = pd.concat(
    [results_df, pd.DataFrame(results_top5)],
    axis=0
).reset_index(drop=True)

combined_results
Out[22]:
                       Model           MSE  R2 Score
0          Linear Regression  7.125135e+05  0.822311
1              Decision Tree  1.017923e+06  0.746146
2              Random Forest  4.618915e+05  0.884812
3          Gradient Boosting  4.289112e+05  0.893036
4  Linear Regression (Top 5)  9.593554e+05  0.760752
5      Decision Tree (Top 5)  1.171677e+06  0.707803
6      Random Forest (Top 5)  8.388508e+05  0.790804
7  Gradient Boosting (Top 5)  8.414969e+05  0.790144
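
To read the comparison more directly, the R² values can be placed side by side and differenced. The sketch below copies the R² scores from the table above; the names r2_full, r2_top5, and comparison are introduced here for illustration.

```python
import pandas as pd

# R² scores copied from the combined results table above.
r2_full = pd.Series({
    "Linear Regression": 0.822311,
    "Decision Tree": 0.746146,
    "Random Forest": 0.884812,
    "Gradient Boosting": 0.893036,
})
r2_top5 = pd.Series({
    "Linear Regression": 0.760752,
    "Decision Tree": 0.707803,
    "Random Forest": 0.790804,
    "Gradient Boosting": 0.790144,
})

# One row per model, with the loss in R² from dropping to 5 features.
comparison = pd.DataFrame({"Full": r2_full, "Top 5": r2_top5})
comparison["R2 drop"] = comparison["Full"] - comparison["Top 5"]

print(comparison.sort_values("R2 drop", ascending=False))
```

On these reported numbers, every model loses accuracy with the reduced feature set, and the two ensemble models lose the most (roughly 0.09 to 0.10 in R²).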

Visual Comparison of R² Scores Across Models

This part creates a bar chart to visually compare the R² scores of different machine learning models trained on:

  • The full set of features
  • The top 5 features most correlated with count of sharings

Each bar represents a model's R² score, which indicates how well the model explains the variability in the count of sharings. Higher values (closer to 1) mean better performance. The chart helps quickly identify which models perform best and how much accuracy is affected by reducing the feature set. Assigning hue='Model' together with legend=False avoids the deprecation warning that recent Seaborn versions (0.13 and later) raise when palette is passed without hue.

In [23]:
# Bar Chart: R² Score Comparison (Future-Proof)
plt.figure(figsize=(12, 6))
sns.barplot(
    data=combined_results,
    x='Model',
    y='R2 Score',
    hue='Model',
    palette='Blues_d',
    legend=False
)
plt.title('R² Score Comparison of Models (Full vs Top 5 Features)')
plt.ylabel('R² Score')
plt.xlabel('Model')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
[Figure: R² Score Comparison of Models (Full vs Top 5 Features)]

Visual Comparison of Mean Squared Error (MSE) Across Models

This section generates a bar chart to compare the prediction error of different models using:

  • All available features
  • Only the top 5 features most correlated with count of sharings

The Mean Squared Error (MSE) quantifies the average squared difference between the predicted and actual counts of sharings. Lower values indicate more accurate models. This visualization helps quickly identify which models produce the smallest prediction errors, and whether reducing the number of input features leads to a significant change in model accuracy. As in the previous chart, hue='Model' is assigned with legend=False to stay compatible with recent versions of Seaborn.

In [24]:
# Bar Chart: MSE Comparison (Future-Proof)
plt.figure(figsize=(12, 6))
sns.barplot(
    data=combined_results,
    x='Model',
    y='MSE',
    hue='Model',
    palette='Reds_d',
    legend=False
)
plt.title('MSE Comparison of Models (Full vs Top 5 Features)')
plt.ylabel('Mean Squared Error')
plt.xlabel('Model')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
[Figure: MSE Comparison of Models (Full vs Top 5 Features)]
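
MSE is expressed in squared units, which makes the values above hard to interpret. Taking the square root gives the root mean squared error (RMSE), which is in the same units as cnt. The sketch below applies this to the MSE values reported in the combined results table of this notebook; RMSE is an additional metric introduced here for illustration and was not computed in the original analysis.

```python
import numpy as np
import pandas as pd

# MSE values copied from the combined results table in this notebook.
mse_reported = pd.Series({
    "Linear Regression": 7.125135e+05,
    "Decision Tree": 1.017923e+06,
    "Random Forest": 4.618915e+05,
    "Gradient Boosting": 4.289112e+05,
    "Linear Regression (Top 5)": 9.593554e+05,
    "Decision Tree (Top 5)": 1.171677e+06,
    "Random Forest (Top 5)": 8.388508e+05,
    "Gradient Boosting (Top 5)": 8.414969e+05,
})

# RMSE is the square root of MSE, in the same units as the target.
rmse = np.sqrt(mse_reported)
print(rmse.round(1).sort_values())
```

For the model with the lowest reported MSE (Gradient Boosting on all features), this corresponds to an RMSE of roughly 655 sharings per day.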