Assignment 4: Creating Reports and Dashboards for Predictive and Prescriptive Analysis¶

Data Exploration and Preparation¶

Part 2: Prescriptive Analysis with Bike Sharing Dataset¶

Preparation¶

Mario Zamudio (NF1002499)¶

University of Niagara Falls Canada

Master of Data Analytics

Data Warehousing/Visualization (CPSC-510-7)

Winter 2025

Ahmed Eltahawi¶

March 25, 2025

Loading Libraries to use¶

In this step, we import all the libraries used throughout the process.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

Loading Dataset and Basic information¶

In this section, the data is loaded and lightly preprocessed so that the contents of the columns can be reviewed in the following steps.

  • Data Loading: Reads a CSV file named Bike Sharing Dataset.csv and loads it into a DataFrame for analysis.
  • Shape Inspection: Displays the number of rows and columns in the dataset to understand its size and structure.
  • Column Removal: Drops 'instant', 'casual', and 'registered': 'instant' is just a record identifier, while 'casual' and 'registered' are the two components of the target 'cnt' and would leak information into any model.
  • Data Summary: Prints structural information about the dataset, including column names, data types, and non-null counts, helping to confirm data integrity.
  • Preview: Shows the first five rows of the dataset to get an overview of the data’s contents and layout.
  • Descriptive Statistics: Displays summary statistics (like mean, median, min, max, etc.) for both numerical and categorical variables, rounded to one decimal for better readability.
In [2]:
# Load dataset
df = pd.read_csv("Bike Sharing Dataset.csv")

# Print the shape of the dataset (rows, columns) to understand its size.
print("\n--- Shape of Dataset ---")
print(df.shape)

# Drop unneeded columns
df = df.drop(columns=['instant', 'casual', 'registered'])

# Display basic information about the dataset.
# This includes column names, non-null counts, and data types.
# It's useful for checking the integrity and structure of the data.
print("\n--- Dataset Information ---")
df.info()

# Display the first five rows of the dataset.
# Helps in gaining a quick look at the structure and sample values.
print("\n--- First five rows ---")
print(df.head())

# Summary statistics
# Provides descriptive statistics (count, mean, std, min, max, etc.)
# for both numerical and categorical columns.
# Rounded to 1 decimal for better readability.
print("\n--- Summary Statistics ---")
print(df.describe(include='all').round(1))
--- Shape of Dataset ---
(731, 16)

--- Dataset Information ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dteday      731 non-null    object 
 1   season      731 non-null    int64  
 2   yr          731 non-null    int64  
 3   mnth        731 non-null    int64  
 4   holiday     731 non-null    int64  
 5   weekday     731 non-null    int64  
 6   workingday  731 non-null    int64  
 7   weathersit  731 non-null    int64  
 8   temp        731 non-null    float64
 9   atemp       731 non-null    float64
 10  hum         731 non-null    float64
 11  windspeed   731 non-null    float64
 12  cnt         731 non-null    int64  
dtypes: float64(4), int64(8), object(1)
memory usage: 74.4+ KB

--- First five rows ---
       dteday  season  yr  mnth  holiday  weekday  workingday  weathersit  \
0  2011-01-01       1   0     1        0        6           0           2   
1  2011-01-02       1   0     1        0        0           0           2   
2  2011-01-03       1   0     1        0        1           1           1   
3  2011-01-04       1   0     1        0        2           1           1   
4  2011-01-05       1   0     1        0        3           1           1   

       temp     atemp       hum  windspeed   cnt  
0  0.344167  0.363625  0.805833   0.160446   985  
1  0.363478  0.353739  0.696087   0.248539   801  
2  0.196364  0.189405  0.437273   0.248309  1349  
3  0.200000  0.212122  0.590435   0.160296  1562  
4  0.226957  0.229270  0.436957   0.186900  1600  

--- Summary Statistics ---
            dteday  season     yr   mnth  holiday  weekday  workingday  \
count          731   731.0  731.0  731.0    731.0    731.0       731.0   
unique         731     NaN    NaN    NaN      NaN      NaN         NaN   
top     2011-01-01     NaN    NaN    NaN      NaN      NaN         NaN   
freq             1     NaN    NaN    NaN      NaN      NaN         NaN   
mean           NaN     2.5    0.5    6.5      0.0      3.0         0.7   
std            NaN     1.1    0.5    3.5      0.2      2.0         0.5   
min            NaN     1.0    0.0    1.0      0.0      0.0         0.0   
25%            NaN     2.0    0.0    4.0      0.0      1.0         0.0   
50%            NaN     3.0    1.0    7.0      0.0      3.0         1.0   
75%            NaN     3.0    1.0   10.0      0.0      5.0         1.0   
max            NaN     4.0    1.0   12.0      1.0      6.0         1.0   

        weathersit   temp  atemp    hum  windspeed     cnt  
count        731.0  731.0  731.0  731.0      731.0   731.0  
unique         NaN    NaN    NaN    NaN        NaN     NaN  
top            NaN    NaN    NaN    NaN        NaN     NaN  
freq           NaN    NaN    NaN    NaN        NaN     NaN  
mean           1.4    0.5    0.5    0.6        0.2  4504.3  
std            0.5    0.2    0.2    0.1        0.1  1937.2  
min            1.0    0.1    0.1    0.0        0.0    22.0  
25%            1.0    0.3    0.3    0.5        0.1  3152.0  
50%            1.0    0.5    0.5    0.6        0.2  4548.0  
75%            2.0    0.7    0.6    0.7        0.2  5956.0  
max            3.0    0.9    0.8    1.0        0.5  8714.0  
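One detail visible in the information above: `dteday` is stored as an object (string) column. A minimal sketch, on a hypothetical two-row frame, of how it could be converted to datetime to enable date-based grouping later:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset's dteday column.
demo = pd.DataFrame({'dteday': ['2011-01-01', '2011-01-02']})
demo['dteday'] = pd.to_datetime(demo['dteday'])

print(demo['dteday'].dtype)             # datetime64[ns]
print(demo['dteday'].dt.year.tolist())  # [2011, 2011]
```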

Checking Missing Values¶

After the first look at the data, this step identifies whether any values are missing, which is crucial for data cleaning.

In [3]:
# Check for missing values
# Identifies if there are any missing values in the dataset, which is crucial for data cleaning.
print("\nMissing Values:")
print(df.isnull().sum())
Missing Values:
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
cnt           0
dtype: int64
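No values are missing here, but had the check flagged any, the `SimpleImputer` imported at the top could fill them. A minimal sketch on a hypothetical frame with gaps in two weather columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one missing value in each weather column.
demo = pd.DataFrame({'temp': [0.34, np.nan, 0.20],
                     'hum':  [0.81, 0.70, np.nan]})

# Median imputation is robust to the extreme values handled later in the notebook.
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)

print(filled.isnull().sum().sum())  # 0 missing values remain
```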

Reviewing Distribution of Numerical Variables¶

This code visualizes the numerical variables to detect skewness, outliers, and correlations that inform the next steps of the process.

  • Identifying Numeric Columns: Selects all columns in the dataset that contain numerical data types to prepare them for visualization.
  • Creating Histograms: Generates histograms for each numeric feature to show the distribution of their values (e.g., how frequently values occur within certain ranges). This helps in detecting skewness, peaks, or gaps in the data.
  • Adjusting Layout and Titles: Sets a title for the collection of histograms, arranges the plots to fit nicely within the figure, and displays them.
  • Generating Correlation Heatmap: Creates a heatmap that visualizes the correlation matrix of the dataset, showing the strength and direction of relationships between numeric features.
  • Styling the Heatmap: Applies a color scheme to indicate correlation strength and removes internal annotations to keep the visual clean.
  • Displaying the Heatmap: Presents the finalized heatmap chart to help identify patterns or multicollinearity among features.
In [4]:
# Visualizing distributions of numerical features
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
df[numeric_cols].hist(bins=20, figsize=(15, 10))
plt.suptitle('Distributions of Numeric Features')
plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(15, 15))
sns.heatmap(df[numeric_cols].corr(), annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
[Figure: histograms showing the distributions of the numeric features]
[Figure: correlation heatmap of the numeric features]
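The heatmap shows correlations visually; ranking features by their absolute correlation with the target `cnt` gives the same information numerically (on the real data this would be `df[numeric_cols].corr()['cnt']`). A sketch on synthetic stand-in data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: temp strongly drives cnt, windspeed does not.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 1, 200)
demo = pd.DataFrame({
    'temp': temp,
    'windspeed': rng.uniform(0, 0.5, 200),
    'cnt': 8000 * temp + rng.normal(0, 500, 200),
})

# Absolute correlation with cnt, strongest first (mirrors the cnt row of the heatmap).
ranking = demo.corr()['cnt'].drop('cnt').abs().sort_values(ascending=False)
print(ranking.index.tolist())  # temp ranked first
```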

Data Cleaning¶

The following steps clean the data and export the result to a file for future use.

Boxplots Before Outlier Handling:¶

  • Creates boxplots for all numeric features to visually inspect their value distributions and detect outliers.
  • Organizes plots in a grid layout for clarity.
  • Helps understand which variables have extreme values that may need correction.

Outlier Detection and Treatment (IQR Method):¶

  • For each numeric column, calculates the interquartile range (IQR).
  • Defines lower and upper bounds to identify outliers.
  • Clips values outside these bounds to the nearest acceptable values (winsorization).
  • Reduces the influence of extreme values on the analysis or modeling.

Boxplots After Outlier Handling:¶

  • Recreates the boxplots to show the new distribution of numeric features after outliers have been handled.
  • Useful for visual comparison with the earlier boxplots.

Dataset Summary After Cleaning:¶

Displays structure and data types of the final cleaned dataset.

Saving the Cleaned Dataset:¶

Writes the cleaned DataFrame to a CSV file for future use or modeling.

In [5]:
# Boxplot diagrams before handling outliers
print("\n--- Boxplots Before Handling Outliers ---")
num_plots = len(numeric_cols)
cols = 4
rows = (num_plots + cols - 1) // cols
plt.figure(figsize=(5 * cols, 4 * rows))
for i, col in enumerate(numeric_cols):
    plt.subplot(rows, cols, i + 1)
    sns.boxplot(y=df[col])
    plt.title(col, fontsize=10)
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Handling outliers with the IQR method for numeric features.
# Caveat: applying this to binary flags such as 'holiday' (where Q1 = Q3 = 0)
# clips every 1 to 0; restricting the loop to continuous columns would avoid that.
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    if not outliers.empty:
        print(f"{col}: {len(outliers)} outliers detected")
        # Optional: clip or remove outliers
        df[col] = df[col].clip(lower_bound, upper_bound)

# Boxplot diagrams after handling outliers
print("\n--- Boxplots After Handling Outliers ---")
plt.figure(figsize=(5 * cols, 4 * rows))
for i, col in enumerate(numeric_cols):
    plt.subplot(rows, cols, i + 1)
    sns.boxplot(y=df[col])
    plt.title(col, fontsize=10)
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("\n--- Dataset After Cleaning ---")
print(df.info())

# Save cleaned dataset
cleaned_file_path = 'Bike_Sharing_Cleaned.csv'
df.to_csv(cleaned_file_path, index=False)
print(f"\nCleaned dataset saved to: {cleaned_file_path}")
--- Boxplots Before Handling Outliers ---
[Figure: boxplots of the numeric features before outlier handling]
holiday: 21 outliers detected
hum: 2 outliers detected
windspeed: 13 outliers detected

--- Boxplots After Handling Outliers ---
[Figure: boxplots of the numeric features after outlier handling]
--- Dataset After Cleaning ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dteday      731 non-null    object 
 1   season      731 non-null    int64  
 2   yr          731 non-null    int64  
 3   mnth        731 non-null    int64  
 4   holiday     731 non-null    int64  
 5   weekday     731 non-null    int64  
 6   workingday  731 non-null    int64  
 7   weathersit  731 non-null    int64  
 8   temp        731 non-null    float64
 9   atemp       731 non-null    float64
 10  hum         731 non-null    float64
 11  windspeed   731 non-null    float64
 12  cnt         731 non-null    int64  
dtypes: float64(4), int64(8), object(1)
memory usage: 74.4+ KB
None

Cleaned dataset saved to: Bike_Sharing_Cleaned.csv
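The `Pipeline`, `ColumnTransformer`, `StandardScaler`, and `OneHotEncoder` imported at the top were not used above. A sketch of how they could preprocess the cleaned data for modeling; the split of columns into categorical vs. numeric is an assumption, shown here on a tiny demo frame with the same column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assumed split: calendar/weather codes are categorical, measurements are numeric.
categorical = ['season', 'weathersit', 'weekday']
numeric = ['temp', 'atemp', 'hum', 'windspeed']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])
pipe = Pipeline([('preprocess', preprocess)])

# Two-row demo frame mirroring the dataset's column names.
demo = pd.DataFrame({
    'season': [1, 2], 'weathersit': [1, 2], 'weekday': [0, 6],
    'temp': [0.34, 0.36], 'atemp': [0.36, 0.35],
    'hum': [0.81, 0.70], 'windspeed': [0.16, 0.25],
})
X = pipe.fit_transform(demo)
print(X.shape)  # (2, 10): 4 scaled numeric + 6 one-hot columns
```

In a modeling notebook, an estimator (e.g. a regressor predicting `cnt`) would be appended as a second pipeline step.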