Group 4 - Winter 2025 Predictive Analytics (DAMO-510-6)¶
- Carlos Borda (NF1003721)
- Renato Leiva (NF1000920)
- Mario Zamudio (NF1002499)
Dataset Sources¶
https://toronto.weatherstats.ca/
https://data.torontopolice.on.ca/datasets/bc4c72a793014a55a674984ef175a6f3_0/explore?location=6.523481%2C-39.819624%2C2.63
Dataset Description¶
- OCC_DATE.1 - The date of the occurrence of the traffic collision.
- Year - The year in which the collision occurred.
- NEIGHBOURHOOD_158 - The neighborhood where the traffic collision happened, with its corresponding ID.
- avg_temperature - The average temperature (likely in degrees Celsius) on the day of the collision.
- 0 or 1 snow - A binary indicator (0 or 1) showing whether there was snowfall on the given day.
- avg_wind_speed - The average wind speed (likely in km/h) recorded on that day.
- avg_visibility - The average visibility (likely in meters) on the day of the collision.
- snow - The amount of snowfall (likely in cm) recorded on that day.
- snow_on_ground - The amount of snow accumulated on the ground (likely in cm).
- Number of events - The number of traffic collision events recorded on that day in the specified neighborhood.
- Longitud - The longitude coordinate of the location where the collision occurred (the column name keeps its source spelling).
- Latitude - The latitude coordinate of the location where the collision occurred.
- AVG_Precipitation - The average precipitation (likely in mm) recorded on that day.
- Day_of_the_week - The day of the week (e.g., Monday, Tuesday) when the collision occurred.
Dataset Preparation¶
This script sets up an environment for data analysis, preprocessing, modeling, and evaluation using several regression models, including linear regression, random forest, and gradient boosting (support vector regression is imported below but ultimately not trained). It also incorporates visualization, statistical analysis, and model performance metrics to assess predictions.
Imports and Definitions¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import joblib
import statsmodels.api as sm
import scipy.stats as stats
import sklearn
from sklearn import metrics
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
Load and Initial Configuration¶
- Suppresses unnecessary warnings.
- Loads traffic collision data from a CSV file.
- Inspects the dataset’s structure and data types.
- Cleans column names for easier usage.
- Displays a preview of the dataset.
# Ignore specific warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
# Load the dataset
df = pd.read_csv("Events by Day Traffic Colission Toronto5.csv")
# check the data types of the data
df.info()
# Rename columns to remove spaces and convert to uppercase
df.columns = [col.strip().upper().replace(" ", "_") for col in df.columns]
# Print data sample
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75089 entries, 0 to 75088
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   OCC_DATE.1         75089 non-null  object
 1   Year               75089 non-null  int64
 2   NEIGHBOURHOOD_158  75089 non-null  object
 3   avg_temperature    75089 non-null  float64
 4   0 or 1 snow        75089 non-null  int64
 5   avg_wind_speed     75089 non-null  float64
 6   avg_visibility     75089 non-null  int64
 7   snow               74708 non-null  float64
 8   snow_on_ground     75089 non-null  int64
 9   Number of events   75089 non-null  int64
 10  Longitud           75089 non-null  float64
 11  Latitude           75089 non-null  float64
 12  AVG_Precipitation  74862 non-null  float64
 13  Day_of_the_week    75089 non-null  object
dtypes: float64(6), int64(5), object(3)
memory usage: 8.0+ MB
|  | OCC_DATE.1 | YEAR | NEIGHBOURHOOD_158 | AVG_TEMPERATURE | 0_OR_1_SNOW | AVG_WIND_SPEED | AVG_VISIBILITY | SNOW | SNOW_ON_GROUND | NUMBER_OF_EVENTS | LONGITUD | LATITUDE | AVG_PRECIPITATION | DAY_OF_THE_WEEK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-01 | 2022 | St Lawrence-East Bayfront-The Islands (166) | 0.64 | 1 | 18.0 | 12250 | 3.0 | 3 | 5 | -79.375417 | 43.643198 | 2.4 | Saturday |
| 1 | 2022-01-01 | 2022 | Alderwood (20) | 0.64 | 1 | 18.0 | 12250 | 3.0 | 3 | 1 | -79.548316 | 43.609070 | 2.4 | Saturday |
| 2 | 2022-01-01 | 2022 | South Riverdale (70) | 0.64 | 1 | 18.0 | 12250 | 3.0 | 3 | 2 | -79.337223 | 43.662688 | 2.4 | Saturday |
| 3 | 2022-01-01 | 2022 | Kingsview Village-The Westway (6) | 0.64 | 1 | 18.0 | 12250 | 3.0 | 3 | 1 | -79.552151 | 43.695790 | 2.4 | Saturday |
| 4 | 2022-01-01 | 2022 | Yonge-St.Clair (97) | 0.64 | 1 | 18.0 | 12250 | 3.0 | 3 | 1 | -79.401513 | 43.686701 | 2.4 | Saturday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75084 | 2024-06-30 | 2024 | Englemount-Lawrence (32) | 17.70 | 0 | 26.5 | 15250 | 0.0 | 0 | 1 | -79.434161 | 43.729214 | 0.6 | Sunday |
| 75085 | 2024-06-30 | 2024 | Danforth (66) | 17.70 | 0 | 26.5 | 15250 | 0.0 | 0 | 2 | -79.325398 | 43.682977 | 0.6 | Sunday |
| 75086 | 2024-06-30 | 2024 | Lansing-Westgate (38) | 17.70 | 0 | 26.5 | 15250 | 0.0 | 0 | 1 | -79.432825 | 43.756630 | 0.6 | Sunday |
| 75087 | 2024-06-30 | 2024 | Kingsview Village-The Westway (6) | 17.70 | 0 | 26.5 | 15250 | 0.0 | 0 | 1 | -79.545151 | 43.697425 | 0.6 | Sunday |
| 75088 | 2024-06-30 | 2024 | Yorkdale-Glen Park (31) | 17.70 | 0 | 26.5 | 15250 | 0.0 | 0 | 1 | -79.451306 | 43.723657 | 0.6 | Sunday |

75089 rows × 14 columns
Data Cleaning¶
- Removes unnecessary or redundant columns (e.g., correlated weather variables, date-related columns).
- Identifies missing values before cleaning.
- Replaces missing snowfall (SNOW) values with 0 instead of dropping them.
- Deletes all other rows with missing values to ensure a complete dataset.
- Confirms that missing values are eliminated after cleaning.
# Remove the highly correlated variable AVG_PRECIPITATION and non-feature columns
if 'AVG_PRECIPITATION' in df.columns:
    df.drop(columns=['AVG_PRECIPITATION'], inplace=True)
if 'YEAR' in df.columns:
    df.drop(columns=['YEAR'], inplace=True)
if 'OCC_DATE.1' in df.columns:
    df.drop(columns=['OCC_DATE.1'], inplace=True)
if 'DAY_OF_THE_WEEK' in df.columns:
    df.drop(columns=['DAY_OF_THE_WEEK'], inplace=True)
if 'NEIGHBOURHOOD_158' in df.columns:
    df.drop(columns=['NEIGHBOURHOOD_158'], inplace=True)
# Identify columns with missing values before cleaning
missing_values = df.isnull().sum()
columns_with_na = missing_values[missing_values > 0].index.tolist()
print("Columns with missing values before cleaning:", columns_with_na)
# Replace null values in the SNOW column with 0 (plain assignment avoids
# the chained inplace-fillna pattern, which is deprecated in recent pandas)
df['SNOW'] = df['SNOW'].fillna(0)
# Drop any remaining rows with missing values
df.dropna(inplace=True)
missing_values = df.isnull().sum()
columns_with_na = missing_values[missing_values > 0].index.tolist()
print("Columns with missing values after cleaning:", columns_with_na)
Columns with missing values before cleaning: ['SNOW']
Columns with missing values after cleaning: []
Data Exploration¶
Dataset description¶
- count → Number of non-null values.
- mean → Average of the values.
- std → Standard deviation (spread of the data).
- min → Minimum value.
- 25% (Q1) → First quartile (25% of data is below this value).
- 50% (median, Q2) → Middle value of the dataset.
- 75% (Q3) → Third quartile (75% of data is below this value).
- max → Maximum value.
Use Cases¶
- Identify outliers (e.g., very high max values); a sketch of one such check follows the summary below.
- Check data distribution (e.g., if mean ≠ median, the data may be skewed).
- Verify missing values (if count is lower than expected).
print(df.describe())
       AVG_TEMPERATURE   0_OR_1_SNOW  AVG_WIND_SPEED  AVG_VISIBILITY  \
count     75089.000000  75089.000000    75089.000000    75089.000000
mean          9.396603      0.087217       16.551332    20095.814833
std          10.068853      0.282154        6.417941     4634.173968
min         -17.100000      0.000000        5.000000     3200.000000
25%           1.250000      0.000000       11.500000    15250.000000
50%           9.600000      0.000000       15.500000    21700.000000
75%          18.450000      0.000000       20.500000    24100.000000
max          28.650000      1.000000       47.000000    28150.000000

               SNOW  SNOW_ON_GROUND  NUMBER_OF_EVENTS      LONGITUD  \
count  75089.000000    75089.000000      75089.000000  75089.000000
mean       0.398514        1.420967          1.787159    -79.394091
std        1.934940        4.330962          1.213243      0.100299
min        0.000000        0.000000          1.000000    -79.639247
25%        0.000000        0.000000          1.000000    -79.466036
50%        0.000000        0.000000          1.000000    -79.397319
75%        0.000000        0.000000          2.000000    -79.324371
max       29.000000       32.000000         29.000000    -79.122044

           LATITUDE
count  75089.000000
mean      43.710761
std        0.053506
min       43.586487
25%       43.665873
50%       43.706910
75%       43.757230
max       43.844027
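The outlier check mentioned above can also be scripted. The following is a minimal sketch (not part of the original pipeline) that applies the common 1.5×IQR rule to each numeric column; the 1.5 multiplier is a general heuristic, not a project-specific choice.

# Count potential outliers per numeric column with the 1.5*IQR rule
for col in df.select_dtypes(include=['number']).columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f"{col}: {n_outliers} values outside [{lower:.2f}, {upper:.2f}]")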
Basic Visualizations¶
- Creates histograms for up to 12 numerical columns.
- Uses Seaborn’s histplot to visualize the distribution of each variable.
- Includes a KDE curve for a smoothed density estimate.
- Organizes charts in a 4x3 grid for readability.
- Helps identify skewness, normality, and outliers in the dataset (skewness is quantified in the sketch after the plotting code).
# Visualization of histograms to validate variable distribution
numeric_columns = df.select_dtypes(include=['number']).columns
plt.figure(figsize=(15, 12))
for i, col in enumerate(numeric_columns[:12]):
    plt.subplot(4, 3, i + 1)
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()
Correlation Analysis¶
- Extracts numeric columns and removes missing values.
- Checks if data is available before proceeding.
- Computes the correlation matrix between all numeric variables.
- Generates a heatmap to visualize relationships between variables.
- Helps identify strongly correlated features, which may be useful for feature selection in machine learning (a sketch for extracting such pairs follows the heatmap code below).
# Check correlation between variables and generate a heatmap
numeric_df = df.select_dtypes(include=['number']).dropna()
if not numeric_df.empty:
    plt.figure(figsize=(12, 8))
    sns.heatmap(numeric_df.corr(), annot=True, fmt='.2f', cmap='coolwarm', linewidths=0.5)
    plt.title("Feature Correlation Heatmap")
    plt.show()
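To turn the heatmap into a concrete feature-selection step, a sketch like the one below lists the most strongly correlated pairs; the 0.7 cutoff is an illustrative assumption, not a threshold used elsewhere in this analysis.

# List feature pairs with absolute correlation above a chosen threshold
corr = numeric_df.corr().abs()
upper_triangle = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper_triangle.stack()
print(strong_pairs[strong_pairs > 0.7].sort_values(ascending=False))  # 0.7 is illustrative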
QQ Plots¶
- Selects numerical columns for normality testing.
- Generates Q-Q plots to visually check normality.
- Applies the Shapiro-Wilk test for a statistical normality check.
- Prints results indicating whether each column follows a normal distribution.
Use Cases¶
- Identifying normal vs. non-normal data (important for choosing statistical tests or ML models).
- Detecting skewness and outliers.
- Preprocessing decisions (e.g., applying log transformations if data is non-normal).
# Visualize distributions of numerical features with normality checks
numerical_cols = df.select_dtypes(include=['number']).columns
for col in numerical_cols:
    plt.figure(figsize=(12, 6))
    # Q-Q plot (drawn in the right half of the figure)
    plt.subplot(1, 2, 2)
    stats.probplot(df[col], dist="norm", plot=plt)
    plt.title(f'Q-Q Plot of {col}')
    plt.tight_layout()
    plt.show()

    # Shapiro-Wilk test for normality
    shapiro_test = stats.shapiro(df[col].dropna())  # Drop NA values for the test
    print(f"Shapiro-Wilk Test for {col}:")
    print(f"Statistic: {shapiro_test.statistic:.3f}, p-value: {shapiro_test.pvalue:.3f}")
    if shapiro_test.pvalue > 0.05:
        print(f"The data for '{col}' likely comes from a normal distribution.")
    else:
        print(f"The data for '{col}' likely does not come from a normal distribution.")
    print("---")
Shapiro-Wilk Test for AVG_TEMPERATURE:
Statistic: 0.968, p-value: 0.000
The data for 'AVG_TEMPERATURE' likely does not come from a normal distribution.
---
Shapiro-Wilk Test for 0_OR_1_SNOW:
Statistic: 0.316, p-value: 0.000
The data for '0_OR_1_SNOW' likely does not come from a normal distribution.
---
Shapiro-Wilk Test for AVG_WIND_SPEED:
Statistic: 0.956, p-value: 0.000
The data for 'AVG_WIND_SPEED' likely does not come from a normal distribution.
---
Shapiro-Wilk Test for AVG_VISIBILITY:
Statistic: 0.793, p-value: 0.000
The data for 'AVG_VISIBILITY' likely does not come from a normal distribution.
---
Shapiro-Wilk Test for SNOW:
Statistic: 0.212, p-value: 0.000
The data for 'SNOW' likely does not come from a normal distribution.
---
Shapiro-Wilk Test for SNOW_ON_GROUND:
Statistic: 0.378, p-value: 0.000
The data for 'SNOW_ON_GROUND' likely does not come from a normal distribution.
---
Shapiro-Wilk Test for NUMBER_OF_EVENTS:
Statistic: 0.674, p-value: 0.000
The data for 'NUMBER_OF_EVENTS' likely does not come from a normal distribution.
---
Shapiro-Wilk Test for LONGITUD:
Statistic: 0.990, p-value: 0.000
The data for 'LONGITUD' likely does not come from a normal distribution.
---
Shapiro-Wilk Test for LATITUDE:
Statistic: 0.974, p-value: 0.000
The data for 'LATITUDE' likely does not come from a normal distribution.
---
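One caveat on the tests above: SciPy's documentation notes that for samples larger than about 5,000 observations the Shapiro-Wilk p-value may not be accurate, and this dataset has roughly 75,000 rows. A sketch of a cross-check with D'Agostino-Pearson's normaltest, which is better suited to large samples:

# Cross-check normality with D'Agostino-Pearson's test (valid at large N)
for col in numerical_cols:
    stat, p_value = stats.normaltest(df[col].dropna())
    print(f"normaltest for {col}: statistic={stat:.3f}, p-value={p_value:.3f}")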
Regression Analysis¶
- Defines independent (X) and dependent (y) variables.
- Ensures data alignment by keeping only matching rows.
- Adds a constant for regression.
- Performs multiple linear regression using statsmodels.
- Handles cases where there are not enough data points.
Use Cases¶
- Understanding the impact of multiple factors on NUMBER_OF_EVENTS.
- Identifying significant predictors using p-values.
- Assessing the model's performance via R-squared.
- Optimizing feature selection to improve prediction accuracy.
# Define independent variables X and dependent variable y
y = pd.to_numeric(df['NUMBER_OF_EVENTS'], errors='coerce')
X = df.drop(columns=['NUMBER_OF_EVENTS'], errors='ignore')
# Align the rows of X and y
X, y = X.align(y, join='inner', axis=0)
# Multiple regression analysis
X_const = sm.add_constant(X)
if X_const.shape[0] > X_const.shape[1]:
    model = sm.OLS(y, X_const).fit()
    print(model.summary())
else:
    print("Not enough data points for multiple regression analysis.")
                            OLS Regression Results
==============================================================================
Dep. Variable:       NUMBER_OF_EVENTS   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     63.94
Date:                Mon, 10 Mar 2025   Prob (F-statistic):          5.64e-105
Time:                        20:30:06   Log-Likelihood:            -1.2081e+05
No. Observations:               75089   AIC:                         2.416e+05
Df Residuals:                   75080   BIC:                         2.417e+05
Df Model:                           8
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const             -20.2314      6.785     -2.982      0.003     -33.531      -6.932
AVG_TEMPERATURE     0.0030      0.001      5.610      0.000       0.002       0.004
0_OR_1_SNOW         0.0681      0.023      2.941      0.003       0.023       0.113
AVG_WIND_SPEED     -0.0011      0.001     -1.586      0.113      -0.003       0.000
AVG_VISIBILITY  -1.274e-06   1.08e-06     -1.183      0.237   -3.38e-06    8.36e-07
SNOW                0.0337      0.003     10.999      0.000       0.028       0.040
SNOW_ON_GROUND      0.0025      0.001      2.117      0.034       0.000       0.005
LONGITUD            0.2118      0.049      4.292      0.000       0.115       0.309
LATITUDE            0.8883      0.093      9.603      0.000       0.707       1.070
==============================================================================
Omnibus:                    45208.571   Durbin-Watson:                   1.686
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           726301.185
Skew:                           2.613   Prob(JB):                         0.00
Kurtosis:                      17.312   Cond. No.                     3.17e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.17e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
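The notes above flag a large condition number, suggesting multicollinearity. A sketch that quantifies this with variance inflation factors (VIF); the rule of thumb that VIF above roughly 10 signals problematic collinearity is a convention, not a result from this analysis.

# Compute VIF for each predictor in the design matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame({
    "feature": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])]
})
print(vif)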
Model Preparation and Training¶
Splitting the Dataset¶
- Ensures that the model is trained on one part of the data (X_train, y_train)
- Tests the model’s performance on unseen data (X_test, y_test)
- Prevents overfitting, ensuring that the model generalizes well to new data
- Uses a random seed (random_state=42) to make the split consistent across runs
Use Cases¶
- Machine learning model training (e.g., regression, classification).
- Model evaluation using testing data.
- Hyperparameter tuning with validation sets (see the sketch after the split below).
# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
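If a validation set is needed for hyperparameter tuning, it can be carved out of the training split. A sketch, where the 0.25 fraction (yielding roughly a 60/20/20 split overall) is an illustrative choice rather than part of the original pipeline:

# Split the training data again to obtain a validation set
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print(f"train: {len(X_tr)}, validation: {len(X_val)}, test: {len(X_test)}")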
Defining Models to Train and Test¶
- Defines three different machine learning models (Random Forest, Gradient Boosting, Linear Regression).
- Uses optimized hyperparameters for better performance.
- Prepares for model training and evaluation.
- Initializes an empty results dictionary to store model performance metrics.
# Training multiple models with optimized parameters
models = {
    "Random Forest": RandomForestRegressor(n_estimators=200, max_depth=10, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42),
    "Linear Regression": LinearRegression()
}
results = {}
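As one way such hyperparameters could be selected, the sketch below runs a cross-validated grid search over Random Forest settings; the grid values are illustrative assumptions, not the search actually used to pick the parameters above.

# Cross-validated grid search over a small, illustrative parameter grid
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=3, scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)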
Function to Evaluate the Models¶
- Computes regression metrics (MSE, RMSE, R²) to evaluate performance.
- Creates a scatter plot to visualize predicted vs. actual values.
- Discretizes continuous predictions into bins to analyze accuracy.
- Generates a confusion matrix for regression by comparing bins.
- Uses a heatmap to display the confusion matrix.
Use Cases¶
- Evaluating regression models to check their predictive performance.
- Visualizing prediction errors (scatter plot, confusion matrix).
- Comparing models to see which has the lowest error.
def evaluate_model(model, X_test, y_test):
    num_bins = 4
    y_pred = model.predict(X_test)

    # Regression metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    print(f'Mean Squared Error (MSE): {mse:.4f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')
    print(f'R-squared (R2): {r2:.4f}')

    # Scatter plot of predicted vs. actual values
    plt.figure(figsize=(8, 6))
    plt.scatter(y_test, y_pred)
    plt.xlabel("Actual Values")
    plt.ylabel("Predicted Values")
    plt.title("Actual vs. Predicted Values")
    plt.show()

    # Discretize predictions and actual values into quantile bins;
    # duplicates='drop' handles non-unique bin edges
    from sklearn.metrics import confusion_matrix
    y_test_bins = pd.qcut(y_test, q=num_bins, labels=False, duplicates='drop')
    y_pred_bins = pd.qcut(y_pred, q=num_bins, labels=False, duplicates='drop')

    # Create and plot the confusion matrix over the bins
    cm = confusion_matrix(y_test_bins, y_pred_bins)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted Bins")
    plt.ylabel("Actual Bins")
    plt.title("Confusion Matrix for Regression")
    plt.show()
Model Evaluation¶
- Trains multiple models on the training dataset.
- Evaluates each model using regression metrics, scatter plots, and a confusion matrix.
- Saves each trained model as a .pkl file.
- Stores the RMSE for each model to compare their performance.
Use Cases¶
- Comparing different models (Random Forest vs. Gradient Boosting vs. Linear Regression).
- Selecting the best model based on RMSE.
- Saving trained models to use later without retraining.
# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"\n{name} Model Evaluation")
    evaluate_model(model, X_test, y_test)
    joblib.dump(model, f'{name.replace(" ", "_").lower()}_model.pkl')
    results[name] = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))  # Store RMSE
Random Forest Model Evaluation
Mean Squared Error (MSE): 0.9550
Root Mean Squared Error (RMSE): 0.9772
R-squared (R2): 0.3503

Gradient Boosting Model Evaluation
Mean Squared Error (MSE): 1.1241
Root Mean Squared Error (RMSE): 1.0602
R-squared (R2): 0.2353

Linear Regression Model Evaluation
Mean Squared Error (MSE): 1.4583
Root Mean Squared Error (RMSE): 1.2076
R-squared (R2): 0.0080
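mean_absolute_error is imported at the top but never used; a sketch that adds MAE alongside RMSE, since MAE is less sensitive to occasional large errors:

# Report MAE for each trained model as a complementary metric
for name, model in models.items():
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name} MAE: {mae:.4f}")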
Model Comparison¶
- Creates a bar chart comparing RMSE values for different models.
- Uses different colors for better visualization.
- Labels axes & rotates model names for clarity.
- Helps identify the best model (the one with the lowest RMSE).
Use Cases¶
- Selecting the best model based on RMSE.
- Visualizing model performance differences.
- Determining whether additional tuning is needed.
# Model comparison
plt.figure(figsize=(8, 5))
plt.bar(results.keys(), results.values(), color=['blue', 'green', 'red'])
plt.xlabel("Model")
plt.ylabel("RMSE")
plt.title("Model Performance Comparison (RMSE)")
plt.xticks(rotation=45)
plt.show()
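The winner visible in the chart can also be selected programmatically; a short sketch over the results dictionary:

# Pick the model with the lowest stored RMSE
best_model = min(results, key=results.get)
print(f"Best model by RMSE: {best_model} ({results[best_model]:.4f})")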
Using the Models¶
- Randomly selects 5 test samples from X_test.
- Prints the selected sample data for verification.
- Loads each trained model from its saved .pkl file.
- Generates and prints predictions from each model.
Use Cases¶
- Testing model predictions on unseen data.
- Comparing different model outputs on the same sample.
- Verifying that saved models work correctly after being loaded.
# Test with sample input data
random_indices = np.random.choice(len(X_test), 5, replace=False)
sample_data = X_test.iloc[random_indices]
print("Sample input data:")
print(sample_data)
for name, model in models.items():
    loaded_model = joblib.load(f'{name.replace(" ", "_").lower()}_model.pkl')
    predictions = loaded_model.predict(sample_data)
    print(f"\nPredictions from {name}:", predictions)
Sample input data:
       AVG_TEMPERATURE  0_OR_1_SNOW  AVG_WIND_SPEED  AVG_VISIBILITY  SNOW  \
10817            17.70            0            13.5           24100   0.0
17820            21.70            0            11.0           20100   0.0
9311             10.70            0            16.5           24100   0.0
34908             2.15            0            28.0           24100   0.0
1295            -13.75            0            18.0           24100   0.0

       SNOW_ON_GROUND   LONGITUD   LATITUDE
10817               0 -79.449384  43.676476
17820               0 -79.401234  43.660649
9311                0 -79.297861  43.776520
34908               1 -79.560010  43.646340
1295               25 -79.459948  43.674131

Predictions from Random Forest: [1.61907005 1.6215865  1.67272156 1.13491531 1.6276811 ]

Predictions from Gradient Boosting: [1.56582782 1.59356405 2.44570647 1.27388694 1.64345978]

Predictions from Linear Regression: [1.74406004 1.76640872 1.84151488 1.62656155 1.68973751]