Assignment 4: Creating Reports and Dashboards for Predictive and Prescriptive Analysis¶
Data Exploration and Preparation¶
Part 1: Predictive Analysis with Housing Prices Dataset¶
Preparation¶
Mario Zamudio (NF1002499)¶
University of Niagara Falls Canada
Master of Data Analytics
Data Warehousing/Visualization (CPSC-510-7)
Winter 2025
Ahmed Eltahawi¶
March 25, 2025
Loading Libraries to use¶
In this step, we import all the libraries to use in the process.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading Dataset and Basic information¶
In this section, the data is loaded and some preprocessing is done after to review the content of the columns in the next steps.
- Display Configuration: Adjusts the settings to ensure that all columns of the dataset are visible when printed, which is especially helpful for wide datasets.
- Plot Style Setup: Sets a clean and readable visual style for all plots using the Seaborn library.
- Data Loading: Reads a CSV file named Housing Prices Dataset.csv and loads it into a DataFrame for analysis.
- Initial Data Cleaning: Removes any rows that contain missing (NA) values to ensure the initial analysis is performed on complete data.
- Shape Inspection: Displays the number of rows and columns in the dataset to understand its size and structure.
- Column Removal: Drops the first two columns from the dataset, likely because they contain identifiers or metadata that are not useful for analysis.
- Data Type Enforcement: Converts the SalePrice column to a 64-bit integer to ensure consistent numerical formatting and compatibility with later analysis or modeling steps.
- Data Summary: Prints structural information about the dataset, including column names, data types, and non-null counts, helping to confirm data integrity.
- Preview: Shows the first five rows of the dataset to get an overview of the data’s contents and layout.
- Descriptive Statistics: Displays summary statistics (like mean, median, min, max, etc.) for both numerical and categorical variables, rounded to one decimal for better readability.
# Display settings
# Ensures that all columns in the DataFrame will be displayed when printed,
# which is useful for datasets with many columns.
pd.set_option('display.max_columns', None)
# Sets the visual style for seaborn plots to 'whitegrid' for better readability.
sns.set(style="whitegrid")
# Load the dataset
# The dataset is loaded from a CSV file named "Housing Prices Dataset.csv"
# into a Pandas DataFrame called 'df'.
df = pd.read_csv("Housing Prices Dataset.csv")
# 1. Initial EDA (Exploratory Data Analysis)
# Remove rows with any missing values (NA) to ensure clean initial analysis.
# This helps avoid issues in summary statistics or visualizations due to incomplete data.
df = df.dropna()
# Print the shape of the dataset (rows, columns) to understand its size.
print("\n--- Shape of Dataset ---")
print(df.shape)
# Remove the first two columns from the DataFrame.
# These might be identifiers or metadata not needed for analysis.
df = df.iloc[:, 2:]
# Force the 'SalePrice' column to be of type int64 (integer).
# This ensures consistent numeric formatting and compatibility with modeling tasks.
df['SalePrice'] = df['SalePrice'].astype('int64')
# Display basic information about the dataset.
# This includes column names, non-null counts, and data types.
# It's useful for checking the integrity and structure of the data.
print("\n--- Dataset Information ---")
df.info()
# Display the first five rows of the dataset.
# Helps in gaining a quick look at the structure and sample values.
print("\n--- First five rows ---")
print(df.head())
# Summary statistics
# Provides descriptive statistics (count, mean, std, min, max, etc.)
# for both numerical and categorical columns.
# Rounded to 1 decimal for better readability.
print("\n--- Summary Statistics ---")
print(df.describe(include='all').round(1))
--- Shape of Dataset --- (1460, 52) --- Dataset Information --- <class 'pandas.core.frame.DataFrame'> Index: 1460 entries, 0 to 2889 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LotFrontage 1460 non-null int64 1 LotArea 1460 non-null int64 2 OverallQual 1460 non-null int64 3 OverallCond 1460 non-null int64 4 YearBuilt 1460 non-null int64 5 YearRemodAdd 1460 non-null int64 6 MasVnrArea 1460 non-null int64 7 ExterQual 1460 non-null int64 8 ExterCond 1460 non-null int64 9 BsmtQual 1460 non-null int64 10 BsmtCond 1460 non-null int64 11 BsmtExposure 1460 non-null int64 12 BsmtFinType1 1460 non-null int64 13 BsmtFinSF1 1460 non-null int64 14 BsmtFinType2 1460 non-null int64 15 BsmtFinSF2 1460 non-null int64 16 BsmtUnfSF 1460 non-null int64 17 TotalBsmtSF 1460 non-null int64 18 HeatingQC 1460 non-null int64 19 1stFlrSF 1460 non-null int64 20 2ndFlrSF 1460 non-null int64 21 LowQualFinSF 1460 non-null int64 22 GrLivArea 1460 non-null int64 23 BsmtFullBath 1460 non-null int64 24 BsmtHalfBath 1460 non-null int64 25 FullBath 1460 non-null int64 26 HalfBath 1460 non-null int64 27 BedroomAbvGr 1460 non-null int64 28 KitchenAbvGr 1460 non-null int64 29 KitchenQual 1460 non-null int64 30 TotRmsAbvGrd 1460 non-null int64 31 Fireplaces 1460 non-null int64 32 FireplaceQu 1460 non-null int64 33 GarageYrBlt 1460 non-null int64 34 GarageFinish 1460 non-null int64 35 GarageCars 1460 non-null int64 36 GarageArea 1460 non-null int64 37 GarageQual 1460 non-null int64 38 GarageCond 1460 non-null int64 39 PavedDrive 1460 non-null int64 40 WoodDeckSF 1460 non-null int64 41 OpenPorchSF 1460 non-null int64 42 EnclosedPorch 1460 non-null int64 43 3SsnPorch 1460 non-null int64 44 ScreenPorch 1460 non-null int64 45 PoolArea 1460 non-null int64 46 Fence 1460 non-null int64 47 MiscVal 1460 non-null int64 48 YrSold 1460 non-null int64 49 SalePrice 1460 non-null int64 dtypes: int64(50) memory usage: 581.7 KB --- First five rows --- LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd \ 0 21 1596 4 5 1973 1973 1 21 1890 4 7 1972 1972 2 21 1526 4 8 1970 2002 3 21 1491 4 6 1972 1972 4 21 1936 4 6 1970 1970 MasVnrArea ExterQual ExterCond BsmtQual BsmtCond BsmtExposure \ 0 0 2 2 4 3 4 1 0 2 3 4 3 3 2 0 2 3 4 3 3 3 0 2 2 4 3 3 4 0 2 2 4 3 3 BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF \ 0 6 462 1 0 0 462 1 5 495 1 0 135 630 2 6 515 1 0 115 630 3 2 150 6 480 0 630 4 4 131 6 499 0 630 HeatingQC 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath \ 0 2 526 462 0 988 1 1 3 630 0 0 630 1 2 2 630 0 0 630 1 3 4 630 0 0 630 1 4 3 630 0 0 630 1 BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual \ 0 0 1 0 2 1 2 1 0 1 0 1 1 2 2 0 1 0 1 1 3 3 0 1 0 1 1 2 4 0 1 0 1 1 2 TotRmsAbvGrd Fireplaces FireplaceQu GarageYrBlt GarageFinish \ 0 5 0 0 1973 1 1 3 0 0 0 0 2 3 0 0 1970 1 3 3 0 0 0 0 4 3 0 0 0 0 GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF \ 0 1 297 3 3 3 120 1 0 0 0 0 3 88 2 1 286 3 3 3 0 3 0 0 0 0 3 96 4 0 0 0 0 3 0 OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea Fence \ 0 101 0 0 0 0 2 1 0 0 0 0 0 0 2 0 0 0 0 0 0 3 24 0 0 0 0 0 4 0 0 0 0 0 3 MiscVal YrSold SalePrice 0 0 2009 91000 1 0 2008 81000 2 0 2009 86000 3 0 2010 75500 4 0 2007 84500 --- Summary Statistics --- LotFrontage LotArea OverallQual OverallCond YearBuilt \ count 1460.0 1460.0 1460.0 1460.0 1460.0 mean 57.6 10516.8 6.1 5.6 1971.3 std 34.7 9981.3 1.4 1.1 30.2 min 0.0 1300.0 1.0 1.0 1872.0 25% 42.0 7553.5 5.0 5.0 1954.0 50% 63.0 9478.5 6.0 5.0 1973.0 75% 79.0 11601.5 7.0 6.0 2000.0 max 313.0 215245.0 10.0 9.0 2010.0 YearRemodAdd MasVnrArea ExterQual ExterCond BsmtQual BsmtCond \ count 1460.0 1460.0 1460.0 1460.0 1460.0 1460.0 mean 1978.5 103.1 2.4 2.1 3.5 2.9 std 25.4 180.7 0.6 0.4 0.9 0.6 min 1900.0 0.0 1.0 0.0 0.0 0.0 25% 1959.0 0.0 2.0 2.0 3.0 3.0 50% 1980.0 0.0 2.0 2.0 4.0 3.0 75% 2003.0 164.2 3.0 2.0 4.0 3.0 max 2010.0 1600.0 4.0 4.0 5.0 4.0 BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 \ count 1460.0 1460.0 1460.0 1460.0 1460.0 mean 1.6 3.5 443.6 1.2 46.5 std 1.1 2.1 456.1 0.9 161.3 min 0.0 0.0 0.0 0.0 0.0 25% 1.0 1.0 0.0 1.0 0.0 50% 1.0 4.0 383.5 1.0 0.0 75% 2.0 6.0 712.2 1.0 0.0 max 4.0 6.0 5644.0 6.0 1474.0 BsmtUnfSF TotalBsmtSF HeatingQC 1stFlrSF 2ndFlrSF LowQualFinSF \ count 1460.0 1460.0 1460.0 1460.0 1460.0 1460.0 mean 567.2 1057.4 3.1 1162.6 347.0 5.8 std 441.9 438.7 1.0 386.6 436.5 48.6 min 0.0 0.0 0.0 334.0 0.0 0.0 25% 223.0 795.8 2.0 882.0 0.0 0.0 50% 477.5 991.5 4.0 1087.0 0.0 0.0 75% 808.0 1298.2 4.0 1391.2 728.0 0.0 max 2336.0 6110.0 4.0 4692.0 2065.0 572.0 GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath \ count 1460.0 1460.0 1460.0 1460.0 1460.0 mean 1515.5 0.4 0.1 1.6 0.4 std 525.5 0.5 0.2 0.6 0.5 min 334.0 0.0 0.0 0.0 0.0 25% 1129.5 0.0 0.0 1.0 0.0 50% 1464.0 0.0 0.0 2.0 0.0 75% 1776.8 1.0 0.0 2.0 1.0 max 5642.0 3.0 2.0 3.0 2.0 BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Fireplaces \ count 1460.0 1460.0 1460.0 1460.0 1460.0 mean 2.9 1.0 2.5 6.5 0.6 std 0.8 0.2 0.7 1.6 0.6 min 0.0 0.0 1.0 2.0 0.0 25% 2.0 1.0 2.0 5.0 0.0 50% 3.0 1.0 2.0 6.0 1.0 75% 3.0 1.0 3.0 7.0 1.0 max 8.0 3.0 4.0 14.0 3.0 FireplaceQu GarageYrBlt GarageFinish GarageCars GarageArea \ count 1460.0 1460.0 1460.0 1460.0 1460.0 mean 1.8 1868.7 1.7 1.8 473.0 std 1.8 453.7 0.9 0.7 213.8 min 0.0 0.0 0.0 0.0 0.0 25% 0.0 1958.0 1.0 1.0 334.5 50% 2.0 1977.0 2.0 2.0 480.0 75% 4.0 2001.0 2.0 2.0 576.0 max 5.0 2010.0 3.0 4.0 1418.0 GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF \ count 1460.0 1460.0 1460.0 1460.0 1460.0 mean 2.8 2.8 2.9 94.2 46.7 std 0.7 0.7 0.5 125.3 66.3 min 0.0 0.0 1.0 0.0 0.0 25% 3.0 3.0 3.0 0.0 0.0 50% 3.0 3.0 3.0 0.0 25.0 75% 3.0 3.0 3.0 168.0 68.0 max 5.0 5.0 3.0 857.0 547.0 EnclosedPorch 3SsnPorch ScreenPorch PoolArea Fence MiscVal \ count 1460.0 1460.0 1460.0 1460.0 1460.0 1460.0 mean 22.0 3.4 15.1 2.8 0.6 43.5 std 61.1 29.3 55.8 40.2 1.2 496.1 min 0.0 0.0 0.0 0.0 0.0 0.0 25% 0.0 0.0 0.0 0.0 0.0 0.0 50% 0.0 0.0 0.0 0.0 0.0 0.0 75% 0.0 0.0 0.0 0.0 0.0 0.0 max 552.0 508.0 480.0 738.0 4.0 15500.0 YrSold SalePrice count 1460.0 1460.0 mean 2007.8 180921.2 std 1.3 79442.5 min 2006.0 34900.0 25% 2007.0 129975.0 50% 2008.0 163000.0 75% 2009.0 214000.0 max 2010.0 755000.0
Checking Missing Values¶
Identifies if there are any missing values in the dataset, which is crucial for data cleaning, after the first check of the data.
# Check for missing values
# Identifies if there are any missing values in the dataset, which is crucial for data cleaning.
print("\nMissing Values:")
print(df.isnull().sum())
Missing Values: LotFrontage 0 LotArea 0 OverallQual 0 OverallCond 0 YearBuilt 0 YearRemodAdd 0 MasVnrArea 0 ExterQual 0 ExterCond 0 BsmtQual 0 BsmtCond 0 BsmtExposure 0 BsmtFinType1 0 BsmtFinSF1 0 BsmtFinType2 0 BsmtFinSF2 0 BsmtUnfSF 0 TotalBsmtSF 0 HeatingQC 0 1stFlrSF 0 2ndFlrSF 0 LowQualFinSF 0 GrLivArea 0 BsmtFullBath 0 BsmtHalfBath 0 FullBath 0 HalfBath 0 BedroomAbvGr 0 KitchenAbvGr 0 KitchenQual 0 TotRmsAbvGrd 0 Fireplaces 0 FireplaceQu 0 GarageYrBlt 0 GarageFinish 0 GarageCars 0 GarageArea 0 GarageQual 0 GarageCond 0 PavedDrive 0 WoodDeckSF 0 OpenPorchSF 0 EnclosedPorch 0 3SsnPorch 0 ScreenPorch 0 PoolArea 0 Fence 0 MiscVal 0 YrSold 0 SalePrice 0 dtype: int64
Reviewing Distribution of Numerical Variables¶
In this piece of code, we check the basic information to detect trends, outliers and correlations to be used in the next steps of the process.
- Identifying Numeric Columns: Selects all columns in the dataset that contain numerical data types to prepare them for visualization.
- Creating Histograms: Generates histograms for each numeric feature to show the distribution of their values (e.g., how frequently values occur within certain ranges). This helps in detecting skewness, peaks, or gaps in the data.
- Adjusting Layout and Titles: Sets a title for the collection of histograms, arranges the plots to fit nicely within the figure, and displays them.
- Generating Correlation Heatmap: Creates a heatmap that visualizes the correlation matrix of the dataset, showing the strength and direction of relationships between numeric features.
- Styling the Heatmap: Applies a color scheme to indicate correlation strength and removes internal annotations to keep the visual clean.
- Displaying the Heatmap: Presents the finalized heatmap chart to help identify patterns or multicollinearity among features.
# Visualizing distributions of numerical features
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
df[numeric_cols].hist(bins=20, figsize=(15, 10))
plt.suptitle('Distributions of Numeric Features')
plt.tight_layout()
plt.show()
# Correlation heatmap
plt.figure(figsize=(15, 15))
sns.heatmap(df.corr(), annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Data Cleaning¶
The following steps involve coding for data cleaning and exporting it to a file for future use.
Handling Missing Values:¶
- Identifies columns that still contain missing values.
- For categorical columns, it fills missing values with the most frequent category (mode).
- For numerical columns, it fills missing values with the median.
- This ensures the dataset is complete before further analysis.
Boxplots Before Outlier Handling:¶
- Creates boxplots for all numeric features to visually inspect their value distributions and detect outliers.
- Organizes plots in a grid layout for clarity.
- Helps understand which variables have extreme values that may need correction.
Outlier Detection and Treatment (IQR Method):¶
- For each numeric column, calculates the interquartile range (IQR).
- Defines lower and upper bounds to identify outliers.
- Clips values outside these bounds to the nearest acceptable values (winsorization).
- Reduces the influence of extreme values on the analysis or modeling.
Ensuring Integer Data Types:¶
- After clipping, it checks if the original data type of a column was integer.
- If so, it rounds the values and converts them back to integer format to maintain data consistency.
Boxplots After Outlier Handling:¶
- Recreates the boxplots to show the new distribution of numeric features after outliers have been handled.
- Useful for visual comparison with the earlier boxplots.
Dataset Summary After Cleaning:¶
Displays structure and data types of the final cleaned dataset.
Saving the Cleaned Dataset:¶
Writes the cleaned DataFrame to a CSV file for future use or modeling.
# Handle missing values (replace with zero if NA still exists after imputation)
missing_cols = df.columns[df.isnull().any()].tolist()
for col in missing_cols:
if df[col].dtype == 'object':
df[col].fillna(df[col].mode()[0], inplace=True)
else:
df[col].fillna(df[col].median(), inplace=True)
# Boxplot diagrams before handling outliers
print("\n--- Boxplots Before Handling Outliers ---")
num_plots = len(numeric_cols)
cols = 4
rows = (num_plots + cols - 1) // cols
plt.figure(figsize=(5 * cols, 4 * rows))
for i, col in enumerate(numeric_cols):
plt.subplot(rows, cols, i + 1)
sns.boxplot(y=df[col])
plt.title(col, fontsize=10)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Handling outliers (optional: use IQR method for numeric features)
for col in numeric_cols:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
if not outliers.empty:
print(f"{col}: {len(outliers)} outliers detected")
# Optional: clip or remove outliers
df[col] = df[col].clip(lower_bound, upper_bound)
# Ensure all numeric columns are integers if they originally were
for col in numeric_cols:
if pd.api.types.is_integer_dtype(df[col].dropna().astype(int)):
df[col] = df[col].round().astype(int)
# Boxplot diagrams after handling outliers
print("\n--- Boxplots After Handling Outliers ---")
plt.figure(figsize=(5 * cols, 4 * rows))
for i, col in enumerate(numeric_cols):
plt.subplot(rows, cols, i + 1)
sns.boxplot(y=df[col])
plt.title(col, fontsize=10)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
print("\n--- Dataset After Cleaning ---")
print(df.info())
# Save cleaned dataset
cleaned_file_path = 'Housing_Prices_Cleaned.csv'
df.to_csv(cleaned_file_path, index=False)
print(f"\nCleaned dataset saved to: {cleaned_file_path}")
--- Boxplots Before Handling Outliers ---
LotFrontage: 16 outliers detected LotArea: 69 outliers detected OverallQual: 2 outliers detected OverallCond: 125 outliers detected YearBuilt: 7 outliers detected MasVnrArea: 98 outliers detected ExterCond: 178 outliers detected BsmtQual: 37 outliers detected BsmtCond: 149 outliers detected BsmtExposure: 134 outliers detected BsmtFinSF1: 7 outliers detected BsmtFinType2: 203 outliers detected BsmtFinSF2: 167 outliers detected BsmtUnfSF: 29 outliers detected TotalBsmtSF: 61 outliers detected 1stFlrSF: 20 outliers detected 2ndFlrSF: 2 outliers detected LowQualFinSF: 26 outliers detected GrLivArea: 31 outliers detected BsmtFullBath: 1 outliers detected BsmtHalfBath: 82 outliers detected BedroomAbvGr: 35 outliers detected KitchenAbvGr: 68 outliers detected TotRmsAbvGrd: 30 outliers detected Fireplaces: 5 outliers detected GarageYrBlt: 81 outliers detected GarageCars: 5 outliers detected GarageArea: 21 outliers detected GarageQual: 149 outliers detected GarageCond: 134 outliers detected PavedDrive: 120 outliers detected WoodDeckSF: 32 outliers detected OpenPorchSF: 77 outliers detected EnclosedPorch: 208 outliers detected 3SsnPorch: 24 outliers detected ScreenPorch: 116 outliers detected PoolArea: 7 outliers detected Fence: 281 outliers detected MiscVal: 52 outliers detected SalePrice: 61 outliers detected --- Boxplots After Handling Outliers ---
--- Dataset After Cleaning --- <class 'pandas.core.frame.DataFrame'> Index: 1460 entries, 0 to 2889 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LotFrontage 1460 non-null int32 1 LotArea 1460 non-null int32 2 OverallQual 1460 non-null int32 3 OverallCond 1460 non-null int32 4 YearBuilt 1460 non-null int32 5 YearRemodAdd 1460 non-null int32 6 MasVnrArea 1460 non-null int32 7 ExterQual 1460 non-null int32 8 ExterCond 1460 non-null int32 9 BsmtQual 1460 non-null int32 10 BsmtCond 1460 non-null int32 11 BsmtExposure 1460 non-null int32 12 BsmtFinType1 1460 non-null int32 13 BsmtFinSF1 1460 non-null int32 14 BsmtFinType2 1460 non-null int32 15 BsmtFinSF2 1460 non-null int32 16 BsmtUnfSF 1460 non-null int32 17 TotalBsmtSF 1460 non-null int32 18 HeatingQC 1460 non-null int32 19 1stFlrSF 1460 non-null int32 20 2ndFlrSF 1460 non-null int32 21 LowQualFinSF 1460 non-null int32 22 GrLivArea 1460 non-null int32 23 BsmtFullBath 1460 non-null int32 24 BsmtHalfBath 1460 non-null int32 25 FullBath 1460 non-null int32 26 HalfBath 1460 non-null int32 27 BedroomAbvGr 1460 non-null int32 28 KitchenAbvGr 1460 non-null int32 29 KitchenQual 1460 non-null int32 30 TotRmsAbvGrd 1460 non-null int32 31 Fireplaces 1460 non-null int32 32 FireplaceQu 1460 non-null int32 33 GarageYrBlt 1460 non-null int32 34 GarageFinish 1460 non-null int32 35 GarageCars 1460 non-null int32 36 GarageArea 1460 non-null int32 37 GarageQual 1460 non-null int32 38 GarageCond 1460 non-null int32 39 PavedDrive 1460 non-null int32 40 WoodDeckSF 1460 non-null int32 41 OpenPorchSF 1460 non-null int32 42 EnclosedPorch 1460 non-null int32 43 3SsnPorch 1460 non-null int32 44 ScreenPorch 1460 non-null int32 45 PoolArea 1460 non-null int32 46 Fence 1460 non-null int32 47 MiscVal 1460 non-null int32 48 YrSold 1460 non-null int32 49 SalePrice 1460 non-null int32 dtypes: int32(50) memory usage: 296.6 KB None Cleaned dataset saved to: Housing_Prices_Cleaned.csv