!pip install folium
!pip install branca

Requirement already satisfied: folium in c:\users\mzamu\anaconda3\lib\site-packages (0.20.0)
Requirement already satisfied: branca>=0.6.0 in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (0.8.1)
Requirement already satisfied: jinja2>=2.9 in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (3.1.4)
Requirement already satisfied: numpy in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (1.26.4)
Requirement already satisfied: requests in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (2.32.3)
Requirement already satisfied: xyzservices in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (2022.9.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\mzamu\anaconda3\lib\site-packages (from jinja2>=2.9->folium) (2.1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (2025.4.26)
Requirement already satisfied: branca in c:\users\mzamu\anaconda3\lib\site-packages (0.8.1)
Requirement already satisfied: jinja2>=3 in c:\users\mzamu\anaconda3\lib\site-packages (from branca) (3.1.4)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\mzamu\anaconda3\lib\site-packages (from jinja2>=3->branca) (2.1.3)

# Data manipulation libraries
import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

# Machine learning libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, 
    classification_report, 
    confusion_matrix, 
    f1_score,
    precision_score, 
    recall_score, 
    roc_auc_score,
    RocCurveDisplay
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load datasets
crime_df = pd.read_csv('crime_dataset.csv')
weapon_types_df = pd.read_csv('weapon_types.csv')
crime_types_df = pd.read_csv('crime_types.csv')

# Quick preview
print("Crime Dataset:")
display(crime_df.shape)
display(crime_df.head())

print("\nWeapon Types:")
display(weapon_types_df.shape)
display(weapon_types_df.head())

print("\nCrime Types:")
display(crime_types_df.shape)
display(crime_types_df.head())

Crime Dataset:

(203089, 25)

Weapon Types:

(73, 2)

Crime Types:

(133, 2)

# Check missing values in each dataset
print("Missing values in crime_df:\n", crime_df.isnull().sum())
print("\nMissing values in weapon_types_df:\n", weapon_types_df.isnull().sum())
print("\nMissing values in crime_types_df:\n", crime_types_df.isnull().sum())

Missing values in crime_df:
 division_number             0
date_reported               0
date_occurred               0
area                        0
area_name                   0
reporting_district          0
part                        0
crime_code                  0
modus_operandi          28929
victim_age               4062
victim_gender           27729
victim_ethnicity        27731
premise_code                3
premise_description        92
weapon_code            131843
crime_code_1                2
crime_code_2           187239
crime_code_3           202560
crime_code_4           203071
incident_admincode          0
location                    0
cross_street           167414
latitude                    0
longitude                   0
case_solved                 0
dtype: int64

Missing values in weapon_types_df:
 weapon_code           0
weapon_description    0
dtype: int64

Missing values in crime_types_df:
 crime_code           0
crime_description    0
dtype: int64

# Fill missing categorical data with a placeholder
for col in ['victim_gender', 'victim_ethnicity', 'modus_operandi', 
            'premise_description', 'cross_street']:
    crime_df[col] = crime_df[col].fillna('Unknown')

# Fill missing weapon_code data with a 500 this is the value for be categorized based on the weapon types dataset
crime_df[col] = crime_df['weapon_code'].fillna(500)

# For numeric columns 
crime_df = crime_df.dropna(subset=['victim_age', 'latitude', 'longitude'])

# Columns with excessive missing values 
threshold = 0.90  # Drop columns with >90% missing
cols_to_drop = [col for col in crime_df.columns if crime_df[col].isna().mean() > threshold]
crime_df = crime_df.drop(columns=cols_to_drop)
print("Dropped columns due to excessive missing values:", cols_to_drop)

crime_df = crime_df.dropna()

Dropped columns due to excessive missing values: ['crime_code_2', 'crime_code_3', 'crime_code_4']

print("Missing values in crime_df:\n", crime_df.isnull().sum())
crime_df.shape

Missing values in crime_df:
 division_number        0
date_reported          0
date_occurred          0
area                   0
area_name              0
reporting_district     0
part                   0
crime_code             0
modus_operandi         0
victim_age             0
victim_gender          0
victim_ethnicity       0
premise_code           0
premise_description    0
weapon_code            0
crime_code_1           0
incident_admincode     0
location               0
cross_street           0
latitude               0
longitude              0
case_solved            0
dtype: int64

(69823, 22)

# Check for duplicates
print("Duplicates in crime_df:", crime_df.duplicated().sum())
print("Duplicates in weapon_types_df:", weapon_types_df.duplicated().sum())
print("Duplicates in crime_types_df:", crime_types_df.duplicated().sum())

Duplicates in crime_df: 0
Duplicates in weapon_types_df: 0
Duplicates in crime_types_df: 0

# Remove duplicates
crime_df = crime_df.drop_duplicates()
weapon_types_df = weapon_types_df.drop_duplicates()
crime_types_df = crime_types_df.drop_duplicates()

# Check data types

print("Crime Dataset:")
print(crime_df.dtypes)

print("\nWeapon Types:")
print(weapon_types_df.dtypes)

print("\nCrime Types:")
print(crime_types_df.dtypes)

Crime Dataset:
division_number          int64
date_reported           object
date_occurred           object
area                     int64
area_name               object
reporting_district       int64
part                     int64
crime_code               int64
modus_operandi          object
victim_age             float64
victim_gender           object
victim_ethnicity        object
premise_code           float64
premise_description     object
weapon_code            float64
crime_code_1           float64
incident_admincode       int64
location                object
cross_street           float64
latitude               float64
longitude              float64
case_solved             object
dtype: object

Weapon Types:
weapon_code            int64
weapon_description    object
dtype: object

Crime Types:
crime_code            int64
crime_description    object
dtype: object

# Convert date columns
crime_df['date_reported'] = pd.to_datetime(crime_df['date_reported'], errors='coerce')
crime_df['date_occurred'] = pd.to_datetime(crime_df['date_occurred'], errors='coerce')

# Ensure categorical types
crime_df['division_number'] = crime_df['division_number'].astype('category')
crime_df['area'] = crime_df['area'].astype('category')
crime_df['reporting_district'] = crime_df['reporting_district'].astype('category')
crime_df['part'] = crime_df['part'].astype('category')
crime_df['crime_code'] = crime_df['crime_code'].astype('category')
crime_df['premise_code'] = crime_df['premise_code'].astype('category')
crime_df['crime_code_1'] = crime_df['crime_code_1'].astype('category')
crime_df['incident_admincode'] = crime_df['incident_admincode'].astype('category')
crime_df['weapon_code'] = crime_df['weapon_code'].astype('category')
crime_df['area_name'] = crime_df['area_name'].astype('category')
crime_df['case_solved'] = crime_df['case_solved'].astype('category')
crime_df['premise_description'] = crime_df['premise_description'].astype('category')
crime_df['victim_ethnicity'] = crime_df['victim_ethnicity'].astype('category')
crime_df['victim_gender'] = crime_df['victim_gender'].astype('category')

weapon_types_df['weapon_code'] = weapon_types_df['weapon_code'].astype('category')
weapon_types_df['weapon_description'] = weapon_types_df['weapon_description'].astype('category')
crime_types_df['crime_description'] = crime_types_df['crime_description'].astype('category')
crime_types_df['crime_code'] = crime_types_df['crime_code'].astype('category')

# Ensure number types
crime_df['victim_age'] = crime_df['victim_age'].astype(int)
crime_df['latitude'] = pd.to_numeric(crime_df['latitude'], errors='coerce')
crime_df['longitude'] = pd.to_numeric(crime_df['longitude'], errors='coerce')

weapon_types_df = weapon_types_df.set_index('weapon_code')
crime_types_df = crime_types_df.set_index('crime_code')

print("Crime Dataset:")
print(crime_df.dtypes)

print("\nWeapon Types:")
print(weapon_types_df.dtypes)

print("\nCrime Types:")
print(crime_types_df.dtypes)

Crime Dataset:
division_number              category
date_reported          datetime64[ns]
date_occurred          datetime64[ns]
area                         category
area_name                    category
reporting_district           category
part                         category
crime_code                   category
modus_operandi                 object
victim_age                      int32
victim_gender                category
victim_ethnicity             category
premise_code                 category
premise_description          category
weapon_code                  category
crime_code_1                 category
incident_admincode           category
location                       object
cross_street                  float64
latitude                      float64
longitude                     float64
case_solved                  category
dtype: object

Weapon Types:
weapon_description    category
dtype: object

Crime Types:
crime_description    category
dtype: object

# Loop through all categorical columns and show top 10 most frequent values
categorical_cols = crime_df.select_dtypes(include=['category', 'object']).columns

for col in categorical_cols:
    print(f"\nColumn: {col}")
    value_counts = crime_df[col].value_counts(dropna=False)
    top_10 = value_counts.head(10)
    print("Top 10 most frequent values:")
    print(top_10)
    print(f"Total unique values: {crime_df[col].nunique(dropna=True)}")

Column: division_number
Top 10 most frequent values:
division_number
201600896    1
211406652    1
211406712    1
211406711    1
211406710    1
211406707    1
211406704    1
211406673    1
211406645    1
211407280    1
Name: count, dtype: int64
Total unique values: 69823

Column: area
Top 10 most frequent values:
area
12    6259
1     5224
18    4757
6     4674
3     4473
13    4033
2     3904
20    3764
14    3325
5     2971
Name: count, dtype: int64
Total unique values: 21

Column: area_name
Top 10 most frequent values:
area_name
77th Street    6259
Central        5224
Southeast      4757
Hollywood      4674
Southwest      4473
Newton         4033
Rampart        3904
Olympic        3764
Pacific        3325
Harbor         2971
Name: count, dtype: int64
Total unique values: 21

Column: reporting_district
Top 10 most frequent values:
reporting_district
645     520
636     462
646     403
1822    345
162     333
119     317
245     305
1241    290
1268    287
666     286
Name: count, dtype: int64
Total unique values: 1138

Column: part
Top 10 most frequent values:
part
2    37675
1    32148
Name: count, dtype: int64
Total unique values: 2

Column: crime_code
Top 10 most frequent values:
crime_code
624    15410
230    12168
626    10216
210     7064
930     4042
761     3218
236     2912
740     1386
310     1200
220     1093
Name: count, dtype: int64
Total unique values: 102

Column: modus_operandi
Top 10 most frequent values:
modus_operandi
0416         687
0416 1822    300
0329         243
0421         236
0400         235
1822 0416    225
1501         190
0416 0913    186
0400 0416    179
0444         177
Name: count, dtype: int64
Total unique values: 51648

Column: victim_gender
Top 10 most frequent values:
victim_gender
M          31070
F          28735
X           4052
Male        3116
Female      2831
Unknown       11
H              8
Name: count, dtype: int64
Total unique values: 7

Column: victim_ethnicity
Top 10 most frequent values:
victim_ethnicity
H       31804
B       15659
W       11550
O        4658
X        4587
A        1348
K         109
F          41
Rare       18
I          15
Name: count, dtype: int64
Total unique values: 14

Column: premise_code
Top 10 most frequent values:
premise_code
101.0    15431
501.0    12222
502.0    11612
102.0     6924
108.0     4123
203.0     2732
210.0      971
122.0      965
103.0      818
109.0      747
Name: count, dtype: int64
Total unique values: 264

Column: premise_description
Top 10 most frequent values:
premise_description
STREET                                          15431
SINGLE FAMILY DWELLING                          12222
MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)    11612
SIDEWALK                                         6924
PARKING LOT                                      4123
OTHER BUSINESS                                   2732
RESTAURANT/FAST FOOD                              971
VEHICLE, PASSENGER/TRUCK                          965
ALLEY                                             818
PARK/PLAYGROUND                                   747
Name: count, dtype: int64
Total unique values: 263

Column: weapon_code
Top 10 most frequent values:
weapon_code
400.0    37044
500.0     6943
511.0     4963
102.0     4702
109.0     1639
106.0     1581
200.0     1540
207.0     1229
512.0      815
307.0      756
Name: count, dtype: int64
Total unique values: 73

Column: crime_code_1
Top 10 most frequent values:
crime_code_1
624.0    15479
230.0    12170
626.0    10277
210.0     7066
930.0     4023
761.0     3095
236.0     2913
740.0     1418
310.0     1203
220.0     1093
Name: count, dtype: int64
Total unique values: 102

Column: incident_admincode
Top 10 most frequent values:
incident_admincode
0    41187
1    28636
Name: count, dtype: int64
Total unique values: 2

Column: location
Top 10 most frequent values:
location
6TH                                       242
7TH                                       237
800 N  ALAMEDA                      ST    221
7TH                          ST           219
6TH                          ST           209
WESTERN                      AV           190
5TH                                       179
HOLLYWOOD                                 171
BROADWAY                                  167
FIGUEROA                     ST           166
Name: count, dtype: int64
Total unique values: 20541

Column: case_solved
Top 10 most frequent values:
case_solved
Not solved    41275
Solved        28548
Name: count, dtype: int64
Total unique values: 2

crime_df = crime_df.drop(columns=['modus_operandi','division_number','reporting_district', 'cross_street', 'incident_admincode','location'])

crime_df = crime_df.drop_duplicates()

# Normalizing values for 'victim_gender'
gender_map = {
    'M': 'Male',
    'F': 'Female',
    'Male': 'Male',
    'Female': 'Female',
    'X': 'Non-binary',
    'H': 'Homosexual', 
    'Unknown': 'Unknown'
}
crime_df['victim_gender'] = crime_df['victim_gender'].map(gender_map).fillna('Unknown')

print(crime_df['victim_gender'].unique())

['Female' 'Non-binary' 'Male' 'Homosexual' 'Unknown']

# Ethnicity mapping (adjust as needed for your assignment or codebook)
ethnicity_map = {
    'W': 'White',
    'B': 'Black',
    'H': 'Hispanic',
    'A': 'Asian',
    'C': 'Chinese',
    'I': 'Indigenous',
    'K': 'Korean',    # Without a codebook, group as 'Other'
    'F': 'Filipino',
    'V': 'Vietnamese',
    'J': 'Japanese',
    'Z': 'Other',
    'O': 'Other',
    'X': 'Unknown',
    'Unknown': 'Unknown',
    'Rare': 'Other'
}

crime_df['victim_ethnicity'] = crime_df['victim_ethnicity'].map(ethnicity_map).fillna('Unknown')

# Check new unique values
print(crime_df['victim_ethnicity'].unique())

['Black' 'Unknown' 'White' 'Hispanic' 'Other' 'Asian' 'Indigenous'
 'Filipino' 'Korean' 'Chinese' 'Vietnamese' 'Japanese']

crime_df.head(5)

weapon_types_df.head(5)

crime_types_df.head(5)

print(weapon_types_df.columns)
print(crime_types_df.columns)

Index(['weapon_description'], dtype='object')
Index(['crime_description'], dtype='object')

# Merge with weapon types
crime_df = crime_df.merge(weapon_types_df, on='weapon_code', how='left')
# Merge with crime types
crime_df = crime_df.merge(crime_types_df, on='crime_code', how='left')

crime_df['weapon_code'] = crime_df['weapon_code'].astype('category')
crime_df['crime_code'] = crime_df['crime_code'].astype('category')

# Preview merged columns
print(crime_df[['weapon_code', 'weapon_description', 'crime_code', 'crime_description']].head())

  weapon_code                              weapon_description crime_code  \
0       400.0  STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)        210   
1       500.0                     UNKNOWN WEAPON/OTHER WEAPON        745   
2       500.0                     UNKNOWN WEAPON/OTHER WEAPON        310   
3       500.0                     UNKNOWN WEAPON/OTHER WEAPON        210   
4       102.0                                        HAND GUN        210   

                          crime_description  
0                                   ROBBERY  
1  VANDALISM - MISDEAMEANOR ($399 OR UNDER)  
2                                  BURGLARY  
3                                   ROBBERY  
4                                   ROBBERY

# Summary statistics for numeric columns
print("Numeric Summary Statistics:")
display(crime_df.describe())

Numeric Summary Statistics:

def get_iqr_bounds(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return lower, upper

lat_lower, lat_upper = get_iqr_bounds(crime_df['latitude'])
lon_lower, lon_upper = get_iqr_bounds(crime_df['longitude'])
print(f"Latitude bounds: {lat_lower:.5f} to {lat_upper:.5f}")
print(f"Longitude bounds: {lon_lower:.5f} to {lon_upper:.5f}")

Latitude bounds: 33.82665 to 34.28065
Longitude bounds: -118.59510 to -118.07430

outliers = crime_df[
    (crime_df['latitude'] < lat_lower) | (crime_df['latitude'] > lat_upper) |
    (crime_df['longitude'] < lon_lower) | (crime_df['longitude'] > lon_upper)
]

print(f"Total outliers found: {outliers.shape[0]}")

Total outliers found: 5615

crime_df = crime_df[
    (crime_df['latitude'] >= lat_lower) & (crime_df['latitude'] <= lat_upper) &
    (crime_df['longitude'] >= lon_lower) & (crime_df['longitude'] <= lon_upper)
].copy()

print(f"Rows after removing outliers: {crime_df.shape[0]}")

Rows after removing outliers: 63864

# Summary statistics for categorical/object columns
print("Categorical Summary Statistics:")
display(crime_df.describe(include=['category', 'object']))

Categorical Summary Statistics:

# Solve rate
print("Case Solved Rate:")
display(crime_df['case_solved'].value_counts(normalize=True))

# Top 5 most common crime types
print("Top 5 Crime Types:")
display(crime_df['crime_description'].value_counts().head())

# Top 5 areas by crime count
print("Top 5 Areas:")
display(crime_df['area_name'].value_counts().head())

Case Solved Rate:

case_solved
Not solved    0.597379
Solved        0.402621
Name: proportion, dtype: float64

Top 5 Crime Types:

crime_description
BATTERY - SIMPLE ASSAULT                          13980
ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT    11190
INTIMATE PARTNER - SIMPLE ASSAULT                  9122
ROBBERY                                            6661
CRIMINAL THREATS - NO WEAPON DISPLAYED             3655
Name: count, dtype: int64

Top 5 Areas:

area_name
77th Street    6189
Central        5138
Southeast      4691
Hollywood      4582
Southwest      4439
Name: count, dtype: int64

# Victim Age Distribution
plt.figure(figsize=(8, 4))
sns.histplot(crime_df['victim_age'].dropna(), bins=50, kde=True)
plt.title('Distribution of Victim Age')
plt.xlabel('Victim Age')
plt.ylabel('Count')
plt.xlim(0, 100)
plt.xticks(range(0, 101, 5), rotation=45)
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.show()

plt.figure(figsize=(10, 5))
order = crime_df['victim_ethnicity'].value_counts().index

# Get counts and percentages
counts = crime_df['victim_ethnicity'].value_counts()
percentages = counts / counts.sum() * 100

ax = sns.countplot(
    y='victim_ethnicity',
    data=crime_df,
    order=order,
    hue='victim_ethnicity',
    palette='Blues_r',
    legend=False
)
plt.title('Distribution of Victim Ethnicity')
plt.xlabel('Count')
plt.ylabel('Victim Ethnicity')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)
plt.tight_layout()

plt.show()

plt.figure(figsize=(7, 4))
order = crime_df['victim_gender'].value_counts().index

# Get counts and percentages
counts = crime_df['victim_gender'].value_counts()
percentages = counts / counts.sum() * 100

ax = sns.countplot(
    y='victim_gender',
    data=crime_df,
    order=order,
    hue='victim_gender',
    palette='Blues_r',
    legend=False
)
plt.title('Distribution of Victim Gender')
plt.xlabel('Count')
plt.ylabel('Victim Gender')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.tight_layout()
plt.show()

print(crime_df['crime_description'].isnull().sum())
print(crime_df['crime_description'].value_counts().head(10))

0
crime_description
BATTERY - SIMPLE ASSAULT                                   13980
ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT             11190
INTIMATE PARTNER - SIMPLE ASSAULT                           9122
ROBBERY                                                     6661
CRIMINAL THREATS - NO WEAPON DISPLAYED                      3655
BRANDISH WEAPON                                             2925
INTIMATE PARTNER - AGGRAVATED ASSAULT                       2614
VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)     1363
BURGLARY                                                    1149
ATTEMPTED ROBBERY                                           1029
Name: count, dtype: int64

# Calculate and print the top 10 for verification

top_crimes = crime_df['crime_description'].str.strip().value_counts().head(10)

top_crimes_df = top_crimes.reset_index()
top_crimes_df.columns = ['crime_description', 'count']

plt.figure(figsize=(10, 5))
sns.barplot(data=top_crimes_df, y='crime_description', x='count')
plt.title('Top 10 Crime Types')
plt.xlabel('Number of Crimes')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.ylabel('Crime Type')
plt.tight_layout()
plt.show()

# Crimes by Area (Top 10)
plt.figure(figsize=(10, 5))
top_areas = crime_df['area_name'].str.strip().value_counts().head(10)
sns.barplot(y=top_areas.index, x=top_areas.values)
plt.title('Top 10 Areas by Crime Count')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.xlabel('Number of Crimes')
plt.ylabel('Area')
plt.tight_layout()
plt.show()

# Select only numeric columns (drop non-numeric ones)
numeric_cols = crime_df.select_dtypes(include='number')

plt.figure(figsize=(8, 6))
sns.heatmap(numeric_cols.corr(), annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Heatmap for Numeric Features')
plt.show()

# Extract date time values from the date occurred
crime_df['year'] = crime_df['date_occurred'].dt.year
crime_df['month'] = crime_df['date_occurred'].dt.month
crime_df['day_of_week'] = crime_df['date_occurred'].dt.day_name()
crime_df['month_name'] = crime_df['date_occurred'].dt.month_name()
crime_df['hour'] = crime_df['date_occurred'].dt.hour

# Crimes per month (over all years)
monthly_counts = crime_df.groupby(['year', 'month']).size().reset_index(name='crime_count')
monthly_counts['year_month'] = monthly_counts['year'].astype(str) + '-' + monthly_counts['month'].astype(str).str.zfill(2)

plt.figure(figsize=(14, 5))
plt.plot(monthly_counts['year_month'], monthly_counts['crime_count'], marker='o')
plt.title('Crimes per Month')
plt.xlabel('Year-Month')
plt.ylabel('Number of Crimes')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# Count crimes per month (all years)
month_counts = crime_df['month_name'].value_counts().reindex([
    'January','February','March','April','May','June','July','August','September','October','November','December'
])

plt.figure(figsize=(10,5))
sns.barplot(x=month_counts.index, y=month_counts.values)
plt.title('Total Crimes by Month (All Years Combined)')
plt.ylabel('Number of Crimes')
plt.xlabel('Month')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.show()

plt.figure(figsize=(8, 4))
sns.countplot(data=crime_df, x='day_of_week', order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
plt.title('Crime Count by Day of the Week')
plt.ylabel('Number of Crimes')
plt.xlabel('Day of the Week')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.show()

# Count crimes per hour
hour_counts = crime_df['hour'].value_counts().sort_index()

plt.figure(figsize=(10,5))
sns.barplot(x=hour_counts.index, y=hour_counts.values, hue=hour_counts.index, legend=False, palette='viridis')
plt.title('Crime Count by Hour of the Day')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Crimes')
plt.xticks(range(0,24))
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.tight_layout()
plt.show()

# Day of week
weekday_counts = crime_df['day_of_week'].value_counts()
weekday_max = weekday_counts.idxmax()
weekday_min = weekday_counts.idxmin()

# Hour of day
hour_counts = crime_df['hour'].value_counts()
hour_max = hour_counts.idxmax()
hour_min = hour_counts.idxmin()

# Month
month_counts = crime_df['month_name'].value_counts()
month_max = month_counts.idxmax()
month_min = month_counts.idxmin()

summary_table = pd.DataFrame({
    'Time Feature': ['Weekday', 'Hour', 'Month'],
    'Highest Crime': [f"{weekday_max} ({weekday_counts[weekday_max]})",
                      f"{hour_max} ({hour_counts[hour_max]})",
                      f"{month_max} ({month_counts[month_max]})"],
    'Lowest Crime': [f"{weekday_min} ({weekday_counts[weekday_min]})",
                     f"{hour_min} ({hour_counts[hour_min]})",
                     f"{month_min} ({month_counts[month_min]})"]
})

print(summary_table)

  Time Feature   Highest Crime     Lowest Crime
0      Weekday  Sunday (10093)   Tuesday (8304)
1         Hour       20 (3791)         5 (1082)
2        Month     July (6100)  February (4580)

plt.figure(figsize=(8,6))
sns.scatterplot(x='longitude', y='latitude', data=crime_df, alpha=0.1, s=5)
plt.title('Crime Locations (All Data)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

# Calculate median latitude and longitude (good if your data has outliers)
center_lat = crime_df['latitude'].median()
center_lon = crime_df['longitude'].median()

crime_map = folium.Map(location=[center_lat, center_lon], zoom_start=10)

# Plot a sample (for performance)
sample_df = crime_df[['latitude', 'longitude']].dropna().sample(500, random_state=1)

for idx, row in sample_df.iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=3,
        color='red',
        fill=True,
        fill_opacity=0.3
    ).add_to(crime_map)

folium.Marker(
    location=[center_lat, center_lon],
    popup='Hotspot Center',
    icon=folium.Icon(color='blue', icon='info-sign')
).add_to(crime_map)

crime_map

# Center map at the median
center_lat = crime_df['latitude'].median()
center_lon = crime_df['longitude'].median()

crime_map = folium.Map(location=[center_lat, center_lon], zoom_start=10)

# Prepare the data (drop missing and sample for speed)
heat_data = crime_df[['latitude', 'longitude']].dropna().sample(2000, random_state=1).values.tolist()

# Add heatmap layer
HeatMap(heat_data, radius=15, blur=5, min_opacity=0.2).add_to(crime_map)

crime_map

# Bin victim_age
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
age_labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90+']

crime_df['victim_age_group'] = pd.cut(crime_df['victim_age'], bins=age_bins, labels=age_labels, right=False)

def get_top_bottom_solve_rates(crime_df, feature, solved_label='Solved', n=3):
    # Calculate solve rates
    solve_by_feat = (
        crime_df.groupby(feature, observed=True)['case_solved']
        .value_counts(normalize=True)
        .unstack()
        .fillna(0)
    )
    # Sort by solve rate
    sorted_solve = solve_by_feat.sort_values(by=solved_label, ascending=False)
    # Get top n and bottom n (with solve rate > 0)
    top_n = sorted_solve[sorted_solve[solved_label] > 0].head(n)
    bottom_n = sorted_solve[sorted_solve[solved_label] > 0].tail(n)
    return top_n[[solved_label]].reset_index(), bottom_n[[solved_label]].reset_index()

# Features to analyze
features = [
    ('crime_description', 'Crime Type'),
    ('weapon_description', 'Weapon'),
    ('victim_ethnicity', 'Victim Ethnicity'),
    ('victim_gender', 'Victim Gender'),
    ('area_name', 'Area'),
    ('victim_age_group', 'Victim Age Group')
]

# Build summary DataFrame
summary_rows = []
for feat_col, feat_label in features:
    top, bottom = get_top_bottom_solve_rates(crime_df, feat_col)
    # Add top
    for i, row in top.iterrows():
        summary_rows.append({
            'Feature': feat_label,
            'Category': row[feat_col],
            'Solve Rate': round(row['Solved'], 3),
            'Rank': 'Top'
        })
    # Add bottom
    for i, row in bottom.iterrows():
        summary_rows.append({
            'Feature': feat_label,
            'Category': row[feat_col],
            'Solve Rate': round(row['Solved'], 3),
            'Rank': 'Bottom'
        })

summary_table = pd.DataFrame(summary_rows)
# Show top 3 and bottom 3 per feature in a pretty table
summary_table = summary_table.pivot_table(
    index=['Feature', 'Rank'], 
    values=['Category', 'Solve Rate'], 
    aggfunc=lambda x: list(x)
).reset_index()

# Display the summary table
print(summary_table.to_string(index=False))

         Feature   Rank                                                                                        Category            Solve Rate
            Area Bottom                                                                 [Hollywood, Pacific, Southeast]  [0.299, 0.294, 0.27]
            Area    Top                                                             [Mission, West Valley, N Hollywood] [0.566, 0.564, 0.547]
      Crime Type Bottom [THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER), THEFT PLAIN - ATTEMPT, BURGLARY FROM VEHICLE]   [0.106, 0.1, 0.058]
      Crime Type    Top                                                    [LYNCHING, PROWLER, MANSLAUGHTER, NEGLIGENT]       [1.0, 1.0, 1.0]
Victim Age Group Bottom                                                                             [10-19, 60-69, 0-9]  [0.375, 0.372, 0.33]
Victim Age Group    Top                                                                           [80-89, 20-29, 30-39]  [0.454, 0.428, 0.42]
Victim Ethnicity Bottom                                                                  [Vietnamese, Unknown, Chinese] [0.333, 0.283, 0.167]
Victim Ethnicity    Top                                                                    [Japanese, Filipino, Korean] [0.556, 0.545, 0.495]
   Victim Gender Bottom                                                                     [Unknown, Male, Non-binary] [0.455, 0.362, 0.284]
   Victim Gender    Top                                                                   [Homosexual, Female, Unknown]   [0.5, 0.462, 0.455]
          Weapon Bottom                        [UNKNOWN TYPE CUTTING INSTRUMENT, UNKNOWN FIREARM, SEMI-AUTOMATIC RIFLE] [0.174, 0.143, 0.143]
          Weapon    Top                                                                   [KITCHEN KNIFE, SWORD, GLASS] [0.668, 0.649, 0.637]

def get_top10_solve_rate(crime_df, group_col, solved_label='Solved'):
    solve_by_feature = (
        crime_df.groupby(group_col, observed=True)['case_solved']
        .value_counts(normalize=True)
        .unstack()
        .fillna(0)
        .sort_values(by=solved_label, ascending=False)
    )
    top10 = solve_by_feature[solved_label].head(10).reset_index()
    top10.columns = [group_col, 'solve_rate']
    return top10

# Add to features
features = [
    ('crime_description',   'Solve Rate by Crime Type (Top 10)',      'Crime Type'),
    ('weapon_description',  'Solve Rate by Weapon (Top 10)',          'Weapon'),
    ('victim_ethnicity',    'Solve Rate by Victim Ethnicity (Top 10)','Victim Ethnicity'),
    ('victim_gender',       'Solve Rate by Victim Gender (Top 10)',   'Victim Gender'),
    ('area_name',           'Solve Rate by Area (Top 10)',            'Area'),
    ('victim_age_group',    'Solve Rate by Victim Age Group (Top 10)', 'Victim Age Group')
]
dfs = [get_top10_solve_rate(crime_df, f[0]) for f in features]

# Make subplots
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=[f[1] for f in features],
    shared_xaxes=False, shared_yaxes=False,
    horizontal_spacing=0.20
)

for idx, (df, (col, title, ylabel)) in enumerate(zip(dfs, features)):
    row = idx // 2 + 1
    col_num = idx % 2 + 1
    fig.add_trace(
        go.Bar(
            x=df['solve_rate'],
            y=df[col],
            orientation='h',
            marker_color='steelblue',
            text=(df['solve_rate']*100).round(1).astype(str)+'%',
            textposition='auto',
        ),
        row=row, col=col_num
    )
    fig.update_yaxes(title_text=ylabel, row=row, col=col_num)
    fig.update_xaxes(range=[0, 1], title_text='Solve Rate', row=row, col=col_num)

fig.update_layout(
    height=1200, width=1000,
    title_text="Solve Rate by Feature (Top 10)",
    showlegend=False,
    font=dict(size=8),
    margin=dict(l=0, r=0, t=60, b=5),
)
for i in range(1, len(features) + 1):
    fig['layout']['annotations'][i-1]['font'] = dict(size=12)

fig.show()

importances = None

# Prepare data
features_for_model = ['crime_description', 'weapon_description', 'victim_ethnicity',
                      'victim_gender', 'area_name', 'victim_age_group']
df_model = crime_df.dropna(subset=features_for_model + ['case_solved'])

# Encode categorical features numerically
df_model_encoded = df_model.copy()
label_encoders = {}
for col in features_for_model + ['case_solved']:
    le = LabelEncoder()
    df_model_encoded[col] = le.fit_transform(df_model_encoded[col].astype(str))
    label_encoders[col] = le

X = df_model_encoded[features_for_model]
y = df_model_encoded['case_solved']

# Fit model
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_

imp_df = pd.DataFrame({'feature': features_for_model, 'importance': importances}).sort_values('importance', ascending=False)

fig = px.bar(imp_df, x='importance', y='feature', orientation='h',
             title='Feature Importance for Predicting Case Solved',
             text=imp_df['importance'].round(3))
fig.update_traces(textposition='outside')
fig.update_layout(yaxis_title='Feature', xaxis_title='Importance')
fig.show()

numeric_cols = crime_df.select_dtypes(include='number')
corr = numeric_cols.corr()

fig = ff.create_annotated_heatmap(
    z=corr.values,
    x=list(corr.columns),
    y=list(corr.index),
    annotation_text=corr.round(2).values,
    colorscale='Viridis'
)
fig.update_layout(title_text='Correlation Heatmap of Numeric Features', width=800, height=800)
fig.show()

crime_df.dtypes

date_reported          datetime64[ns]
date_occurred          datetime64[ns]
area                         category
area_name                    category
part                         category
crime_code                   category
victim_age                      int32
victim_gender                  object
victim_ethnicity               object
premise_code                 category
premise_description          category
weapon_code                  category
crime_code_1                 category
latitude                      float64
longitude                     float64
case_solved                  category
weapon_description           category
crime_description            category
year                            int32
month                           int32
day_of_week                    object
month_name                     object
hour                            int32
victim_age_group             category
dtype: object

# Weekend indicator
crime_df['is_weekend'] = crime_df['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)

# Season (basic example)
crime_df['season'] = crime_df['month'].map({12:'Winter', 1:'Winter', 2:'Winter',
                                            3:'Spring', 4:'Spring', 5:'Spring',
                                            6:'Summer', 7:'Summer', 8:'Summer',
                                            9:'Fall', 10:'Fall', 11:'Fall'})

# Day of the year
crime_df['day_of_year'] = crime_df['date_occurred'].dt.dayofyear

# Interaction Features Victim Gender/Ethnicity
crime_df['victim_gender_ethnicity'] = (
    crime_df['victim_gender'].astype(str) + '_' + crime_df['victim_ethnicity'].astype(str)
)

# Interaction Features Area/Crime Type
crime_df['area_crime_type'] = (
    crime_df['area_name'].astype(str) + '_' + crime_df['crime_description'].astype(str)
)

# Create a copy for modeling/encoding
crime_model_df = crime_df.copy()

cat_cols = [
    'area', 'area_name', 'part', 'crime_code', 'victim_gender', 'victim_ethnicity',
    'premise_code', 'premise_description', 'weapon_code', 'crime_code_1','day_of_week', 'month_name',
    'case_solved', 'weapon_description', 'crime_description', 'victim_age_group', 'season',
    'victim_gender_ethnicity', 'area_crime_type'
]

label_encoders = {}
for col in cat_cols:
    if col in crime_model_df.columns:
        le = LabelEncoder()
        crime_model_df[col] = le.fit_transform(crime_model_df[col].astype(str))
        label_encoders[col] = le

crime_model_df.dtypes

date_reported              datetime64[ns]
date_occurred              datetime64[ns]
area                                int32
area_name                           int32
part                                int32
crime_code                          int32
victim_age                          int32
victim_gender                       int32
victim_ethnicity                    int32
premise_code                        int32
premise_description                 int32
weapon_code                         int32
crime_code_1                        int32
latitude                          float64
longitude                         float64
case_solved                         int32
weapon_description                  int32
crime_description                   int32
year                                int32
month                               int32
day_of_week                         int32
month_name                          int32
hour                                int32
victim_age_group                    int32
is_weekend                          int32
season                              int32
day_of_year                         int32
victim_gender_ethnicity             int32
area_crime_type                     int32
dtype: object

crime_model_df.head()

numeric_cols = ['victim_age', 'latitude', 'longitude', 'year', 'month', 'hour', 'day_of_year']
scaler = StandardScaler()
crime_model_df[numeric_cols] = scaler.fit_transform(crime_model_df[numeric_cols])

# Drop ID/time columns if not used in modeling
feature_cols = [
    'victim_age', 'latitude', 'longitude', 'hour', 'year', 'month', 'day_of_year',
    'area', 'area_name', 'part', 'crime_code', 'victim_gender', 'victim_ethnicity',
    'premise_code', 'premise_description', 'weapon_code', 'crime_code_1',
    'weapon_description', 'crime_description', 'victim_age_group',
    'is_weekend', 'season', 'victim_gender_ethnicity', 'area_crime_type'
]

X = crime_model_df[feature_cols].copy()
y = crime_model_df['case_solved'].copy()

# First split: Train+Validation vs. Test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)
# Second split: Train vs. Validation
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)
# (Now: 60% train, 20% validation, 20% test)

print(f"Train X (Independent Features): {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")
print(f"Train y (Dependent Variable): {y_train.shape}, Val: {y_val.shape}, Test: {y_test.shape}")

Train X (Independent Features): (38318, 24), Val: (12773, 24), Test: (12773, 24)
Train y (Dependent Variable): (38318,), Val: (12773,), Test: (12773,)

# Main parameters shown: penalty, C, solver, max_iter, random_state
logreg = LogisticRegression(
    penalty='l2',          # Regularization type
    C=1.0,                 # Inverse of regularization strength
    solver='lbfgs',        # Optimization algorithm
    max_iter=10000,         # Max number of iterations
    random_state=42        # Seed
)
logreg.fit(X_train, y_train)

print("Logistic Regression fit parameters:")
print(logreg.get_params())

Logistic Regression fit parameters:
{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 10000, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}

# Main parameters shown: n_estimators, max_depth, min_samples_split, random_state
rf = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=None,        # Max tree depth
    min_samples_split=2,   # Min samples to split an internal node
    random_state=42,       # Seed
    n_jobs=-1              # Use all CPU cores
)
rf.fit(X_train, y_train)

print("Random Forest fit parameters:")
print(rf.get_params())

Random Forest fit parameters:
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 100, 'n_jobs': -1, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}

# Validation
y_val_pred_logreg = logreg.predict(X_val)
y_val_pred_rf     = rf.predict(X_val)

print("LogReg Validation Accuracy:", accuracy_score(y_val, y_val_pred_logreg))
print("RF Validation Accuracy:", accuracy_score(y_val, y_val_pred_rf))

# Final test can be performed after selecting the best model

LogReg Validation Accuracy: 0.6323494871995615
RF Validation Accuracy: 0.7042198387223049

# Predict on validation and test sets (use your models: logreg, rf)
y_val_pred_logreg = logreg.predict(X_val)
y_val_pred_rf     = rf.predict(X_val)
y_test_pred_logreg = logreg.predict(X_test)
y_test_pred_rf     = rf.predict(X_test)

# For AUC, need predicted probabilities (for positive class)
y_val_proba_logreg = logreg.predict_proba(X_val)[:,1]
y_val_proba_rf     = rf.predict_proba(X_val)[:,1]
y_test_proba_logreg = logreg.predict_proba(X_test)[:,1]
y_test_proba_rf     = rf.predict_proba(X_test)[:,1]

def print_metrics(y_true, y_pred, y_proba, model_name="Model"):
    print(f"\n{model_name} Performance:")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_true, y_proba))
    print(classification_report(y_true, y_pred))

print_metrics(y_val, y_val_pred_logreg, y_val_proba_logreg, "Logistic Regression (Val)")
print_metrics(y_val, y_val_pred_rf, y_val_proba_rf, "Random Forest (Val)")

print_metrics(y_test, y_test_pred_logreg, y_test_proba_logreg, "Logistic Regression (Test)")
print_metrics(y_test, y_test_pred_rf, y_test_proba_rf, "Random Forest (Test)")

Logistic Regression (Val) Performance:
Accuracy : 0.6323494871995615
Precision: 0.579849946409432
Recall   : 0.3155745673731285
F1-score : 0.4087131704860237
AUC-ROC  : 0.6499525115128045
              precision    recall  f1-score   support

           0       0.65      0.85      0.73      7630
           1       0.58      0.32      0.41      5143

    accuracy                           0.63     12773
   macro avg       0.61      0.58      0.57     12773
weighted avg       0.62      0.63      0.60     12773


Random Forest (Val) Performance:
Accuracy : 0.7042198387223049
Precision: 0.6690611840475601
Recall   : 0.5251798561151079
F1-score : 0.5884531590413943
AUC-ROC  : 0.7548723162379027
              precision    recall  f1-score   support

           0       0.72      0.82      0.77      7630
           1       0.67      0.53      0.59      5143

    accuracy                           0.70     12773
   macro avg       0.69      0.68      0.68     12773
weighted avg       0.70      0.70      0.70     12773


Logistic Regression (Test) Performance:
Accuracy : 0.6248336334455492
Precision: 0.5600821636425881
Recall   : 0.3181022749368073
F1-score : 0.40575396825396826
AUC-ROC  : 0.6369829558760982
              precision    recall  f1-score   support

           0       0.64      0.83      0.73      7630
           1       0.56      0.32      0.41      5143

    accuracy                           0.62     12773
   macro avg       0.60      0.57      0.57     12773
weighted avg       0.61      0.62      0.60     12773


Random Forest (Test) Performance:
Accuracy : 0.6923980270883896
Precision: 0.650844930417495
Recall   : 0.5092358545595955
F1-score : 0.5713974037307734
AUC-ROC  : 0.7365395813419047
              precision    recall  f1-score   support

           0       0.71      0.82      0.76      7630
           1       0.65      0.51      0.57      5143

    accuracy                           0.69     12773
   macro avg       0.68      0.66      0.67     12773
weighted avg       0.69      0.69      0.68     12773

def plot_confusion(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(4,4))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(title)
    plt.show()

plot_confusion(y_val, y_val_pred_logreg, "LogReg Validation Confusion Matrix")
plot_confusion(y_val, y_val_pred_rf, "Random Forest Validation Confusion Matrix")

RocCurveDisplay.from_estimator(logreg, X_val, y_val)
plt.title("Logistic Regression ROC Curve (Validation)")
plt.show()

RocCurveDisplay.from_estimator(rf, X_val, y_val)
plt.title("Random Forest ROC Curve (Validation)")
plt.show()

print(y.value_counts(normalize=True))

case_solved
0    0.597379
1    0.402621
Name: proportion, dtype: float64

	division_number	date_reported	date_occurred	area	area_name	reporting_district	part	crime_code	modus_operandi	victim_age	...	crime_code_1	crime_code_2	crime_code_3	crime_code_4	incident_admincode	location	cross_street	latitude	longitude	case_solved
0	211414090	2021-06-27	2021-06-20 20:00:00	14	Pacific	1464	1	480	0344	32.0	...	480.0	NaN	NaN	NaN	0	12400 FIELDING	NaN	33.9791	-118.4092	Not solved
1	210504861	2021-01-22	2021-01-21 22:00:00	5	Harbor	515	1	510	NaN	0.0	...	510.0	NaN	NaN	NaN	1	1500 BAY VIEW AV	NaN	33.7929	-118.2710	Solved
2	210104843	2021-01-21	2021-01-21 02:00:00	1	Central	139	1	510	NaN	0.0	...	510.0	NaN	NaN	NaN	1	300 S SANTA FE AV	NaN	34.0420	-118.2326	Solved
3	210115564	2021-08-22	2021-08-22 07:00:00	1	Central	151	1	350	1308 0344 0345 1822	29.0	...	350.0	NaN	NaN	NaN	0	7TH	FIGUEROA	34.0496	-118.2603	Not solved
4	211421187	2021-11-09	2021-11-07 19:00:00	14	Pacific	1465	1	510	NaN	0.0	...	510.0	NaN	NaN	NaN	0	5500 MESMER AV	NaN	33.9869	-118.4022	Not solved

	crime_code	crime_description
0	480	BIKE - STOLEN
1	510	VEHICLE - STOLEN
2	350	THEFT, PERSON
3	440	THEFT PLAIN - PETTY ($950 & UNDER)
4	420	THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)

	crime_description
crime_code
480	BIKE - STOLEN
510	VEHICLE - STOLEN
350	THEFT, PERSON
440	THEFT PLAIN - PETTY ($950 & UNDER)
420	THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)

	date_reported	date_occurred	victim_age	latitude	longitude
count	69479	69479	69479.000000	69479.000000	69479.000000
mean	2021-07-07 08:10:23.428661760	2021-07-05 20:22:01.095870720	34.621022	33.807383	-117.457399
min	2021-01-01 00:00:00	2021-01-01 00:01:00	0.000000	0.000000	-118.667200
25%	2021-04-12 00:00:00	2021-04-10 02:55:00	24.000000	33.996900	-118.399800
50%	2021-07-11 00:00:00	2021-07-09 12:45:00	33.000000	34.049900	-118.304900
75%	2021-10-02 00:00:00	2021-09-30 19:05:00	46.000000	34.110400	-118.269600
max	2021-12-31 00:00:00	2021-12-31 23:30:00	99.000000	34.334300	0.000000
std	NaN	NaN	17.686676	2.932096	10.180443

	weapon_code	weapon_description
0	400	STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)
1	500	UNKNOWN WEAPON/OTHER WEAPON
2	102	HAND GUN
3	201	KNIFE WITH BLADE OVER 6 INCHES IN LENGTH
4	107	OTHER FIREARM

	date_reported	date_occurred	area	area_name	part	crime_code	victim_age	victim_gender	victim_ethnicity	premise_code	premise_description	weapon_code	crime_code_1	latitude	longitude	case_solved
23	2021-08-21	2021-08-21 23:51:00	12	77th Street	1	210	51	Female	Black	101.0	STREET	400.0	210.0	33.9897	-118.2915	Solved
28	2021-02-20	2021-02-20 22:50:00	2	Rampart	2	745	0	Female	Unknown	122.0	VEHICLE, PASSENGER/TRUCK	500.0	745.0	34.0891	-118.2992	Not solved
29	2021-07-29	2021-07-28 21:00:00	14	Pacific	1	310	0	Non-binary	Unknown	120.0	STORAGE SHED	500.0	310.0	33.9586	-118.4485	Not solved
36	2021-12-25	2021-12-25 19:00:00	1	Central	1	210	53	Male	White	101.0	STREET	500.0	210.0	34.0430	-118.2420	Not solved
37	2021-06-24	2021-06-24 02:35:00	6	Hollywood	1	210	40	Male	Black	101.0	STREET	102.0	210.0	34.0998	-118.3288	Not solved

	weapon_description
weapon_code
400	STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)
500	UNKNOWN WEAPON/OTHER WEAPON
102	HAND GUN
201	KNIFE WITH BLADE OVER 6 INCHES IN LENGTH
107	OTHER FIREARM

	area	area_name	part	crime_code	victim_gender	victim_ethnicity	premise_code	premise_description	weapon_code	crime_code_1	case_solved	weapon_description	crime_description
count	63864	63864	63864	63864	63864	63864	63864.0	63864	63864.0	63864.0	63864	63864	63864
unique	21	21	2	100	5	12	256.0	255	73.0	100.0	2	73	100
top	12	77th Street	2	624	Male	Hispanic	101.0	STREET	400.0	624.0	Not solved	STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)	BATTERY - SIMPLE ASSAULT
freq	6189	6189	34108	13980	31379	28763	14274.0	14274	33607.0	14039.0	38151	33607	13980

University of Niagara Falls, Canada¶

Master of Data Analytics¶

Python for Data Analysis (CPSC-610-13)¶

Spring 2025¶

Final Course Project¶

Part 4: Comprehensive Case Study – Crime Case¶

Data Cleaning¶

Install additional libraries¶

Imports¶

Identify and Handle Missing Values¶

Handling Missing Data¶

Remove or review duplicates and inconsistencies¶

Ensure correct data types¶

Normalize inconsistent categorical values¶

Removing columns without documentation or information to be handled¶

Victim Gender Value Standardization¶

Victim Ethnicity Value Standardization¶

Exploratory Data Analysis (EDA)¶

Merging the datasets for analysis and reporting¶

Generate summary statistics¶

Visualize distributions and correlations¶

Analyze time- and location-based patterns¶

Explore solve rate by various features¶

Visualization¶

Feature Engineering¶

Transform or Derive Features¶

Encode Categorical Variables¶

Normalize or Scale Features¶

Model Building¶

Prepare Features and Target¶

Train-Test-Validation Split¶

Model 1: Logistic Regression¶

Model 2: Random Forest¶

Initial Validation & Test¶

Evaluate how well your models perform¶

Generate All Key Metric¶

Display Metrics¶

Confusion Matrix Plots¶

ROC Curves¶

Insights, Interpretation, and Reporting¶

Executive Summary¶

Factors Influencing Case Solvability¶

Spatial, Temporal, and Demographic Patterns¶

Recommendations and Suggested Interventions¶

Limitations¶

Conclusions¶