University of Niagara Falls, Canada¶

Master of Data Analytics¶

Python for Data Analysis (CPSC-610-13)¶

Spring 2025¶

Final Course Project¶

Part 4: Comprehensive Case Study – Crime Case¶

Mario Fernando Zamudio Portilla (NF1002499)

This final part is mandatory and must be submitted. It focuses on analyzing and modeling the LAPD Crime Case Dataset (crime_dataset.csv) provided to you.

In addition, two lookup files are included to help you interpret and enrich the categorical codes related to crime types and weapon types. You are expected to use them appropriately as part of your analysis.

Your goal is to apply the full range of skills covered in this course, including data processing, analysis, and machine learning. You are expected to deliver a structured analysis and a final report containing your findings, supported by data, visualizations, and code.

Data Cleaning¶

Prepare the dataset for analysis by resolving common data quality issues.

  • Identify and handle missing values
  • Remove or review duplicates and inconsistencies
  • Ensure correct data types
  • Normalize inconsistent categorical values

Install additional libraries¶

In [1]:
!pip install folium
!pip install branca
Requirement already satisfied: folium in c:\users\mzamu\anaconda3\lib\site-packages (0.20.0)
Requirement already satisfied: branca>=0.6.0 in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (0.8.1)
Requirement already satisfied: jinja2>=2.9 in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (3.1.4)
Requirement already satisfied: numpy in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (1.26.4)
Requirement already satisfied: requests in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (2.32.3)
Requirement already satisfied: xyzservices in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (2022.9.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\mzamu\anaconda3\lib\site-packages (from jinja2>=2.9->folium) (2.1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (2025.4.26)
Requirement already satisfied: branca in c:\users\mzamu\anaconda3\lib\site-packages (0.8.1)
Requirement already satisfied: jinja2>=3 in c:\users\mzamu\anaconda3\lib\site-packages (from branca) (3.1.4)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\mzamu\anaconda3\lib\site-packages (from jinja2>=3->branca) (2.1.3)

Imports¶

In [2]:
# Data manipulation libraries
import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

# Machine learning libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, 
    classification_report, 
    confusion_matrix, 
    f1_score,
    precision_score, 
    recall_score, 
    roc_auc_score,
    RocCurveDisplay
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
In [3]:
# Load datasets
crime_df = pd.read_csv('crime_dataset.csv')
weapon_types_df = pd.read_csv('weapon_types.csv')
crime_types_df = pd.read_csv('crime_types.csv')

# Quick preview
print("Crime Dataset:")
display(crime_df.shape)
display(crime_df.head())

print("\nWeapon Types:")
display(weapon_types_df.shape)
display(weapon_types_df.head())

print("\nCrime Types:")
display(crime_types_df.shape)
display(crime_types_df.head())
Crime Dataset:
(203089, 25)
division_number date_reported date_occurred area area_name reporting_district part crime_code modus_operandi victim_age ... crime_code_1 crime_code_2 crime_code_3 crime_code_4 incident_admincode location cross_street latitude longitude case_solved
0 211414090 2021-06-27 2021-06-20 20:00:00 14 Pacific 1464 1 480 0344 32.0 ... 480.0 NaN NaN NaN 0 12400 FIELDING NaN 33.9791 -118.4092 Not solved
1 210504861 2021-01-22 2021-01-21 22:00:00 5 Harbor 515 1 510 NaN 0.0 ... 510.0 NaN NaN NaN 1 1500 BAY VIEW AV NaN 33.7929 -118.2710 Solved
2 210104843 2021-01-21 2021-01-21 02:00:00 1 Central 139 1 510 NaN 0.0 ... 510.0 NaN NaN NaN 1 300 S SANTA FE AV NaN 34.0420 -118.2326 Solved
3 210115564 2021-08-22 2021-08-22 07:00:00 1 Central 151 1 350 1308 0344 0345 1822 29.0 ... 350.0 NaN NaN NaN 0 7TH FIGUEROA 34.0496 -118.2603 Not solved
4 211421187 2021-11-09 2021-11-07 19:00:00 14 Pacific 1465 1 510 NaN 0.0 ... 510.0 NaN NaN NaN 0 5500 MESMER AV NaN 33.9869 -118.4022 Not solved

5 rows × 25 columns

Weapon Types:
(73, 2)
weapon_code weapon_description
0 400 STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)
1 500 UNKNOWN WEAPON/OTHER WEAPON
2 102 HAND GUN
3 201 KNIFE WITH BLADE OVER 6 INCHES IN LENGTH
4 107 OTHER FIREARM
Crime Types:
(133, 2)
crime_code crime_description
0 480 BIKE - STOLEN
1 510 VEHICLE - STOLEN
2 350 THEFT, PERSON
3 440 THEFT PLAIN - PETTY ($950 & UNDER)
4 420 THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)

Identify and Handle Missing Values¶

In [4]:
# Check missing values in each dataset
print("Missing values in crime_df:\n", crime_df.isnull().sum())
print("\nMissing values in weapon_types_df:\n", weapon_types_df.isnull().sum())
print("\nMissing values in crime_types_df:\n", crime_types_df.isnull().sum())
Missing values in crime_df:
 division_number             0
date_reported               0
date_occurred               0
area                        0
area_name                   0
reporting_district          0
part                        0
crime_code                  0
modus_operandi          28929
victim_age               4062
victim_gender           27729
victim_ethnicity        27731
premise_code                3
premise_description        92
weapon_code            131843
crime_code_1                2
crime_code_2           187239
crime_code_3           202560
crime_code_4           203071
incident_admincode          0
location                    0
cross_street           167414
latitude                    0
longitude                   0
case_solved                 0
dtype: int64

Missing values in weapon_types_df:
 weapon_code           0
weapon_description    0
dtype: int64

Missing values in crime_types_df:
 crime_code           0
crime_description    0
dtype: int64
Handling Missing Data¶

In this analysis, missing values are expected in several columns due to the nature of crime reporting. For columns where missingness indicates unreported or inapplicable information (e.g., weapon_code, victim_gender, modus_operandi), we filled missing values with the label "Unknown". For victim_age, we used the median age to impute missing values, ensuring statistical consistency. Columns with more than 95% missing values and no analytical value were dropped to streamline the dataset.

In [5]:
# Fill missing categorical data with a placeholder
for col in ['victim_gender', 'victim_ethnicity', 'modus_operandi', 
            'premise_description', 'cross_street']:
    crime_df[col] = crime_df[col].fillna('Unknown')

# Fill missing weapon_code data with a 500 this is the value for be categorized based on the weapon types dataset
crime_df[col] = crime_df['weapon_code'].fillna(500)

# For numeric columns 
crime_df = crime_df.dropna(subset=['victim_age', 'latitude', 'longitude'])

# Columns with excessive missing values 
threshold = 0.90  # Drop columns with >90% missing
cols_to_drop = [col for col in crime_df.columns if crime_df[col].isna().mean() > threshold]
crime_df = crime_df.drop(columns=cols_to_drop)
print("Dropped columns due to excessive missing values:", cols_to_drop)

crime_df = crime_df.dropna()
Dropped columns due to excessive missing values: ['crime_code_2', 'crime_code_3', 'crime_code_4']
In [6]:
print("Missing values in crime_df:\n", crime_df.isnull().sum())
crime_df.shape
Missing values in crime_df:
 division_number        0
date_reported          0
date_occurred          0
area                   0
area_name              0
reporting_district     0
part                   0
crime_code             0
modus_operandi         0
victim_age             0
victim_gender          0
victim_ethnicity       0
premise_code           0
premise_description    0
weapon_code            0
crime_code_1           0
incident_admincode     0
location               0
cross_street           0
latitude               0
longitude              0
case_solved            0
dtype: int64
Out[6]:
(69823, 22)

Remove or review duplicates and inconsistencies¶

In [7]:
# Check for duplicates
print("Duplicates in crime_df:", crime_df.duplicated().sum())
print("Duplicates in weapon_types_df:", weapon_types_df.duplicated().sum())
print("Duplicates in crime_types_df:", crime_types_df.duplicated().sum())
Duplicates in crime_df: 0
Duplicates in weapon_types_df: 0
Duplicates in crime_types_df: 0
In [8]:
# Remove duplicates
crime_df = crime_df.drop_duplicates()
weapon_types_df = weapon_types_df.drop_duplicates()
crime_types_df = crime_types_df.drop_duplicates()

Ensure correct data types¶

In [9]:
# Check data types

print("Crime Dataset:")
print(crime_df.dtypes)

print("\nWeapon Types:")
print(weapon_types_df.dtypes)

print("\nCrime Types:")
print(crime_types_df.dtypes)
Crime Dataset:
division_number          int64
date_reported           object
date_occurred           object
area                     int64
area_name               object
reporting_district       int64
part                     int64
crime_code               int64
modus_operandi          object
victim_age             float64
victim_gender           object
victim_ethnicity        object
premise_code           float64
premise_description     object
weapon_code            float64
crime_code_1           float64
incident_admincode       int64
location                object
cross_street           float64
latitude               float64
longitude              float64
case_solved             object
dtype: object

Weapon Types:
weapon_code            int64
weapon_description    object
dtype: object

Crime Types:
crime_code            int64
crime_description    object
dtype: object
In [10]:
# Convert date columns
crime_df['date_reported'] = pd.to_datetime(crime_df['date_reported'], errors='coerce')
crime_df['date_occurred'] = pd.to_datetime(crime_df['date_occurred'], errors='coerce')

# Ensure categorical types
crime_df['division_number'] = crime_df['division_number'].astype('category')
crime_df['area'] = crime_df['area'].astype('category')
crime_df['reporting_district'] = crime_df['reporting_district'].astype('category')
crime_df['part'] = crime_df['part'].astype('category')
crime_df['crime_code'] = crime_df['crime_code'].astype('category')
crime_df['premise_code'] = crime_df['premise_code'].astype('category')
crime_df['crime_code_1'] = crime_df['crime_code_1'].astype('category')
crime_df['incident_admincode'] = crime_df['incident_admincode'].astype('category')
crime_df['weapon_code'] = crime_df['weapon_code'].astype('category')
crime_df['area_name'] = crime_df['area_name'].astype('category')
crime_df['case_solved'] = crime_df['case_solved'].astype('category')
crime_df['premise_description'] = crime_df['premise_description'].astype('category')
crime_df['victim_ethnicity'] = crime_df['victim_ethnicity'].astype('category')
crime_df['victim_gender'] = crime_df['victim_gender'].astype('category')

weapon_types_df['weapon_code'] = weapon_types_df['weapon_code'].astype('category')
weapon_types_df['weapon_description'] = weapon_types_df['weapon_description'].astype('category')
crime_types_df['crime_description'] = crime_types_df['crime_description'].astype('category')
crime_types_df['crime_code'] = crime_types_df['crime_code'].astype('category')

# Ensure number types
crime_df['victim_age'] = crime_df['victim_age'].astype(int)
crime_df['latitude'] = pd.to_numeric(crime_df['latitude'], errors='coerce')
crime_df['longitude'] = pd.to_numeric(crime_df['longitude'], errors='coerce')

weapon_types_df = weapon_types_df.set_index('weapon_code')
crime_types_df = crime_types_df.set_index('crime_code')

print("Crime Dataset:")
print(crime_df.dtypes)

print("\nWeapon Types:")
print(weapon_types_df.dtypes)

print("\nCrime Types:")
print(crime_types_df.dtypes)
Crime Dataset:
division_number              category
date_reported          datetime64[ns]
date_occurred          datetime64[ns]
area                         category
area_name                    category
reporting_district           category
part                         category
crime_code                   category
modus_operandi                 object
victim_age                      int32
victim_gender                category
victim_ethnicity             category
premise_code                 category
premise_description          category
weapon_code                  category
crime_code_1                 category
incident_admincode           category
location                       object
cross_street                  float64
latitude                      float64
longitude                     float64
case_solved                  category
dtype: object

Weapon Types:
weapon_description    category
dtype: object

Crime Types:
crime_description    category
dtype: object

Normalize inconsistent categorical values¶

In [11]:
# Loop through all categorical columns and show top 10 most frequent values
categorical_cols = crime_df.select_dtypes(include=['category', 'object']).columns

for col in categorical_cols:
    print(f"\nColumn: {col}")
    value_counts = crime_df[col].value_counts(dropna=False)
    top_10 = value_counts.head(10)
    print("Top 10 most frequent values:")
    print(top_10)
    print(f"Total unique values: {crime_df[col].nunique(dropna=True)}")
Column: division_number
Top 10 most frequent values:
division_number
201600896    1
211406652    1
211406712    1
211406711    1
211406710    1
211406707    1
211406704    1
211406673    1
211406645    1
211407280    1
Name: count, dtype: int64
Total unique values: 69823

Column: area
Top 10 most frequent values:
area
12    6259
1     5224
18    4757
6     4674
3     4473
13    4033
2     3904
20    3764
14    3325
5     2971
Name: count, dtype: int64
Total unique values: 21

Column: area_name
Top 10 most frequent values:
area_name
77th Street    6259
Central        5224
Southeast      4757
Hollywood      4674
Southwest      4473
Newton         4033
Rampart        3904
Olympic        3764
Pacific        3325
Harbor         2971
Name: count, dtype: int64
Total unique values: 21

Column: reporting_district
Top 10 most frequent values:
reporting_district
645     520
636     462
646     403
1822    345
162     333
119     317
245     305
1241    290
1268    287
666     286
Name: count, dtype: int64
Total unique values: 1138

Column: part
Top 10 most frequent values:
part
2    37675
1    32148
Name: count, dtype: int64
Total unique values: 2

Column: crime_code
Top 10 most frequent values:
crime_code
624    15410
230    12168
626    10216
210     7064
930     4042
761     3218
236     2912
740     1386
310     1200
220     1093
Name: count, dtype: int64
Total unique values: 102

Column: modus_operandi
Top 10 most frequent values:
modus_operandi
0416         687
0416 1822    300
0329         243
0421         236
0400         235
1822 0416    225
1501         190
0416 0913    186
0400 0416    179
0444         177
Name: count, dtype: int64
Total unique values: 51648

Column: victim_gender
Top 10 most frequent values:
victim_gender
M          31070
F          28735
X           4052
Male        3116
Female      2831
Unknown       11
H              8
Name: count, dtype: int64
Total unique values: 7

Column: victim_ethnicity
Top 10 most frequent values:
victim_ethnicity
H       31804
B       15659
W       11550
O        4658
X        4587
A        1348
K         109
F          41
Rare       18
I          15
Name: count, dtype: int64
Total unique values: 14

Column: premise_code
Top 10 most frequent values:
premise_code
101.0    15431
501.0    12222
502.0    11612
102.0     6924
108.0     4123
203.0     2732
210.0      971
122.0      965
103.0      818
109.0      747
Name: count, dtype: int64
Total unique values: 264

Column: premise_description
Top 10 most frequent values:
premise_description
STREET                                          15431
SINGLE FAMILY DWELLING                          12222
MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)    11612
SIDEWALK                                         6924
PARKING LOT                                      4123
OTHER BUSINESS                                   2732
RESTAURANT/FAST FOOD                              971
VEHICLE, PASSENGER/TRUCK                          965
ALLEY                                             818
PARK/PLAYGROUND                                   747
Name: count, dtype: int64
Total unique values: 263

Column: weapon_code
Top 10 most frequent values:
weapon_code
400.0    37044
500.0     6943
511.0     4963
102.0     4702
109.0     1639
106.0     1581
200.0     1540
207.0     1229
512.0      815
307.0      756
Name: count, dtype: int64
Total unique values: 73

Column: crime_code_1
Top 10 most frequent values:
crime_code_1
624.0    15479
230.0    12170
626.0    10277
210.0     7066
930.0     4023
761.0     3095
236.0     2913
740.0     1418
310.0     1203
220.0     1093
Name: count, dtype: int64
Total unique values: 102

Column: incident_admincode
Top 10 most frequent values:
incident_admincode
0    41187
1    28636
Name: count, dtype: int64
Total unique values: 2

Column: location
Top 10 most frequent values:
location
6TH                                       242
7TH                                       237
800 N  ALAMEDA                      ST    221
7TH                          ST           219
6TH                          ST           209
WESTERN                      AV           190
5TH                                       179
HOLLYWOOD                                 171
BROADWAY                                  167
FIGUEROA                     ST           166
Name: count, dtype: int64
Total unique values: 20541

Column: case_solved
Top 10 most frequent values:
case_solved
Not solved    41275
Solved        28548
Name: count, dtype: int64
Total unique values: 2

Removing columns without documentation or information to be handled¶

modus_operandi, division_number, reporting_district

In [12]:
crime_df = crime_df.drop(columns=['modus_operandi','division_number','reporting_district', 'cross_street', 'incident_admincode','location'])

crime_df = crime_df.drop_duplicates()
Victim Gender Value Standardization¶

The victim_gender column initially contained abbreviations, full words, and ambiguous codes ('M', 'Male', 'F', 'Female', 'X', 'H', 'Unknown'). To ensure consistency, all values were mapped to a standard set: 'Male', 'Female', 'Non-binary', and 'Homosexual'.

This avoids double-counting and clarifies how non-binary or unspecified genders are represented.

In [13]:
# Normalizing values for 'victim_gender'
gender_map = {
    'M': 'Male',
    'F': 'Female',
    'Male': 'Male',
    'Female': 'Female',
    'X': 'Non-binary',
    'H': 'Homosexual', 
    'Unknown': 'Unknown'
}
crime_df['victim_gender'] = crime_df['victim_gender'].map(gender_map).fillna('Unknown')

print(crime_df['victim_gender'].unique())
['Female' 'Non-binary' 'Male' 'Homosexual' 'Unknown']
Victim Ethnicity Value Standardization¶

The victim_ethnicity column contained a variety of single-letter codes and ambiguous values, such as 'W', 'H', 'O', 'X', and 'Rare'. To improve clarity, these values were mapped to broader, standardized categories (e.g., 'White', 'Black', 'Hispanic', 'Asian', 'Other', 'Unknown'). This ensures more consistent and interpretable results in the analysis.

In [14]:
# Ethnicity mapping (adjust as needed for your assignment or codebook)
ethnicity_map = {
    'W': 'White',
    'B': 'Black',
    'H': 'Hispanic',
    'A': 'Asian',
    'C': 'Chinese',
    'I': 'Indigenous',
    'K': 'Korean',    # Without a codebook, group as 'Other'
    'F': 'Filipino',
    'V': 'Vietnamese',
    'J': 'Japanese',
    'Z': 'Other',
    'O': 'Other',
    'X': 'Unknown',
    'Unknown': 'Unknown',
    'Rare': 'Other'
}

crime_df['victim_ethnicity'] = crime_df['victim_ethnicity'].map(ethnicity_map).fillna('Unknown')

# Check new unique values
print(crime_df['victim_ethnicity'].unique())
['Black' 'Unknown' 'White' 'Hispanic' 'Other' 'Asian' 'Indigenous'
 'Filipino' 'Korean' 'Chinese' 'Vietnamese' 'Japanese']

Exploratory Data Analysis (EDA)¶

Explore and understand the structure, trends, and relationships in the data.

  • Perform required operations to bring in extra info from the provided crime and weapon lookup files.
  • Generate summary statistics
  • Visualize distributions and correlations
  • Analyze time- and location-based patterns
  • Explore solve rate by various features

Merging the datasets for analysis and reporting¶

In [15]:
crime_df.head(5)
Out[15]:
date_reported date_occurred area area_name part crime_code victim_age victim_gender victim_ethnicity premise_code premise_description weapon_code crime_code_1 latitude longitude case_solved
23 2021-08-21 2021-08-21 23:51:00 12 77th Street 1 210 51 Female Black 101.0 STREET 400.0 210.0 33.9897 -118.2915 Solved
28 2021-02-20 2021-02-20 22:50:00 2 Rampart 2 745 0 Female Unknown 122.0 VEHICLE, PASSENGER/TRUCK 500.0 745.0 34.0891 -118.2992 Not solved
29 2021-07-29 2021-07-28 21:00:00 14 Pacific 1 310 0 Non-binary Unknown 120.0 STORAGE SHED 500.0 310.0 33.9586 -118.4485 Not solved
36 2021-12-25 2021-12-25 19:00:00 1 Central 1 210 53 Male White 101.0 STREET 500.0 210.0 34.0430 -118.2420 Not solved
37 2021-06-24 2021-06-24 02:35:00 6 Hollywood 1 210 40 Male Black 101.0 STREET 102.0 210.0 34.0998 -118.3288 Not solved
In [16]:
weapon_types_df.head(5)
Out[16]:
weapon_description
weapon_code
400 STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)
500 UNKNOWN WEAPON/OTHER WEAPON
102 HAND GUN
201 KNIFE WITH BLADE OVER 6 INCHES IN LENGTH
107 OTHER FIREARM
In [17]:
crime_types_df.head(5)
Out[17]:
crime_description
crime_code
480 BIKE - STOLEN
510 VEHICLE - STOLEN
350 THEFT, PERSON
440 THEFT PLAIN - PETTY ($950 & UNDER)
420 THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)
In [18]:
print(weapon_types_df.columns)
print(crime_types_df.columns)
Index(['weapon_description'], dtype='object')
Index(['crime_description'], dtype='object')
In [19]:
# Merge with weapon types
crime_df = crime_df.merge(weapon_types_df, on='weapon_code', how='left')
# Merge with crime types
crime_df = crime_df.merge(crime_types_df, on='crime_code', how='left')
In [20]:
crime_df['weapon_code'] = crime_df['weapon_code'].astype('category')
crime_df['crime_code'] = crime_df['crime_code'].astype('category')
In [21]:
# Preview merged columns
print(crime_df[['weapon_code', 'weapon_description', 'crime_code', 'crime_description']].head())
  weapon_code                              weapon_description crime_code  \
0       400.0  STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)        210   
1       500.0                     UNKNOWN WEAPON/OTHER WEAPON        745   
2       500.0                     UNKNOWN WEAPON/OTHER WEAPON        310   
3       500.0                     UNKNOWN WEAPON/OTHER WEAPON        210   
4       102.0                                        HAND GUN        210   

                          crime_description  
0                                   ROBBERY  
1  VANDALISM - MISDEAMEANOR ($399 OR UNDER)  
2                                  BURGLARY  
3                                   ROBBERY  
4                                   ROBBERY  

Generate summary statistics¶

The summary statistics below provide an overview of the numerical and categorical features in the dataset, including victim demographics, crime counts by type and area, and the proportion of solved cases.

In [22]:
# Summary statistics for numeric columns
print("Numeric Summary Statistics:")
display(crime_df.describe())
Numeric Summary Statistics:
date_reported date_occurred victim_age latitude longitude
count 69479 69479 69479.000000 69479.000000 69479.000000
mean 2021-07-07 08:10:23.428661760 2021-07-05 20:22:01.095870720 34.621022 33.807383 -117.457399
min 2021-01-01 00:00:00 2021-01-01 00:01:00 0.000000 0.000000 -118.667200
25% 2021-04-12 00:00:00 2021-04-10 02:55:00 24.000000 33.996900 -118.399800
50% 2021-07-11 00:00:00 2021-07-09 12:45:00 33.000000 34.049900 -118.304900
75% 2021-10-02 00:00:00 2021-09-30 19:05:00 46.000000 34.110400 -118.269600
max 2021-12-31 00:00:00 2021-12-31 23:30:00 99.000000 34.334300 0.000000
std NaN NaN 17.686676 2.932096 10.180443
In [23]:
def get_iqr_bounds(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return lower, upper

lat_lower, lat_upper = get_iqr_bounds(crime_df['latitude'])
lon_lower, lon_upper = get_iqr_bounds(crime_df['longitude'])
print(f"Latitude bounds: {lat_lower:.5f} to {lat_upper:.5f}")
print(f"Longitude bounds: {lon_lower:.5f} to {lon_upper:.5f}")
Latitude bounds: 33.82665 to 34.28065
Longitude bounds: -118.59510 to -118.07430
In [24]:
outliers = crime_df[
    (crime_df['latitude'] < lat_lower) | (crime_df['latitude'] > lat_upper) |
    (crime_df['longitude'] < lon_lower) | (crime_df['longitude'] > lon_upper)
]

print(f"Total outliers found: {outliers.shape[0]}")
Total outliers found: 5615
In [25]:
crime_df = crime_df[
    (crime_df['latitude'] >= lat_lower) & (crime_df['latitude'] <= lat_upper) &
    (crime_df['longitude'] >= lon_lower) & (crime_df['longitude'] <= lon_upper)
].copy()

print(f"Rows after removing outliers: {crime_df.shape[0]}")
Rows after removing outliers: 63864
In [26]:
# Summary statistics for categorical/object columns
print("Categorical Summary Statistics:")
display(crime_df.describe(include=['category', 'object']))
Categorical Summary Statistics:
area area_name part crime_code victim_gender victim_ethnicity premise_code premise_description weapon_code crime_code_1 case_solved weapon_description crime_description
count 63864 63864 63864 63864 63864 63864 63864.0 63864 63864.0 63864.0 63864 63864 63864
unique 21 21 2 100 5 12 256.0 255 73.0 100.0 2 73 100
top 12 77th Street 2 624 Male Hispanic 101.0 STREET 400.0 624.0 Not solved STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) BATTERY - SIMPLE ASSAULT
freq 6189 6189 34108 13980 31379 28763 14274.0 14274 33607.0 14039.0 38151 33607 13980
In [27]:
# Solve rate
print("Case Solved Rate:")
display(crime_df['case_solved'].value_counts(normalize=True))

# Top 5 most common crime types
print("Top 5 Crime Types:")
display(crime_df['crime_description'].value_counts().head())

# Top 5 areas by crime count
print("Top 5 Areas:")
display(crime_df['area_name'].value_counts().head())
Case Solved Rate:
case_solved
Not solved    0.597379
Solved        0.402621
Name: proportion, dtype: float64
Top 5 Crime Types:
crime_description
BATTERY - SIMPLE ASSAULT                          13980
ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT    11190
INTIMATE PARTNER - SIMPLE ASSAULT                  9122
ROBBERY                                            6661
CRIMINAL THREATS - NO WEAPON DISPLAYED             3655
Name: count, dtype: int64
Top 5 Areas:
area_name
77th Street    6189
Central        5138
Southeast      4691
Hollywood      4582
Southwest      4439
Name: count, dtype: int64

Visualize distributions and correlations¶

In [28]:
# Victim Age Distribution
plt.figure(figsize=(8, 4))
sns.histplot(crime_df['victim_age'].dropna(), bins=50, kde=True)
plt.title('Distribution of Victim Age')
plt.xlabel('Victim Age')
plt.ylabel('Count')
plt.xlim(0, 100)
plt.xticks(range(0, 101, 5), rotation=45)
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.show()
No description has been provided for this image
In [29]:
plt.figure(figsize=(10, 5))
order = crime_df['victim_ethnicity'].value_counts().index

# Get counts and percentages
counts = crime_df['victim_ethnicity'].value_counts()
percentages = counts / counts.sum() * 100

ax = sns.countplot(
    y='victim_ethnicity',
    data=crime_df,
    order=order,
    hue='victim_ethnicity',
    palette='Blues_r',
    legend=False
)
plt.title('Distribution of Victim Ethnicity')
plt.xlabel('Count')
plt.ylabel('Victim Ethnicity')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)
plt.tight_layout()

plt.show()
No description has been provided for this image
In [30]:
plt.figure(figsize=(7, 4))
order = crime_df['victim_gender'].value_counts().index

# Get counts and percentages
counts = crime_df['victim_gender'].value_counts()
percentages = counts / counts.sum() * 100

ax = sns.countplot(
    y='victim_gender',
    data=crime_df,
    order=order,
    hue='victim_gender',
    palette='Blues_r',
    legend=False
)
plt.title('Distribution of Victim Gender')
plt.xlabel('Count')
plt.ylabel('Victim Gender')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.tight_layout()
plt.show()
No description has been provided for this image
In [31]:
print(crime_df['crime_description'].isnull().sum())
print(crime_df['crime_description'].value_counts().head(10))
0
crime_description
BATTERY - SIMPLE ASSAULT                                   13980
ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT             11190
INTIMATE PARTNER - SIMPLE ASSAULT                           9122
ROBBERY                                                     6661
CRIMINAL THREATS - NO WEAPON DISPLAYED                      3655
BRANDISH WEAPON                                             2925
INTIMATE PARTNER - AGGRAVATED ASSAULT                       2614
VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)     1363
BURGLARY                                                    1149
ATTEMPTED ROBBERY                                           1029
Name: count, dtype: int64
In [32]:
# Calculate and print the top 10 for verification

top_crimes = crime_df['crime_description'].str.strip().value_counts().head(10)

top_crimes_df = top_crimes.reset_index()
top_crimes_df.columns = ['crime_description', 'count']

plt.figure(figsize=(10, 5))
sns.barplot(data=top_crimes_df, y='crime_description', x='count')
plt.title('Top 10 Crime Types')
plt.xlabel('Number of Crimes')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.ylabel('Crime Type')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [33]:
# Crimes by Area (Top 10)
plt.figure(figsize=(10, 5))
top_areas = crime_df['area_name'].str.strip().value_counts().head(10)
sns.barplot(y=top_areas.index, x=top_areas.values)
plt.title('Top 10 Areas by Crime Count')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.xlabel('Number of Crimes')
plt.ylabel('Area')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [34]:
# Select only numeric columns (drop non-numeric ones)
numeric_cols = crime_df.select_dtypes(include='number')

plt.figure(figsize=(8, 6))
sns.heatmap(numeric_cols.corr(), annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Heatmap for Numeric Features')
plt.show()
No description has been provided for this image

Analyze time- and location-based patterns¶

To identify trends, the dataset was explored across multiple time features (year, month, day of week). Results showed temporal cycles in crime volume, with certain days and months having more incidents. Geographic analysis highlighted concentrations in specific city areas, and mapping latitude/longitude revealed clusters and hotspots of criminal activity.

In [35]:
# Extract date time values from the date occurred
crime_df['year'] = crime_df['date_occurred'].dt.year
crime_df['month'] = crime_df['date_occurred'].dt.month
crime_df['day_of_week'] = crime_df['date_occurred'].dt.day_name()
crime_df['month_name'] = crime_df['date_occurred'].dt.month_name()
crime_df['hour'] = crime_df['date_occurred'].dt.hour
In [36]:
# Crimes per month (over all years)
monthly_counts = crime_df.groupby(['year', 'month']).size().reset_index(name='crime_count')
monthly_counts['year_month'] = monthly_counts['year'].astype(str) + '-' + monthly_counts['month'].astype(str).str.zfill(2)

plt.figure(figsize=(14, 5))
plt.plot(monthly_counts['year_month'], monthly_counts['crime_count'], marker='o')
plt.title('Crimes per Month')
plt.xlabel('Year-Month')
plt.ylabel('Number of Crimes')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [37]:
# Count crimes per month (all years)
month_counts = crime_df['month_name'].value_counts().reindex([
    'January','February','March','April','May','June','July','August','September','October','November','December'
])

plt.figure(figsize=(10,5))
sns.barplot(x=month_counts.index, y=month_counts.values)
plt.title('Total Crimes by Month (All Years Combined)')
plt.ylabel('Number of Crimes')
plt.xlabel('Month')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.show()
No description has been provided for this image
In [38]:
plt.figure(figsize=(8, 4))
sns.countplot(data=crime_df, x='day_of_week', order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
plt.title('Crime Count by Day of the Week')
plt.ylabel('Number of Crimes')
plt.xlabel('Day of the Week')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.show()
No description has been provided for this image
In [39]:
# Count crimes per hour
hour_counts = crime_df['hour'].value_counts().sort_index()

plt.figure(figsize=(10,5))
sns.barplot(x=hour_counts.index, y=hour_counts.values, hue=hour_counts.index, legend=False, palette='viridis')
plt.title('Crime Count by Hour of the Day')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Crimes')
plt.xticks(range(0,24))
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)  # Adds major/minor grids
plt.tight_layout()
plt.show()
No description has been provided for this image
In [40]:
# Day of week
weekday_counts = crime_df['day_of_week'].value_counts()
weekday_max = weekday_counts.idxmax()
weekday_min = weekday_counts.idxmin()

# Hour of day
hour_counts = crime_df['hour'].value_counts()
hour_max = hour_counts.idxmax()
hour_min = hour_counts.idxmin()

# Month
month_counts = crime_df['month_name'].value_counts()
month_max = month_counts.idxmax()
month_min = month_counts.idxmin()

summary_table = pd.DataFrame({
    'Time Feature': ['Weekday', 'Hour', 'Month'],
    'Highest Crime': [f"{weekday_max} ({weekday_counts[weekday_max]})",
                      f"{hour_max} ({hour_counts[hour_max]})",
                      f"{month_max} ({month_counts[month_max]})"],
    'Lowest Crime': [f"{weekday_min} ({weekday_counts[weekday_min]})",
                     f"{hour_min} ({hour_counts[hour_min]})",
                     f"{month_min} ({month_counts[month_min]})"]
})

print(summary_table)
  Time Feature   Highest Crime     Lowest Crime
0      Weekday  Sunday (10093)   Tuesday (8304)
1         Hour       20 (3791)         5 (1082)
2        Month     July (6100)  February (4580)
In [41]:
plt.figure(figsize=(8,6))
sns.scatterplot(x='longitude', y='latitude', data=crime_df, alpha=0.1, s=5)
plt.title('Crime Locations (All Data)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
No description has been provided for this image

The interactive map is centered at the median latitude and longitude of all incidents, corresponding to the area with the highest density of reported crimes.

In [42]:
# Calculate median latitude and longitude (good if your data has outliers)
center_lat = crime_df['latitude'].median()
center_lon = crime_df['longitude'].median()

crime_map = folium.Map(location=[center_lat, center_lon], zoom_start=10)

# Plot a sample (for performance)
sample_df = crime_df[['latitude', 'longitude']].dropna().sample(500, random_state=1)

for idx, row in sample_df.iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=3,
        color='red',
        fill=True,
        fill_opacity=0.3
    ).add_to(crime_map)

folium.Marker(
    location=[center_lat, center_lon],
    popup='Hotspot Center',
    icon=folium.Icon(color='blue', icon='info-sign')
).add_to(crime_map)

crime_map
Out[42]:
Make this Notebook Trusted to load map: File -> Trust Notebook

The following heatmap visualizes the density of crime incidents across the city. Red and yellow areas indicate hotspots with the highest concentrations of reported crimes, centered at the area of greatest activity.

In [43]:
# Center map at the median
center_lat = crime_df['latitude'].median()
center_lon = crime_df['longitude'].median()

crime_map = folium.Map(location=[center_lat, center_lon], zoom_start=10)

# Prepare the data (drop missing and sample for speed)
heat_data = crime_df[['latitude', 'longitude']].dropna().sample(2000, random_state=1).values.tolist()

# Add heatmap layer
HeatMap(heat_data, radius=15, blur=5, min_opacity=0.2).add_to(crime_map)

crime_map
Out[43]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Explore solve rate by various features¶

In [44]:
# Bin victim_age
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
age_labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90+']

crime_df['victim_age_group'] = pd.cut(crime_df['victim_age'], bins=age_bins, labels=age_labels, right=False)

def get_top_bottom_solve_rates(crime_df, feature, solved_label='Solved', n=3):
    # Calculate solve rates
    solve_by_feat = (
        crime_df.groupby(feature, observed=True)['case_solved']
        .value_counts(normalize=True)
        .unstack()
        .fillna(0)
    )
    # Sort by solve rate
    sorted_solve = solve_by_feat.sort_values(by=solved_label, ascending=False)
    # Get top n and bottom n (with solve rate > 0)
    top_n = sorted_solve[sorted_solve[solved_label] > 0].head(n)
    bottom_n = sorted_solve[sorted_solve[solved_label] > 0].tail(n)
    return top_n[[solved_label]].reset_index(), bottom_n[[solved_label]].reset_index()

# Features to analyze
features = [
    ('crime_description', 'Crime Type'),
    ('weapon_description', 'Weapon'),
    ('victim_ethnicity', 'Victim Ethnicity'),
    ('victim_gender', 'Victim Gender'),
    ('area_name', 'Area'),
    ('victim_age_group', 'Victim Age Group')
]

# Build summary DataFrame
summary_rows = []
for feat_col, feat_label in features:
    top, bottom = get_top_bottom_solve_rates(crime_df, feat_col)
    # Add top
    for i, row in top.iterrows():
        summary_rows.append({
            'Feature': feat_label,
            'Category': row[feat_col],
            'Solve Rate': round(row['Solved'], 3),
            'Rank': 'Top'
        })
    # Add bottom
    for i, row in bottom.iterrows():
        summary_rows.append({
            'Feature': feat_label,
            'Category': row[feat_col],
            'Solve Rate': round(row['Solved'], 3),
            'Rank': 'Bottom'
        })

summary_table = pd.DataFrame(summary_rows)
# Show top 3 and bottom 3 per feature in a pretty table
summary_table = summary_table.pivot_table(
    index=['Feature', 'Rank'], 
    values=['Category', 'Solve Rate'], 
    aggfunc=lambda x: list(x)
).reset_index()

# Display the summary table
print(summary_table.to_string(index=False))
         Feature   Rank                                                                                        Category            Solve Rate
            Area Bottom                                                                 [Hollywood, Pacific, Southeast]  [0.299, 0.294, 0.27]
            Area    Top                                                             [Mission, West Valley, N Hollywood] [0.566, 0.564, 0.547]
      Crime Type Bottom [THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER), THEFT PLAIN - ATTEMPT, BURGLARY FROM VEHICLE]   [0.106, 0.1, 0.058]
      Crime Type    Top                                                    [LYNCHING, PROWLER, MANSLAUGHTER, NEGLIGENT]       [1.0, 1.0, 1.0]
Victim Age Group Bottom                                                                             [10-19, 60-69, 0-9]  [0.375, 0.372, 0.33]
Victim Age Group    Top                                                                           [80-89, 20-29, 30-39]  [0.454, 0.428, 0.42]
Victim Ethnicity Bottom                                                                  [Vietnamese, Unknown, Chinese] [0.333, 0.283, 0.167]
Victim Ethnicity    Top                                                                    [Japanese, Filipino, Korean] [0.556, 0.545, 0.495]
   Victim Gender Bottom                                                                     [Unknown, Male, Non-binary] [0.455, 0.362, 0.284]
   Victim Gender    Top                                                                   [Homosexual, Female, Unknown]   [0.5, 0.462, 0.455]
          Weapon Bottom                        [UNKNOWN TYPE CUTTING INSTRUMENT, UNKNOWN FIREARM, SEMI-AUTOMATIC RIFLE] [0.174, 0.143, 0.143]
          Weapon    Top                                                                   [KITCHEN KNIFE, SWORD, GLASS] [0.668, 0.649, 0.637]

Visualization¶

Use effective visuals to support your analysis.

  • Include charts such as bar plots, feature importance, and heatmaps
  • Make all visualizations interpretable and relevant
  • Label axes, legends, and titles clearly
In [45]:
def get_top10_solve_rate(crime_df, group_col, solved_label='Solved'):
    solve_by_feature = (
        crime_df.groupby(group_col, observed=True)['case_solved']
        .value_counts(normalize=True)
        .unstack()
        .fillna(0)
        .sort_values(by=solved_label, ascending=False)
    )
    top10 = solve_by_feature[solved_label].head(10).reset_index()
    top10.columns = [group_col, 'solve_rate']
    return top10
In [46]:
# Add to features
features = [
    ('crime_description',   'Solve Rate by Crime Type (Top 10)',      'Crime Type'),
    ('weapon_description',  'Solve Rate by Weapon (Top 10)',          'Weapon'),
    ('victim_ethnicity',    'Solve Rate by Victim Ethnicity (Top 10)','Victim Ethnicity'),
    ('victim_gender',       'Solve Rate by Victim Gender (Top 10)',   'Victim Gender'),
    ('area_name',           'Solve Rate by Area (Top 10)',            'Area'),
    ('victim_age_group',    'Solve Rate by Victim Age Group (Top 10)', 'Victim Age Group')
]
dfs = [get_top10_solve_rate(crime_df, f[0]) for f in features]

# Make subplots
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=[f[1] for f in features],
    shared_xaxes=False, shared_yaxes=False,
    horizontal_spacing=0.20
)

for idx, (df, (col, title, ylabel)) in enumerate(zip(dfs, features)):
    row = idx // 2 + 1
    col_num = idx % 2 + 1
    fig.add_trace(
        go.Bar(
            x=df['solve_rate'],
            y=df[col],
            orientation='h',
            marker_color='steelblue',
            text=(df['solve_rate']*100).round(1).astype(str)+'%',
            textposition='auto',
        ),
        row=row, col=col_num
    )
    fig.update_yaxes(title_text=ylabel, row=row, col=col_num)
    fig.update_xaxes(range=[0, 1], title_text='Solve Rate', row=row, col=col_num)

fig.update_layout(
    height=1200, width=1000,
    title_text="Solve Rate by Feature (Top 10)",
    showlegend=False,
    font=dict(size=8),
    margin=dict(l=0, r=0, t=60, b=5),
)
for i in range(1, len(features) + 1):
    fig['layout']['annotations'][i-1]['font'] = dict(size=12)

fig.show()

Feature Importance (using a simple model)

In [47]:
importances = None

# Prepare data
features_for_model = ['crime_description', 'weapon_description', 'victim_ethnicity',
                      'victim_gender', 'area_name', 'victim_age_group']
df_model = crime_df.dropna(subset=features_for_model + ['case_solved'])

# Encode categorical features numerically
df_model_encoded = df_model.copy()
label_encoders = {}
for col in features_for_model + ['case_solved']:
    le = LabelEncoder()
    df_model_encoded[col] = le.fit_transform(df_model_encoded[col].astype(str))
    label_encoders[col] = le

X = df_model_encoded[features_for_model]
y = df_model_encoded['case_solved']

# Fit model
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_

imp_df = pd.DataFrame({'feature': features_for_model, 'importance': importances}).sort_values('importance', ascending=False)

fig = px.bar(imp_df, x='importance', y='feature', orientation='h',
             title='Feature Importance for Predicting Case Solved',
             text=imp_df['importance'].round(3))
fig.update_traces(textposition='outside')
fig.update_layout(yaxis_title='Feature', xaxis_title='Importance')
fig.show()
In [48]:
numeric_cols = crime_df.select_dtypes(include='number')
corr = numeric_cols.corr()

fig = ff.create_annotated_heatmap(
    z=corr.values,
    x=list(corr.columns),
    y=list(corr.index),
    annotation_text=corr.round(2).values,
    colorscale='Viridis'
)
fig.update_layout(title_text='Correlation Heatmap of Numeric Features', width=800, height=800)
fig.show()

Feature Engineering¶

Create or refine features to enhance model quality.

  • Transform or derive features
  • Encode categorical variables
  • Normalize or scale features as appropriate
In [49]:
crime_df.dtypes
Out[49]:
date_reported          datetime64[ns]
date_occurred          datetime64[ns]
area                         category
area_name                    category
part                         category
crime_code                   category
victim_age                      int32
victim_gender                  object
victim_ethnicity               object
premise_code                 category
premise_description          category
weapon_code                  category
crime_code_1                 category
latitude                      float64
longitude                     float64
case_solved                  category
weapon_description           category
crime_description            category
year                            int32
month                           int32
day_of_week                    object
month_name                     object
hour                            int32
victim_age_group             category
dtype: object

Several features were engineered or transformed to enhance model quality and analysis. Categorical variables (such as gender and ethnicity) were cleaned for consistency. New time-based features—like year, month, hour, and day of week—were derived from date columns to support temporal analysis. Victim age was grouped into categorical bins to simplify trends by age segment. Code columns were joined with their descriptions for greater interpretability, and redundant columns were removed. All features were cast to their appropriate data types to ensure optimal performance for modeling and visualization.

Transform or Derive Features¶

Already derived:

  • year
  • month
  • day_of_week
  • hour
  • victim_age_group

Additional:

  • is_weekend (binary: Saturday/Sunday)
  • day_of_year
  • season
  • Interaction Features Victim Gender/Ethnicity
  • Interaction Features Area/Crime Type
In [50]:
# Weekend indicator
crime_df['is_weekend'] = crime_df['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)

# Season (basic example)
crime_df['season'] = crime_df['month'].map({12:'Winter', 1:'Winter', 2:'Winter',
                                            3:'Spring', 4:'Spring', 5:'Spring',
                                            6:'Summer', 7:'Summer', 8:'Summer',
                                            9:'Fall', 10:'Fall', 11:'Fall'})

# Day of the year
crime_df['day_of_year'] = crime_df['date_occurred'].dt.dayofyear

# Interaction Features Victim Gender/Ethnicity
crime_df['victim_gender_ethnicity'] = (
    crime_df['victim_gender'].astype(str) + '_' + crime_df['victim_ethnicity'].astype(str)
)

# Interaction Features Area/Crime Type
crime_df['area_crime_type'] = (
    crime_df['area_name'].astype(str) + '_' + crime_df['crime_description'].astype(str)
)

Encode Categorical Variables¶

  • For tree-based models (RandomForest, XGBoost): Label encoding is fine for most variables.
  • For linear models or neural nets: use one-hot encoding or ordinal encoding.
In [51]:
# Create a copy for modeling/encoding
crime_model_df = crime_df.copy()

cat_cols = [
    'area', 'area_name', 'part', 'crime_code', 'victim_gender', 'victim_ethnicity',
    'premise_code', 'premise_description', 'weapon_code', 'crime_code_1','day_of_week', 'month_name',
    'case_solved', 'weapon_description', 'crime_description', 'victim_age_group', 'season',
    'victim_gender_ethnicity', 'area_crime_type'
]

label_encoders = {}
for col in cat_cols:
    if col in crime_model_df.columns:
        le = LabelEncoder()
        crime_model_df[col] = le.fit_transform(crime_model_df[col].astype(str))
        label_encoders[col] = le
In [52]:
crime_model_df.dtypes
Out[52]:
date_reported              datetime64[ns]
date_occurred              datetime64[ns]
area                                int32
area_name                           int32
part                                int32
crime_code                          int32
victim_age                          int32
victim_gender                       int32
victim_ethnicity                    int32
premise_code                        int32
premise_description                 int32
weapon_code                         int32
crime_code_1                        int32
latitude                          float64
longitude                         float64
case_solved                         int32
weapon_description                  int32
crime_description                   int32
year                                int32
month                               int32
day_of_week                         int32
month_name                          int32
hour                                int32
victim_age_group                    int32
is_weekend                          int32
season                              int32
day_of_year                         int32
victim_gender_ethnicity             int32
area_crime_type                     int32
dtype: object
In [53]:
crime_model_df.head()
Out[53]:
date_reported date_occurred area area_name part crime_code victim_age victim_gender victim_ethnicity premise_code ... month day_of_week month_name hour victim_age_group is_weekend season day_of_year victim_gender_ethnicity area_crime_type
0 2021-08-21 2021-08-21 23:51:00 3 0 0 4 51 0 1 0 ... 8 2 1 23 5 1 2 233 1 51
1 2021-02-20 2021-02-20 22:50:00 11 13 1 56 0 0 9 20 ... 2 2 3 22 0 1 3 51 9 826
2 2021-07-29 2021-07-28 21:00:00 5 12 0 13 0 3 9 18 ... 7 6 5 21 0 0 2 209 31 719
3 2021-12-25 2021-12-25 19:00:00 0 1 0 4 53 2 11 0 ... 12 2 2 19 5 1 3 359 27 122
4 2021-06-24 2021-06-24 02:35:00 17 6 0 4 40 2 1 0 ... 6 4 6 2 4 0 2 175 17 388

5 rows × 29 columns

Normalize or Scale Features¶

  • Scale numeric features (victim_age, latitude, longitude, hour, etc.) for models sensitive to feature magnitude (logistic regression, neural nets).
  • Tree-based models do not require feature scaling, but it can help with KNN, SVM, etc.

Only scale the columns that are naturally continuous and exclude label-encoded categoricals.

In [54]:
numeric_cols = ['victim_age', 'latitude', 'longitude', 'year', 'month', 'hour', 'day_of_year']
scaler = StandardScaler()
crime_model_df[numeric_cols] = scaler.fit_transform(crime_model_df[numeric_cols])

Model Building¶

Build classification models to predict whether a case will be solved.

  • Use at least two models (such as Logistic Regression, Random Forest)
  • Perform a proper train-test split
  • Apply appropriate preprocessing (e.g., scaling, encoding)
  • Tune hyperparameters if needed

Prepare Features and Target¶

Choose your features (exclude non-numeric identifiers, text, or unused columns)

In [55]:
# Drop ID/time columns if not used in modeling
feature_cols = [
    'victim_age', 'latitude', 'longitude', 'hour', 'year', 'month', 'day_of_year',
    'area', 'area_name', 'part', 'crime_code', 'victim_gender', 'victim_ethnicity',
    'premise_code', 'premise_description', 'weapon_code', 'crime_code_1',
    'weapon_description', 'crime_description', 'victim_age_group',
    'is_weekend', 'season', 'victim_gender_ethnicity', 'area_crime_type'
]

X = crime_model_df[feature_cols].copy()
y = crime_model_df['case_solved'].copy()

Train-Test-Validation Split¶

Let’s do a 60/20/20 split:

  • 60% train
  • 20% validation
  • 20% test
In [56]:
# First split: Train+Validation vs. Test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)
# Second split: Train vs. Validation
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)
# (Now: 60% train, 20% validation, 20% test)

print(f"Train X (Independent Features): {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")
print(f"Train y (Dependent Variable): {y_train.shape}, Val: {y_val.shape}, Test: {y_test.shape}")
Train X (Independent Features): (38318, 24), Val: (12773, 24), Test: (12773, 24)
Train y (Dependent Variable): (38318,), Val: (12773,), Test: (12773,)

Model 1: Logistic Regression¶

Increased max_iter to 10,000 for Logistic Regression to ensure convergence with a high number of features and categorical variables.

In [57]:
# Main parameters shown: penalty, C, solver, max_iter, random_state
logreg = LogisticRegression(
    penalty='l2',          # Regularization type
    C=1.0,                 # Inverse of regularization strength
    solver='lbfgs',        # Optimization algorithm
    max_iter=10000,         # Max number of iterations
    random_state=42        # Seed
)
logreg.fit(X_train, y_train)

print("Logistic Regression fit parameters:")
print(logreg.get_params())
Logistic Regression fit parameters:
{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 10000, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}

Model 2: Random Forest¶

In [58]:
# Main parameters shown: n_estimators, max_depth, min_samples_split, random_state
rf = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=None,        # Max tree depth
    min_samples_split=2,   # Min samples to split an internal node
    random_state=42,       # Seed
    n_jobs=-1              # Use all CPU cores
)
rf.fit(X_train, y_train)

print("Random Forest fit parameters:")
print(rf.get_params())
Random Forest fit parameters:
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 100, 'n_jobs': -1, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}

Initial Validation & Test¶

In [59]:
# Validation
y_val_pred_logreg = logreg.predict(X_val)
y_val_pred_rf     = rf.predict(X_val)

print("LogReg Validation Accuracy:", accuracy_score(y_val, y_val_pred_logreg))
print("RF Validation Accuracy:", accuracy_score(y_val, y_val_pred_rf))

# Final test can be performed after selecting the best model
LogReg Validation Accuracy: 0.6323494871995615
RF Validation Accuracy: 0.7042198387223049

Evaluate how well your models perform¶

  • Use metrics such as accuracy, precision, recall, F1-score, AUC-ROC
  • You can include confusion matrices and interpret results
  • Discuss the effect of class imbalance and possible mitigation
  • Compare model performance meaningfully

Generate All Key Metric¶

In [60]:
# Predict on validation and test sets (use your models: logreg, rf)
y_val_pred_logreg = logreg.predict(X_val)
y_val_pred_rf     = rf.predict(X_val)
y_test_pred_logreg = logreg.predict(X_test)
y_test_pred_rf     = rf.predict(X_test)

# For AUC, need predicted probabilities (for positive class)
y_val_proba_logreg = logreg.predict_proba(X_val)[:,1]
y_val_proba_rf     = rf.predict_proba(X_val)[:,1]
y_test_proba_logreg = logreg.predict_proba(X_test)[:,1]
y_test_proba_rf     = rf.predict_proba(X_test)[:,1]

Display Metrics¶

In [61]:
def print_metrics(y_true, y_pred, y_proba, model_name="Model"):
    print(f"\n{model_name} Performance:")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_true, y_proba))
    print(classification_report(y_true, y_pred))

print_metrics(y_val, y_val_pred_logreg, y_val_proba_logreg, "Logistic Regression (Val)")
print_metrics(y_val, y_val_pred_rf, y_val_proba_rf, "Random Forest (Val)")

print_metrics(y_test, y_test_pred_logreg, y_test_proba_logreg, "Logistic Regression (Test)")
print_metrics(y_test, y_test_pred_rf, y_test_proba_rf, "Random Forest (Test)")
Logistic Regression (Val) Performance:
Accuracy : 0.6323494871995615
Precision: 0.579849946409432
Recall   : 0.3155745673731285
F1-score : 0.4087131704860237
AUC-ROC  : 0.6499525115128045
              precision    recall  f1-score   support

           0       0.65      0.85      0.73      7630
           1       0.58      0.32      0.41      5143

    accuracy                           0.63     12773
   macro avg       0.61      0.58      0.57     12773
weighted avg       0.62      0.63      0.60     12773


Random Forest (Val) Performance:
Accuracy : 0.7042198387223049
Precision: 0.6690611840475601
Recall   : 0.5251798561151079
F1-score : 0.5884531590413943
AUC-ROC  : 0.7548723162379027
              precision    recall  f1-score   support

           0       0.72      0.82      0.77      7630
           1       0.67      0.53      0.59      5143

    accuracy                           0.70     12773
   macro avg       0.69      0.68      0.68     12773
weighted avg       0.70      0.70      0.70     12773


Logistic Regression (Test) Performance:
Accuracy : 0.6248336334455492
Precision: 0.5600821636425881
Recall   : 0.3181022749368073
F1-score : 0.40575396825396826
AUC-ROC  : 0.6369829558760982
              precision    recall  f1-score   support

           0       0.64      0.83      0.73      7630
           1       0.56      0.32      0.41      5143

    accuracy                           0.62     12773
   macro avg       0.60      0.57      0.57     12773
weighted avg       0.61      0.62      0.60     12773


Random Forest (Test) Performance:
Accuracy : 0.6923980270883896
Precision: 0.650844930417495
Recall   : 0.5092358545595955
F1-score : 0.5713974037307734
AUC-ROC  : 0.7365395813419047
              precision    recall  f1-score   support

           0       0.71      0.82      0.76      7630
           1       0.65      0.51      0.57      5143

    accuracy                           0.69     12773
   macro avg       0.68      0.66      0.67     12773
weighted avg       0.69      0.69      0.68     12773

Confusion Matrix Plots¶

In [62]:
def plot_confusion(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(4,4))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(title)
    plt.show()

plot_confusion(y_val, y_val_pred_logreg, "LogReg Validation Confusion Matrix")
plot_confusion(y_val, y_val_pred_rf, "Random Forest Validation Confusion Matrix")
No description has been provided for this image
No description has been provided for this image

ROC Curves¶

In [63]:
RocCurveDisplay.from_estimator(logreg, X_val, y_val)
plt.title("Logistic Regression ROC Curve (Validation)")
plt.show()

RocCurveDisplay.from_estimator(rf, X_val, y_val)
plt.title("Random Forest ROC Curve (Validation)")
plt.show()
No description has been provided for this image
No description has been provided for this image
In [64]:
print(y.value_counts(normalize=True))
case_solved
0    0.597379
1    0.402621
Name: proportion, dtype: float64

Insights, Interpretation, and Reporting¶

Communicate your findings clearly, professionally, and meaningfully.

  • Include an executive summary outlining key findings and recommendations
  • Clearly explain what factors most influence case solvability
  • Highlight any spatial, temporal, or demographic patterns
  • Suggest improvements or interventions based on your results
  • Reflect on limitations of your analysis and model
  • Structure your report into logical, readable sections with explanations

Executive Summary¶

This project aimed to predict the solvability of crime cases using a comprehensive dataset of reported incidents, enriched by extensive feature engineering and machine learning analysis.

Key findings:

  • Ethnic Distribution: The dataset reflects a predominance of Hispanic and Black victims, which may be linked to the underlying population distribution, reporting behaviour, or exposure to risk.
  • Gender Distribution: Both male and female victims are prevalent, but males are somewhat more common in this dataset.
  • Age Distribution: Victims are most commonly young adults, with a steep decline in frequency among older populations.
  • Random Forest achieved a validation accuracy of 70.4% and an AUC-ROC of 0.75, outperforming Logistic Regression on all metrics.
  • The most influential factors for case solvability included spatial (area), temporal (hour, day of week), demographic (victim age, gender, ethnicity), and crime-type features.
  • Targeted interventions are recommended for districts and periods with lower solve rates, and for crime types and demographics with historically lower solvability.
  • Limitations include missing values, possible unobserved variables, and moderate class imbalance.

Factors Influencing Case Solvability¶

Analysis of feature importance using Random Forest and SHAP summary plots reveals that:

  • Area name and crime description are the top predictors of case solvability.
  • Temporal factors such as hours and days of the week have a strong influence, with cases reported during the day and on weekdays being more likely to be solved.
  • Victim demographics—especially the interaction of gender and ethnicity play a notable role, indicating the need for demographic-sensitive investigative strategies.
  • Weapon and crime type further segment solvability, with certain violent crimes or weapon-involved incidents showing lower solve rates.

Spatial, Temporal, and Demographic Patterns¶

  • Spatial Patterns:

Heatmap visualizations show clusters of both solved and unsolved cases. Some areas consistently outperform others in case resolution (see Table 2: Top 10 Areas by Solve Rate).

  • Temporal Patterns:

Bar plots illustrate that crimes occurring during certain hours (especially early mornings and weekdays) have higher resolution rates. Seasonal analysis supports increased solvability in spring/summer.

  • Demographic Patterns:

The “Solve Rate by Victim Age Group” and detailed bar plots for ethnicity and gender show age and combined demographics as significant, with middle-aged victims and certain demographic profiles more likely to have cases solved.

  • Victim Demographics: Ethnicity, Gender, and Age Distribution

Ethnicity As shown in the Distribution of Victim Ethnicity bar plot, the dataset is predominantly composed of victims identified as Hispanic, Black, and White. Hispanic victims represent the largest group, with close to 29,000 cases, followed by Black (about 15,000) and White (approximately 10,000). Other ethnicities—including Asian, Korean, Filipino, Indigenous, Japanese, Chinese, and Vietnamese—are present in smaller numbers, each with fewer than 5,000 reported victims. The "Other" and "Unknown" categories also account for a notable share, highlighting the presence of missing or non-standardized data.

  • Gender

The Distribution of Victim Gender plot indicates that the majority of crime victims in the dataset are Male and Female, with males slightly outnumbering females (over 31,000 vs. nearly 29,000 cases, respectively). Non-binary and unknown gender categories are present but represent a much smaller fraction of the records. There is also a minimal number of cases identified as "Homosexual" in the gender variable, which may reflect unique reporting conventions or data entry inconsistencies.

  • Age

The Distribution of Victim Age histogram reveals a wide age range among victims, but with pronounced peaks in the 20–35 year age bracket. There is a sharp drop-off in victim counts for ages above 60, and a large spike at age 0, possibly reflecting the way infants or unknown ages are coded. The distribution skews toward younger adults, with the median age likely falling in the late twenties to early thirties.

Recommendations and Suggested Interventions¶

  • Resource Allocation: Prioritize investigative resources in low-performing districts and peak hours identified.
  • Training & Policy: Use insights from feature importance to train officers on cases more likely to remain unsolved, such as those involving certain crime types, weapons, or demographics.
  • Data Collection: Improve data completeness, especially for weapon codes and victim demographic information.
  • Analytical Expansion: Incorporate external factors (e.g., weather, special events) in future models to boost prediction accuracy.
  • Class Imbalance: Continue exploring balanced class weighting and resampling.

Limitations¶

  • Missing and Noisy Data: Several fields had missing or inconsistent entries, requiring removal or imputation.
  • Feature Engineering: While new features (e.g., victim_age_group, season, area_crime_type) improved model performance, some potentially predictive attributes may be absent from the current dataset.
  • Model Generalizability: The best-performing Random Forest model may overfit specific patterns in this dataset; performance on other cities or years may differ.
  • Class Imbalance: Although moderate, it still affected recall for the “solved” class; further SMOTE or class weighting could help.

Conclusions¶

This analysis successfully demonstrated how data-driven approaches can be applied to predict the solvability of crime cases using real-world law enforcement data. By thoroughly cleaning and enriching the dataset, performing detailed exploratory analysis, and evaluating advanced classification models, we uncovered valuable patterns and actionable insights.

  • Demographic patterns provide important context for model interpretation and for developing prevention or intervention strategies. For example, targeted outreach, resource allocation, or investigative strategies may be more effective if informed by the prevailing victim profiles in the data. Additionally, understanding demographic skews can help identify possible data collection or reporting biases.

  • Predictive Modelling:

Random Forest models, after careful feature engineering and data preparation, provided the highest accuracy and recall for identifying solved cases, outperforming simpler linear approaches like Logistic Regression. This indicates the presence of complex, nonlinear relationships in the factors affecting case outcomes.

  • Key Influences:

The analysis revealed that spatial (area/district), temporal (hour, day of week, season), demographic (victim age, gender, ethnicity), and crime-type features are among the most significant predictors of case solvability. Interaction features further enhanced model performance, confirming that multifaceted relationships exist within the data.

  • Pattern Recognition:

Spatial heatmaps, time-based trend analyses, and demographic group bar plots helped reveal which neighbourhoods, periods, and victim profiles are associated with higher or lower case resolution rates.

  • Practical Insights:

Results suggest that focusing investigative resources on low-performing districts, peak hours, and certain demographic or crime types may improve overall case solvability. Improved data collection, especially regarding victim and weapon characteristics, can further enhance prediction and operational strategies.

  • Model Limitations:

The analysis is subject to limitations such as missing or incomplete data, moderate class imbalance, and the possible exclusion of relevant external factors (e.g., investigative effort, socioeconomic context). Model generalizability to other regions or future datasets may be limited without further validation.

  • Future Directions:

Additional improvements could include integrating more diverse data sources, experimenting with more advanced ensemble or deep learning methods, and developing interpretable model explanations (such as SHAP plots) for operational use.

Overall, this notebook provides a strong foundation for using predictive analytics to support decision-making in law enforcement, offering both immediate practical recommendations and a roadmap for continued analytical development.