This final part is mandatory and must be submitted. It focuses on analyzing and modeling the LAPD Crime Case Dataset (crime_dataset.csv) provided to you.
In addition, two lookup files are included to help you interpret and enrich the categorical codes related to crime types and weapon types. You are expected to use them appropriately as part of your analysis.
Your goal is to apply the full range of skills covered in this course, including data processing, analysis, and machine learning. You are expected to deliver a structured analysis and a final report containing your findings, supported by data, visualizations, and code.
Data Cleaning¶
Prepare the dataset for analysis by resolving common data quality issues.
- Identify and handle missing values
- Remove or review duplicates and inconsistencies
- Ensure correct data types
- Normalize inconsistent categorical values
Install additional libraries¶
!pip install folium
!pip install branca
Requirement already satisfied: folium in c:\users\mzamu\anaconda3\lib\site-packages (0.20.0) Requirement already satisfied: branca>=0.6.0 in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (0.8.1) Requirement already satisfied: jinja2>=2.9 in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (3.1.4) Requirement already satisfied: numpy in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (1.26.4) Requirement already satisfied: requests in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (2.32.3) Requirement already satisfied: xyzservices in c:\users\mzamu\anaconda3\lib\site-packages (from folium) (2022.9.0) Requirement already satisfied: MarkupSafe>=2.0 in c:\users\mzamu\anaconda3\lib\site-packages (from jinja2>=2.9->folium) (2.1.3) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (3.7) Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (2.2.3) Requirement already satisfied: certifi>=2017.4.17 in c:\users\mzamu\anaconda3\lib\site-packages (from requests->folium) (2025.4.26) Requirement already satisfied: branca in c:\users\mzamu\anaconda3\lib\site-packages (0.8.1) Requirement already satisfied: jinja2>=3 in c:\users\mzamu\anaconda3\lib\site-packages (from branca) (3.1.4) Requirement already satisfied: MarkupSafe>=2.0 in c:\users\mzamu\anaconda3\lib\site-packages (from jinja2>=3->branca) (2.1.3)
Imports¶
# Data manipulation libraries
import numpy as np
import pandas as pd
# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
# Machine learning libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
accuracy_score,
classification_report,
confusion_matrix,
f1_score,
precision_score,
recall_score,
roc_auc_score,
RocCurveDisplay
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Load datasets
crime_df = pd.read_csv('crime_dataset.csv')
weapon_types_df = pd.read_csv('weapon_types.csv')
crime_types_df = pd.read_csv('crime_types.csv')
# Quick preview
print("Crime Dataset:")
display(crime_df.shape)
display(crime_df.head())
print("\nWeapon Types:")
display(weapon_types_df.shape)
display(weapon_types_df.head())
print("\nCrime Types:")
display(crime_types_df.shape)
display(crime_types_df.head())
Crime Dataset:
(203089, 25)
division_number | date_reported | date_occurred | area | area_name | reporting_district | part | crime_code | modus_operandi | victim_age | ... | crime_code_1 | crime_code_2 | crime_code_3 | crime_code_4 | incident_admincode | location | cross_street | latitude | longitude | case_solved | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 211414090 | 2021-06-27 | 2021-06-20 20:00:00 | 14 | Pacific | 1464 | 1 | 480 | 0344 | 32.0 | ... | 480.0 | NaN | NaN | NaN | 0 | 12400 FIELDING | NaN | 33.9791 | -118.4092 | Not solved |
1 | 210504861 | 2021-01-22 | 2021-01-21 22:00:00 | 5 | Harbor | 515 | 1 | 510 | NaN | 0.0 | ... | 510.0 | NaN | NaN | NaN | 1 | 1500 BAY VIEW AV | NaN | 33.7929 | -118.2710 | Solved |
2 | 210104843 | 2021-01-21 | 2021-01-21 02:00:00 | 1 | Central | 139 | 1 | 510 | NaN | 0.0 | ... | 510.0 | NaN | NaN | NaN | 1 | 300 S SANTA FE AV | NaN | 34.0420 | -118.2326 | Solved |
3 | 210115564 | 2021-08-22 | 2021-08-22 07:00:00 | 1 | Central | 151 | 1 | 350 | 1308 0344 0345 1822 | 29.0 | ... | 350.0 | NaN | NaN | NaN | 0 | 7TH | FIGUEROA | 34.0496 | -118.2603 | Not solved |
4 | 211421187 | 2021-11-09 | 2021-11-07 19:00:00 | 14 | Pacific | 1465 | 1 | 510 | NaN | 0.0 | ... | 510.0 | NaN | NaN | NaN | 0 | 5500 MESMER AV | NaN | 33.9869 | -118.4022 | Not solved |
5 rows × 25 columns
Weapon Types:
(73, 2)
weapon_code | weapon_description | |
---|---|---|
0 | 400 | STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) |
1 | 500 | UNKNOWN WEAPON/OTHER WEAPON |
2 | 102 | HAND GUN |
3 | 201 | KNIFE WITH BLADE OVER 6 INCHES IN LENGTH |
4 | 107 | OTHER FIREARM |
Crime Types:
(133, 2)
crime_code | crime_description | |
---|---|---|
0 | 480 | BIKE - STOLEN |
1 | 510 | VEHICLE - STOLEN |
2 | 350 | THEFT, PERSON |
3 | 440 | THEFT PLAIN - PETTY ($950 & UNDER) |
4 | 420 | THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER) |
Identify and Handle Missing Values¶
# Check missing values in each dataset
print("Missing values in crime_df:\n", crime_df.isnull().sum())
print("\nMissing values in weapon_types_df:\n", weapon_types_df.isnull().sum())
print("\nMissing values in crime_types_df:\n", crime_types_df.isnull().sum())
Missing values in crime_df: division_number 0 date_reported 0 date_occurred 0 area 0 area_name 0 reporting_district 0 part 0 crime_code 0 modus_operandi 28929 victim_age 4062 victim_gender 27729 victim_ethnicity 27731 premise_code 3 premise_description 92 weapon_code 131843 crime_code_1 2 crime_code_2 187239 crime_code_3 202560 crime_code_4 203071 incident_admincode 0 location 0 cross_street 167414 latitude 0 longitude 0 case_solved 0 dtype: int64 Missing values in weapon_types_df: weapon_code 0 weapon_description 0 dtype: int64 Missing values in crime_types_df: crime_code 0 crime_description 0 dtype: int64
Handling Missing Data¶
In this analysis, missing values are expected in several columns due to the nature of crime reporting. For columns where missingness indicates unreported or inapplicable information (e.g., weapon_code
, victim_gender
, modus_operandi
), we filled missing values with the label "Unknown"
. For victim_age
, we used the median age to impute missing values, ensuring statistical consistency. Columns with more than 95% missing values and no analytical value were dropped to streamline the dataset.
# Fill missing categorical data with a placeholder
for col in ['victim_gender', 'victim_ethnicity', 'modus_operandi',
'premise_description', 'cross_street']:
crime_df[col] = crime_df[col].fillna('Unknown')
# Fill missing weapon_code data with a 500 this is the value for be categorized based on the weapon types dataset
crime_df[col] = crime_df['weapon_code'].fillna(500)
# For numeric columns
crime_df = crime_df.dropna(subset=['victim_age', 'latitude', 'longitude'])
# Columns with excessive missing values
threshold = 0.90 # Drop columns with >90% missing
cols_to_drop = [col for col in crime_df.columns if crime_df[col].isna().mean() > threshold]
crime_df = crime_df.drop(columns=cols_to_drop)
print("Dropped columns due to excessive missing values:", cols_to_drop)
crime_df = crime_df.dropna()
Dropped columns due to excessive missing values: ['crime_code_2', 'crime_code_3', 'crime_code_4']
print("Missing values in crime_df:\n", crime_df.isnull().sum())
crime_df.shape
Missing values in crime_df: division_number 0 date_reported 0 date_occurred 0 area 0 area_name 0 reporting_district 0 part 0 crime_code 0 modus_operandi 0 victim_age 0 victim_gender 0 victim_ethnicity 0 premise_code 0 premise_description 0 weapon_code 0 crime_code_1 0 incident_admincode 0 location 0 cross_street 0 latitude 0 longitude 0 case_solved 0 dtype: int64
(69823, 22)
Remove or review duplicates and inconsistencies¶
# Check for duplicates
print("Duplicates in crime_df:", crime_df.duplicated().sum())
print("Duplicates in weapon_types_df:", weapon_types_df.duplicated().sum())
print("Duplicates in crime_types_df:", crime_types_df.duplicated().sum())
Duplicates in crime_df: 0 Duplicates in weapon_types_df: 0 Duplicates in crime_types_df: 0
# Remove duplicates
crime_df = crime_df.drop_duplicates()
weapon_types_df = weapon_types_df.drop_duplicates()
crime_types_df = crime_types_df.drop_duplicates()
Ensure correct data types¶
# Check data types
print("Crime Dataset:")
print(crime_df.dtypes)
print("\nWeapon Types:")
print(weapon_types_df.dtypes)
print("\nCrime Types:")
print(crime_types_df.dtypes)
Crime Dataset: division_number int64 date_reported object date_occurred object area int64 area_name object reporting_district int64 part int64 crime_code int64 modus_operandi object victim_age float64 victim_gender object victim_ethnicity object premise_code float64 premise_description object weapon_code float64 crime_code_1 float64 incident_admincode int64 location object cross_street float64 latitude float64 longitude float64 case_solved object dtype: object Weapon Types: weapon_code int64 weapon_description object dtype: object Crime Types: crime_code int64 crime_description object dtype: object
# Convert date columns
crime_df['date_reported'] = pd.to_datetime(crime_df['date_reported'], errors='coerce')
crime_df['date_occurred'] = pd.to_datetime(crime_df['date_occurred'], errors='coerce')
# Ensure categorical types
crime_df['division_number'] = crime_df['division_number'].astype('category')
crime_df['area'] = crime_df['area'].astype('category')
crime_df['reporting_district'] = crime_df['reporting_district'].astype('category')
crime_df['part'] = crime_df['part'].astype('category')
crime_df['crime_code'] = crime_df['crime_code'].astype('category')
crime_df['premise_code'] = crime_df['premise_code'].astype('category')
crime_df['crime_code_1'] = crime_df['crime_code_1'].astype('category')
crime_df['incident_admincode'] = crime_df['incident_admincode'].astype('category')
crime_df['weapon_code'] = crime_df['weapon_code'].astype('category')
crime_df['area_name'] = crime_df['area_name'].astype('category')
crime_df['case_solved'] = crime_df['case_solved'].astype('category')
crime_df['premise_description'] = crime_df['premise_description'].astype('category')
crime_df['victim_ethnicity'] = crime_df['victim_ethnicity'].astype('category')
crime_df['victim_gender'] = crime_df['victim_gender'].astype('category')
weapon_types_df['weapon_code'] = weapon_types_df['weapon_code'].astype('category')
weapon_types_df['weapon_description'] = weapon_types_df['weapon_description'].astype('category')
crime_types_df['crime_description'] = crime_types_df['crime_description'].astype('category')
crime_types_df['crime_code'] = crime_types_df['crime_code'].astype('category')
# Ensure number types
crime_df['victim_age'] = crime_df['victim_age'].astype(int)
crime_df['latitude'] = pd.to_numeric(crime_df['latitude'], errors='coerce')
crime_df['longitude'] = pd.to_numeric(crime_df['longitude'], errors='coerce')
weapon_types_df = weapon_types_df.set_index('weapon_code')
crime_types_df = crime_types_df.set_index('crime_code')
print("Crime Dataset:")
print(crime_df.dtypes)
print("\nWeapon Types:")
print(weapon_types_df.dtypes)
print("\nCrime Types:")
print(crime_types_df.dtypes)
Crime Dataset: division_number category date_reported datetime64[ns] date_occurred datetime64[ns] area category area_name category reporting_district category part category crime_code category modus_operandi object victim_age int32 victim_gender category victim_ethnicity category premise_code category premise_description category weapon_code category crime_code_1 category incident_admincode category location object cross_street float64 latitude float64 longitude float64 case_solved category dtype: object Weapon Types: weapon_description category dtype: object Crime Types: crime_description category dtype: object
Normalize inconsistent categorical values¶
# Loop through all categorical columns and show top 10 most frequent values
categorical_cols = crime_df.select_dtypes(include=['category', 'object']).columns
for col in categorical_cols:
print(f"\nColumn: {col}")
value_counts = crime_df[col].value_counts(dropna=False)
top_10 = value_counts.head(10)
print("Top 10 most frequent values:")
print(top_10)
print(f"Total unique values: {crime_df[col].nunique(dropna=True)}")
Column: division_number Top 10 most frequent values: division_number 201600896 1 211406652 1 211406712 1 211406711 1 211406710 1 211406707 1 211406704 1 211406673 1 211406645 1 211407280 1 Name: count, dtype: int64 Total unique values: 69823 Column: area Top 10 most frequent values: area 12 6259 1 5224 18 4757 6 4674 3 4473 13 4033 2 3904 20 3764 14 3325 5 2971 Name: count, dtype: int64 Total unique values: 21 Column: area_name Top 10 most frequent values: area_name 77th Street 6259 Central 5224 Southeast 4757 Hollywood 4674 Southwest 4473 Newton 4033 Rampart 3904 Olympic 3764 Pacific 3325 Harbor 2971 Name: count, dtype: int64 Total unique values: 21 Column: reporting_district Top 10 most frequent values: reporting_district 645 520 636 462 646 403 1822 345 162 333 119 317 245 305 1241 290 1268 287 666 286 Name: count, dtype: int64 Total unique values: 1138 Column: part Top 10 most frequent values: part 2 37675 1 32148 Name: count, dtype: int64 Total unique values: 2 Column: crime_code Top 10 most frequent values: crime_code 624 15410 230 12168 626 10216 210 7064 930 4042 761 3218 236 2912 740 1386 310 1200 220 1093 Name: count, dtype: int64 Total unique values: 102 Column: modus_operandi Top 10 most frequent values: modus_operandi 0416 687 0416 1822 300 0329 243 0421 236 0400 235 1822 0416 225 1501 190 0416 0913 186 0400 0416 179 0444 177 Name: count, dtype: int64 Total unique values: 51648 Column: victim_gender Top 10 most frequent values: victim_gender M 31070 F 28735 X 4052 Male 3116 Female 2831 Unknown 11 H 8 Name: count, dtype: int64 Total unique values: 7 Column: victim_ethnicity Top 10 most frequent values: victim_ethnicity H 31804 B 15659 W 11550 O 4658 X 4587 A 1348 K 109 F 41 Rare 18 I 15 Name: count, dtype: int64 Total unique values: 14 Column: premise_code Top 10 most frequent values: premise_code 101.0 15431 501.0 12222 502.0 11612 102.0 6924 108.0 4123 203.0 2732 210.0 971 122.0 965 103.0 818 109.0 747 Name: count, dtype: int64 Total unique values: 264 Column: premise_description Top 10 most frequent values: premise_description STREET 15431 SINGLE FAMILY DWELLING 12222 MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC) 11612 SIDEWALK 6924 PARKING LOT 4123 OTHER BUSINESS 2732 RESTAURANT/FAST FOOD 971 VEHICLE, PASSENGER/TRUCK 965 ALLEY 818 PARK/PLAYGROUND 747 Name: count, dtype: int64 Total unique values: 263 Column: weapon_code Top 10 most frequent values: weapon_code 400.0 37044 500.0 6943 511.0 4963 102.0 4702 109.0 1639 106.0 1581 200.0 1540 207.0 1229 512.0 815 307.0 756 Name: count, dtype: int64 Total unique values: 73 Column: crime_code_1 Top 10 most frequent values: crime_code_1 624.0 15479 230.0 12170 626.0 10277 210.0 7066 930.0 4023 761.0 3095 236.0 2913 740.0 1418 310.0 1203 220.0 1093 Name: count, dtype: int64 Total unique values: 102 Column: incident_admincode Top 10 most frequent values: incident_admincode 0 41187 1 28636 Name: count, dtype: int64 Total unique values: 2 Column: location Top 10 most frequent values: location 6TH 242 7TH 237 800 N ALAMEDA ST 221 7TH ST 219 6TH ST 209 WESTERN AV 190 5TH 179 HOLLYWOOD 171 BROADWAY 167 FIGUEROA ST 166 Name: count, dtype: int64 Total unique values: 20541 Column: case_solved Top 10 most frequent values: case_solved Not solved 41275 Solved 28548 Name: count, dtype: int64 Total unique values: 2
Removing columns without documentation or information to be handled¶
modus_operandi, division_number, reporting_district
crime_df = crime_df.drop(columns=['modus_operandi','division_number','reporting_district', 'cross_street', 'incident_admincode','location'])
crime_df = crime_df.drop_duplicates()
Victim Gender Value Standardization¶
The victim_gender
column initially contained abbreviations, full words, and ambiguous codes ('M'
, 'Male'
, 'F'
, 'Female'
, 'X'
, 'H'
, 'Unknown'
). To ensure consistency, all values were mapped to a standard set: 'Male'
, 'Female'
, 'Non-binary'
, and 'Homosexual'
.
This avoids double-counting and clarifies how non-binary or unspecified genders are represented.
# Normalizing values for 'victim_gender'
gender_map = {
'M': 'Male',
'F': 'Female',
'Male': 'Male',
'Female': 'Female',
'X': 'Non-binary',
'H': 'Homosexual',
'Unknown': 'Unknown'
}
crime_df['victim_gender'] = crime_df['victim_gender'].map(gender_map).fillna('Unknown')
print(crime_df['victim_gender'].unique())
['Female' 'Non-binary' 'Male' 'Homosexual' 'Unknown']
Victim Ethnicity Value Standardization¶
The victim_ethnicity
column contained a variety of single-letter codes and ambiguous values, such as 'W'
, 'H'
, 'O'
, 'X'
, and 'Rare'
. To improve clarity, these values were mapped to broader, standardized categories (e.g., 'White'
, 'Black'
, 'Hispanic'
, 'Asian'
, 'Other'
, 'Unknown'
). This ensures more consistent and interpretable results in the analysis.
# Ethnicity mapping (adjust as needed for your assignment or codebook)
ethnicity_map = {
'W': 'White',
'B': 'Black',
'H': 'Hispanic',
'A': 'Asian',
'C': 'Chinese',
'I': 'Indigenous',
'K': 'Korean', # Without a codebook, group as 'Other'
'F': 'Filipino',
'V': 'Vietnamese',
'J': 'Japanese',
'Z': 'Other',
'O': 'Other',
'X': 'Unknown',
'Unknown': 'Unknown',
'Rare': 'Other'
}
crime_df['victim_ethnicity'] = crime_df['victim_ethnicity'].map(ethnicity_map).fillna('Unknown')
# Check new unique values
print(crime_df['victim_ethnicity'].unique())
['Black' 'Unknown' 'White' 'Hispanic' 'Other' 'Asian' 'Indigenous' 'Filipino' 'Korean' 'Chinese' 'Vietnamese' 'Japanese']
Exploratory Data Analysis (EDA)¶
Explore and understand the structure, trends, and relationships in the data.
- Perform required operations to bring in extra info from the provided crime and weapon lookup files.
- Generate summary statistics
- Visualize distributions and correlations
- Analyze time- and location-based patterns
- Explore solve rate by various features
Merging the datasets for analysis and reporting¶
crime_df.head(5)
date_reported | date_occurred | area | area_name | part | crime_code | victim_age | victim_gender | victim_ethnicity | premise_code | premise_description | weapon_code | crime_code_1 | latitude | longitude | case_solved | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23 | 2021-08-21 | 2021-08-21 23:51:00 | 12 | 77th Street | 1 | 210 | 51 | Female | Black | 101.0 | STREET | 400.0 | 210.0 | 33.9897 | -118.2915 | Solved |
28 | 2021-02-20 | 2021-02-20 22:50:00 | 2 | Rampart | 2 | 745 | 0 | Female | Unknown | 122.0 | VEHICLE, PASSENGER/TRUCK | 500.0 | 745.0 | 34.0891 | -118.2992 | Not solved |
29 | 2021-07-29 | 2021-07-28 21:00:00 | 14 | Pacific | 1 | 310 | 0 | Non-binary | Unknown | 120.0 | STORAGE SHED | 500.0 | 310.0 | 33.9586 | -118.4485 | Not solved |
36 | 2021-12-25 | 2021-12-25 19:00:00 | 1 | Central | 1 | 210 | 53 | Male | White | 101.0 | STREET | 500.0 | 210.0 | 34.0430 | -118.2420 | Not solved |
37 | 2021-06-24 | 2021-06-24 02:35:00 | 6 | Hollywood | 1 | 210 | 40 | Male | Black | 101.0 | STREET | 102.0 | 210.0 | 34.0998 | -118.3288 | Not solved |
weapon_types_df.head(5)
weapon_description | |
---|---|
weapon_code | |
400 | STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) |
500 | UNKNOWN WEAPON/OTHER WEAPON |
102 | HAND GUN |
201 | KNIFE WITH BLADE OVER 6 INCHES IN LENGTH |
107 | OTHER FIREARM |
crime_types_df.head(5)
crime_description | |
---|---|
crime_code | |
480 | BIKE - STOLEN |
510 | VEHICLE - STOLEN |
350 | THEFT, PERSON |
440 | THEFT PLAIN - PETTY ($950 & UNDER) |
420 | THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER) |
print(weapon_types_df.columns)
print(crime_types_df.columns)
Index(['weapon_description'], dtype='object') Index(['crime_description'], dtype='object')
# Merge with weapon types
crime_df = crime_df.merge(weapon_types_df, on='weapon_code', how='left')
# Merge with crime types
crime_df = crime_df.merge(crime_types_df, on='crime_code', how='left')
crime_df['weapon_code'] = crime_df['weapon_code'].astype('category')
crime_df['crime_code'] = crime_df['crime_code'].astype('category')
# Preview merged columns
print(crime_df[['weapon_code', 'weapon_description', 'crime_code', 'crime_description']].head())
weapon_code weapon_description crime_code \ 0 400.0 STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) 210 1 500.0 UNKNOWN WEAPON/OTHER WEAPON 745 2 500.0 UNKNOWN WEAPON/OTHER WEAPON 310 3 500.0 UNKNOWN WEAPON/OTHER WEAPON 210 4 102.0 HAND GUN 210 crime_description 0 ROBBERY 1 VANDALISM - MISDEAMEANOR ($399 OR UNDER) 2 BURGLARY 3 ROBBERY 4 ROBBERY
Generate summary statistics¶
The summary statistics below provide an overview of the numerical and categorical features in the dataset, including victim demographics, crime counts by type and area, and the proportion of solved cases.
# Summary statistics for numeric columns
print("Numeric Summary Statistics:")
display(crime_df.describe())
Numeric Summary Statistics:
date_reported | date_occurred | victim_age | latitude | longitude | |
---|---|---|---|---|---|
count | 69479 | 69479 | 69479.000000 | 69479.000000 | 69479.000000 |
mean | 2021-07-07 08:10:23.428661760 | 2021-07-05 20:22:01.095870720 | 34.621022 | 33.807383 | -117.457399 |
min | 2021-01-01 00:00:00 | 2021-01-01 00:01:00 | 0.000000 | 0.000000 | -118.667200 |
25% | 2021-04-12 00:00:00 | 2021-04-10 02:55:00 | 24.000000 | 33.996900 | -118.399800 |
50% | 2021-07-11 00:00:00 | 2021-07-09 12:45:00 | 33.000000 | 34.049900 | -118.304900 |
75% | 2021-10-02 00:00:00 | 2021-09-30 19:05:00 | 46.000000 | 34.110400 | -118.269600 |
max | 2021-12-31 00:00:00 | 2021-12-31 23:30:00 | 99.000000 | 34.334300 | 0.000000 |
std | NaN | NaN | 17.686676 | 2.932096 | 10.180443 |
def get_iqr_bounds(series):
Q1 = series.quantile(0.25)
Q3 = series.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
return lower, upper
lat_lower, lat_upper = get_iqr_bounds(crime_df['latitude'])
lon_lower, lon_upper = get_iqr_bounds(crime_df['longitude'])
print(f"Latitude bounds: {lat_lower:.5f} to {lat_upper:.5f}")
print(f"Longitude bounds: {lon_lower:.5f} to {lon_upper:.5f}")
Latitude bounds: 33.82665 to 34.28065 Longitude bounds: -118.59510 to -118.07430
outliers = crime_df[
(crime_df['latitude'] < lat_lower) | (crime_df['latitude'] > lat_upper) |
(crime_df['longitude'] < lon_lower) | (crime_df['longitude'] > lon_upper)
]
print(f"Total outliers found: {outliers.shape[0]}")
Total outliers found: 5615
crime_df = crime_df[
(crime_df['latitude'] >= lat_lower) & (crime_df['latitude'] <= lat_upper) &
(crime_df['longitude'] >= lon_lower) & (crime_df['longitude'] <= lon_upper)
].copy()
print(f"Rows after removing outliers: {crime_df.shape[0]}")
Rows after removing outliers: 63864
# Summary statistics for categorical/object columns
print("Categorical Summary Statistics:")
display(crime_df.describe(include=['category', 'object']))
Categorical Summary Statistics:
area | area_name | part | crime_code | victim_gender | victim_ethnicity | premise_code | premise_description | weapon_code | crime_code_1 | case_solved | weapon_description | crime_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 63864 | 63864 | 63864 | 63864 | 63864 | 63864 | 63864.0 | 63864 | 63864.0 | 63864.0 | 63864 | 63864 | 63864 |
unique | 21 | 21 | 2 | 100 | 5 | 12 | 256.0 | 255 | 73.0 | 100.0 | 2 | 73 | 100 |
top | 12 | 77th Street | 2 | 624 | Male | Hispanic | 101.0 | STREET | 400.0 | 624.0 | Not solved | STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) | BATTERY - SIMPLE ASSAULT |
freq | 6189 | 6189 | 34108 | 13980 | 31379 | 28763 | 14274.0 | 14274 | 33607.0 | 14039.0 | 38151 | 33607 | 13980 |
# Solve rate
print("Case Solved Rate:")
display(crime_df['case_solved'].value_counts(normalize=True))
# Top 5 most common crime types
print("Top 5 Crime Types:")
display(crime_df['crime_description'].value_counts().head())
# Top 5 areas by crime count
print("Top 5 Areas:")
display(crime_df['area_name'].value_counts().head())
Case Solved Rate:
case_solved Not solved 0.597379 Solved 0.402621 Name: proportion, dtype: float64
Top 5 Crime Types:
crime_description BATTERY - SIMPLE ASSAULT 13980 ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT 11190 INTIMATE PARTNER - SIMPLE ASSAULT 9122 ROBBERY 6661 CRIMINAL THREATS - NO WEAPON DISPLAYED 3655 Name: count, dtype: int64
Top 5 Areas:
area_name 77th Street 6189 Central 5138 Southeast 4691 Hollywood 4582 Southwest 4439 Name: count, dtype: int64
Visualize distributions and correlations¶
# Victim Age Distribution
plt.figure(figsize=(8, 4))
sns.histplot(crime_df['victim_age'].dropna(), bins=50, kde=True)
plt.title('Distribution of Victim Age')
plt.xlabel('Victim Age')
plt.ylabel('Count')
plt.xlim(0, 100)
plt.xticks(range(0, 101, 5), rotation=45)
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7) # Adds major/minor grids
plt.show()
plt.figure(figsize=(10, 5))
order = crime_df['victim_ethnicity'].value_counts().index
# Get counts and percentages
counts = crime_df['victim_ethnicity'].value_counts()
percentages = counts / counts.sum() * 100
ax = sns.countplot(
y='victim_ethnicity',
data=crime_df,
order=order,
hue='victim_ethnicity',
palette='Blues_r',
legend=False
)
plt.title('Distribution of Victim Ethnicity')
plt.xlabel('Count')
plt.ylabel('Victim Ethnicity')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
plt.figure(figsize=(7, 4))
order = crime_df['victim_gender'].value_counts().index
# Get counts and percentages
counts = crime_df['victim_gender'].value_counts()
percentages = counts / counts.sum() * 100
ax = sns.countplot(
y='victim_gender',
data=crime_df,
order=order,
hue='victim_gender',
palette='Blues_r',
legend=False
)
plt.title('Distribution of Victim Gender')
plt.xlabel('Count')
plt.ylabel('Victim Gender')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7) # Adds major/minor grids
plt.tight_layout()
plt.show()
print(crime_df['crime_description'].isnull().sum())
print(crime_df['crime_description'].value_counts().head(10))
0 crime_description BATTERY - SIMPLE ASSAULT 13980 ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT 11190 INTIMATE PARTNER - SIMPLE ASSAULT 9122 ROBBERY 6661 CRIMINAL THREATS - NO WEAPON DISPLAYED 3655 BRANDISH WEAPON 2925 INTIMATE PARTNER - AGGRAVATED ASSAULT 2614 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS) 1363 BURGLARY 1149 ATTEMPTED ROBBERY 1029 Name: count, dtype: int64
# Calculate and print the top 10 for verification
top_crimes = crime_df['crime_description'].str.strip().value_counts().head(10)
top_crimes_df = top_crimes.reset_index()
top_crimes_df.columns = ['crime_description', 'count']
plt.figure(figsize=(10, 5))
sns.barplot(data=top_crimes_df, y='crime_description', x='count')
plt.title('Top 10 Crime Types')
plt.xlabel('Number of Crimes')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7) # Adds major/minor grids
plt.ylabel('Crime Type')
plt.tight_layout()
plt.show()
# Crimes by Area (Top 10)
plt.figure(figsize=(10, 5))
top_areas = crime_df['area_name'].str.strip().value_counts().head(10)
sns.barplot(y=top_areas.index, x=top_areas.values)
plt.title('Top 10 Areas by Crime Count')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7) # Adds major/minor grids
plt.xlabel('Number of Crimes')
plt.ylabel('Area')
plt.tight_layout()
plt.show()
# Select only numeric columns (drop non-numeric ones)
numeric_cols = crime_df.select_dtypes(include='number')
plt.figure(figsize=(8, 6))
sns.heatmap(numeric_cols.corr(), annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Heatmap for Numeric Features')
plt.show()
Analyze time- and location-based patterns¶
To identify trends, the dataset was explored across multiple time features (year, month, day of week). Results showed temporal cycles in crime volume, with certain days and months having more incidents. Geographic analysis highlighted concentrations in specific city areas, and mapping latitude/longitude revealed clusters and hotspots of criminal activity.
# Extract date time values from the date occurred
crime_df['year'] = crime_df['date_occurred'].dt.year
crime_df['month'] = crime_df['date_occurred'].dt.month
crime_df['day_of_week'] = crime_df['date_occurred'].dt.day_name()
crime_df['month_name'] = crime_df['date_occurred'].dt.month_name()
crime_df['hour'] = crime_df['date_occurred'].dt.hour
# Crimes per month (over all years)
monthly_counts = crime_df.groupby(['year', 'month']).size().reset_index(name='crime_count')
monthly_counts['year_month'] = monthly_counts['year'].astype(str) + '-' + monthly_counts['month'].astype(str).str.zfill(2)
plt.figure(figsize=(14, 5))
plt.plot(monthly_counts['year_month'], monthly_counts['crime_count'], marker='o')
plt.title('Crimes per Month')
plt.xlabel('Year-Month')
plt.ylabel('Number of Crimes')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7) # Adds major/minor grids
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
# Count crimes per month (all years)
month_counts = crime_df['month_name'].value_counts().reindex([
'January','February','March','April','May','June','July','August','September','October','November','December'
])
plt.figure(figsize=(10,5))
sns.barplot(x=month_counts.index, y=month_counts.values)
plt.title('Total Crimes by Month (All Years Combined)')
plt.ylabel('Number of Crimes')
plt.xlabel('Month')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7) # Adds major/minor grids
plt.show()
plt.figure(figsize=(8, 4))
sns.countplot(data=crime_df, x='day_of_week', order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
plt.title('Crime Count by Day of the Week')
plt.ylabel('Number of Crimes')
plt.xlabel('Day of the Week')
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7) # Adds major/minor grids
plt.show()
# Count crimes per hour
hour_counts = crime_df['hour'].value_counts().sort_index()
plt.figure(figsize=(10,5))
sns.barplot(x=hour_counts.index, y=hour_counts.values, hue=hour_counts.index, legend=False, palette='viridis')
plt.title('Crime Count by Hour of the Day')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Crimes')
plt.xticks(range(0,24))
plt.grid(True, which='both', axis='both', linestyle='--', alpha=0.7) # Adds major/minor grids
plt.tight_layout()
plt.show()
# Day of week
weekday_counts = crime_df['day_of_week'].value_counts()
weekday_max = weekday_counts.idxmax()
weekday_min = weekday_counts.idxmin()
# Hour of day
hour_counts = crime_df['hour'].value_counts()
hour_max = hour_counts.idxmax()
hour_min = hour_counts.idxmin()
# Month
month_counts = crime_df['month_name'].value_counts()
month_max = month_counts.idxmax()
month_min = month_counts.idxmin()
summary_table = pd.DataFrame({
'Time Feature': ['Weekday', 'Hour', 'Month'],
'Highest Crime': [f"{weekday_max} ({weekday_counts[weekday_max]})",
f"{hour_max} ({hour_counts[hour_max]})",
f"{month_max} ({month_counts[month_max]})"],
'Lowest Crime': [f"{weekday_min} ({weekday_counts[weekday_min]})",
f"{hour_min} ({hour_counts[hour_min]})",
f"{month_min} ({month_counts[month_min]})"]
})
print(summary_table)
Time Feature Highest Crime Lowest Crime 0 Weekday Sunday (10093) Tuesday (8304) 1 Hour 20 (3791) 5 (1082) 2 Month July (6100) February (4580)
plt.figure(figsize=(8,6))
sns.scatterplot(x='longitude', y='latitude', data=crime_df, alpha=0.1, s=5)
plt.title('Crime Locations (All Data)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
The interactive map is centered at the median latitude and longitude of all incidents, corresponding to the area with the highest density of reported crimes.
# Calculate median latitude and longitude (good if your data has outliers)
center_lat = crime_df['latitude'].median()
center_lon = crime_df['longitude'].median()
crime_map = folium.Map(location=[center_lat, center_lon], zoom_start=10)
# Plot a sample (for performance)
sample_df = crime_df[['latitude', 'longitude']].dropna().sample(500, random_state=1)
for idx, row in sample_df.iterrows():
folium.CircleMarker(
location=[row['latitude'], row['longitude']],
radius=3,
color='red',
fill=True,
fill_opacity=0.3
).add_to(crime_map)
folium.Marker(
location=[center_lat, center_lon],
popup='Hotspot Center',
icon=folium.Icon(color='blue', icon='info-sign')
).add_to(crime_map)
crime_map
The following heatmap visualizes the density of crime incidents across the city. Red and yellow areas indicate hotspots with the highest concentrations of reported crimes, centered at the area of greatest activity.
# Center map at the median
center_lat = crime_df['latitude'].median()
center_lon = crime_df['longitude'].median()
crime_map = folium.Map(location=[center_lat, center_lon], zoom_start=10)
# Prepare the data (drop missing and sample for speed)
heat_data = crime_df[['latitude', 'longitude']].dropna().sample(2000, random_state=1).values.tolist()
# Add heatmap layer
HeatMap(heat_data, radius=15, blur=5, min_opacity=0.2).add_to(crime_map)
crime_map
Explore solve rate by various features¶
# Bin victim_age
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
age_labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90+']
crime_df['victim_age_group'] = pd.cut(crime_df['victim_age'], bins=age_bins, labels=age_labels, right=False)
def get_top_bottom_solve_rates(crime_df, feature, solved_label='Solved', n=3):
# Calculate solve rates
solve_by_feat = (
crime_df.groupby(feature, observed=True)['case_solved']
.value_counts(normalize=True)
.unstack()
.fillna(0)
)
# Sort by solve rate
sorted_solve = solve_by_feat.sort_values(by=solved_label, ascending=False)
# Get top n and bottom n (with solve rate > 0)
top_n = sorted_solve[sorted_solve[solved_label] > 0].head(n)
bottom_n = sorted_solve[sorted_solve[solved_label] > 0].tail(n)
return top_n[[solved_label]].reset_index(), bottom_n[[solved_label]].reset_index()
# Features to analyze
features = [
('crime_description', 'Crime Type'),
('weapon_description', 'Weapon'),
('victim_ethnicity', 'Victim Ethnicity'),
('victim_gender', 'Victim Gender'),
('area_name', 'Area'),
('victim_age_group', 'Victim Age Group')
]
# Build summary DataFrame
summary_rows = []
for feat_col, feat_label in features:
top, bottom = get_top_bottom_solve_rates(crime_df, feat_col)
# Add top
for i, row in top.iterrows():
summary_rows.append({
'Feature': feat_label,
'Category': row[feat_col],
'Solve Rate': round(row['Solved'], 3),
'Rank': 'Top'
})
# Add bottom
for i, row in bottom.iterrows():
summary_rows.append({
'Feature': feat_label,
'Category': row[feat_col],
'Solve Rate': round(row['Solved'], 3),
'Rank': 'Bottom'
})
summary_table = pd.DataFrame(summary_rows)
# Show top 3 and bottom 3 per feature in a pretty table
summary_table = summary_table.pivot_table(
index=['Feature', 'Rank'],
values=['Category', 'Solve Rate'],
aggfunc=lambda x: list(x)
).reset_index()
# Display the summary table
print(summary_table.to_string(index=False))
Feature Rank Category Solve Rate Area Bottom [Hollywood, Pacific, Southeast] [0.299, 0.294, 0.27] Area Top [Mission, West Valley, N Hollywood] [0.566, 0.564, 0.547] Crime Type Bottom [THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER), THEFT PLAIN - ATTEMPT, BURGLARY FROM VEHICLE] [0.106, 0.1, 0.058] Crime Type Top [LYNCHING, PROWLER, MANSLAUGHTER, NEGLIGENT] [1.0, 1.0, 1.0] Victim Age Group Bottom [10-19, 60-69, 0-9] [0.375, 0.372, 0.33] Victim Age Group Top [80-89, 20-29, 30-39] [0.454, 0.428, 0.42] Victim Ethnicity Bottom [Vietnamese, Unknown, Chinese] [0.333, 0.283, 0.167] Victim Ethnicity Top [Japanese, Filipino, Korean] [0.556, 0.545, 0.495] Victim Gender Bottom [Unknown, Male, Non-binary] [0.455, 0.362, 0.284] Victim Gender Top [Homosexual, Female, Unknown] [0.5, 0.462, 0.455] Weapon Bottom [UNKNOWN TYPE CUTTING INSTRUMENT, UNKNOWN FIREARM, SEMI-AUTOMATIC RIFLE] [0.174, 0.143, 0.143] Weapon Top [KITCHEN KNIFE, SWORD, GLASS] [0.668, 0.649, 0.637]
Visualization¶
Use effective visuals to support your analysis.
- Include charts such as bar plots, feature importance, and heatmaps
- Make all visualizations interpretable and relevant
- Label axes, legends, and titles clearly
def get_top10_solve_rate(crime_df, group_col, solved_label='Solved'):
solve_by_feature = (
crime_df.groupby(group_col, observed=True)['case_solved']
.value_counts(normalize=True)
.unstack()
.fillna(0)
.sort_values(by=solved_label, ascending=False)
)
top10 = solve_by_feature[solved_label].head(10).reset_index()
top10.columns = [group_col, 'solve_rate']
return top10
# Add to features
features = [
('crime_description', 'Solve Rate by Crime Type (Top 10)', 'Crime Type'),
('weapon_description', 'Solve Rate by Weapon (Top 10)', 'Weapon'),
('victim_ethnicity', 'Solve Rate by Victim Ethnicity (Top 10)','Victim Ethnicity'),
('victim_gender', 'Solve Rate by Victim Gender (Top 10)', 'Victim Gender'),
('area_name', 'Solve Rate by Area (Top 10)', 'Area'),
('victim_age_group', 'Solve Rate by Victim Age Group (Top 10)', 'Victim Age Group')
]
dfs = [get_top10_solve_rate(crime_df, f[0]) for f in features]
# Make subplots
fig = make_subplots(
rows=3, cols=2,
subplot_titles=[f[1] for f in features],
shared_xaxes=False, shared_yaxes=False,
horizontal_spacing=0.20
)
for idx, (df, (col, title, ylabel)) in enumerate(zip(dfs, features)):
row = idx // 2 + 1
col_num = idx % 2 + 1
fig.add_trace(
go.Bar(
x=df['solve_rate'],
y=df[col],
orientation='h',
marker_color='steelblue',
text=(df['solve_rate']*100).round(1).astype(str)+'%',
textposition='auto',
),
row=row, col=col_num
)
fig.update_yaxes(title_text=ylabel, row=row, col=col_num)
fig.update_xaxes(range=[0, 1], title_text='Solve Rate', row=row, col=col_num)
fig.update_layout(
height=1200, width=1000,
title_text="Solve Rate by Feature (Top 10)",
showlegend=False,
font=dict(size=8),
margin=dict(l=0, r=0, t=60, b=5),
)
for i in range(1, len(features) + 1):
fig['layout']['annotations'][i-1]['font'] = dict(size=12)
fig.show()
Feature Importance (using a simple model)
importances = None
# Prepare data
features_for_model = ['crime_description', 'weapon_description', 'victim_ethnicity',
'victim_gender', 'area_name', 'victim_age_group']
df_model = crime_df.dropna(subset=features_for_model + ['case_solved'])
# Encode categorical features numerically
df_model_encoded = df_model.copy()
label_encoders = {}
for col in features_for_model + ['case_solved']:
le = LabelEncoder()
df_model_encoded[col] = le.fit_transform(df_model_encoded[col].astype(str))
label_encoders[col] = le
X = df_model_encoded[features_for_model]
y = df_model_encoded['case_solved']
# Fit model
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
# Feature importance
importances = model.feature_importances_
imp_df = pd.DataFrame({'feature': features_for_model, 'importance': importances}).sort_values('importance', ascending=False)
fig = px.bar(imp_df, x='importance', y='feature', orientation='h',
title='Feature Importance for Predicting Case Solved',
text=imp_df['importance'].round(3))
fig.update_traces(textposition='outside')
fig.update_layout(yaxis_title='Feature', xaxis_title='Importance')
fig.show()
numeric_cols = crime_df.select_dtypes(include='number')
corr = numeric_cols.corr()
fig = ff.create_annotated_heatmap(
z=corr.values,
x=list(corr.columns),
y=list(corr.index),
annotation_text=corr.round(2).values,
colorscale='Viridis'
)
fig.update_layout(title_text='Correlation Heatmap of Numeric Features', width=800, height=800)
fig.show()
Feature Engineering¶
Create or refine features to enhance model quality.
- Transform or derive features
- Encode categorical variables
- Normalize or scale features as appropriate
crime_df.dtypes
date_reported datetime64[ns] date_occurred datetime64[ns] area category area_name category part category crime_code category victim_age int32 victim_gender object victim_ethnicity object premise_code category premise_description category weapon_code category crime_code_1 category latitude float64 longitude float64 case_solved category weapon_description category crime_description category year int32 month int32 day_of_week object month_name object hour int32 victim_age_group category dtype: object
Several features were engineered or transformed to enhance model quality and analysis. Categorical variables (such as gender and ethnicity) were cleaned for consistency. New time-based features—like year, month, hour, and day of week—were derived from date columns to support temporal analysis. Victim age was grouped into categorical bins to simplify trends by age segment. Code columns were joined with their descriptions for greater interpretability, and redundant columns were removed. All features were cast to their appropriate data types to ensure optimal performance for modeling and visualization.
Transform or Derive Features¶
Already derived:
- year
- month
- day_of_week
- hour
- victim_age_group
Additional:
- is_weekend (binary: Saturday/Sunday)
- day_of_year
- season
- Interaction Features Victim Gender/Ethnicity
- Interaction Features Area/Crime Type
# Weekend indicator
crime_df['is_weekend'] = crime_df['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
# Season (basic example)
crime_df['season'] = crime_df['month'].map({12:'Winter', 1:'Winter', 2:'Winter',
3:'Spring', 4:'Spring', 5:'Spring',
6:'Summer', 7:'Summer', 8:'Summer',
9:'Fall', 10:'Fall', 11:'Fall'})
# Day of the year
crime_df['day_of_year'] = crime_df['date_occurred'].dt.dayofyear
# Interaction Features Victim Gender/Ethnicity
crime_df['victim_gender_ethnicity'] = (
crime_df['victim_gender'].astype(str) + '_' + crime_df['victim_ethnicity'].astype(str)
)
# Interaction Features Area/Crime Type
crime_df['area_crime_type'] = (
crime_df['area_name'].astype(str) + '_' + crime_df['crime_description'].astype(str)
)
Encode Categorical Variables¶
- For tree-based models (RandomForest, XGBoost): Label encoding is fine for most variables.
- For linear models or neural nets: use one-hot encoding or ordinal encoding.
# Create a copy for modeling/encoding
crime_model_df = crime_df.copy()
cat_cols = [
'area', 'area_name', 'part', 'crime_code', 'victim_gender', 'victim_ethnicity',
'premise_code', 'premise_description', 'weapon_code', 'crime_code_1','day_of_week', 'month_name',
'case_solved', 'weapon_description', 'crime_description', 'victim_age_group', 'season',
'victim_gender_ethnicity', 'area_crime_type'
]
label_encoders = {}
for col in cat_cols:
if col in crime_model_df.columns:
le = LabelEncoder()
crime_model_df[col] = le.fit_transform(crime_model_df[col].astype(str))
label_encoders[col] = le
crime_model_df.dtypes
date_reported datetime64[ns] date_occurred datetime64[ns] area int32 area_name int32 part int32 crime_code int32 victim_age int32 victim_gender int32 victim_ethnicity int32 premise_code int32 premise_description int32 weapon_code int32 crime_code_1 int32 latitude float64 longitude float64 case_solved int32 weapon_description int32 crime_description int32 year int32 month int32 day_of_week int32 month_name int32 hour int32 victim_age_group int32 is_weekend int32 season int32 day_of_year int32 victim_gender_ethnicity int32 area_crime_type int32 dtype: object
crime_model_df.head()
date_reported | date_occurred | area | area_name | part | crime_code | victim_age | victim_gender | victim_ethnicity | premise_code | ... | month | day_of_week | month_name | hour | victim_age_group | is_weekend | season | day_of_year | victim_gender_ethnicity | area_crime_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-08-21 | 2021-08-21 23:51:00 | 3 | 0 | 0 | 4 | 51 | 0 | 1 | 0 | ... | 8 | 2 | 1 | 23 | 5 | 1 | 2 | 233 | 1 | 51 |
1 | 2021-02-20 | 2021-02-20 22:50:00 | 11 | 13 | 1 | 56 | 0 | 0 | 9 | 20 | ... | 2 | 2 | 3 | 22 | 0 | 1 | 3 | 51 | 9 | 826 |
2 | 2021-07-29 | 2021-07-28 21:00:00 | 5 | 12 | 0 | 13 | 0 | 3 | 9 | 18 | ... | 7 | 6 | 5 | 21 | 0 | 0 | 2 | 209 | 31 | 719 |
3 | 2021-12-25 | 2021-12-25 19:00:00 | 0 | 1 | 0 | 4 | 53 | 2 | 11 | 0 | ... | 12 | 2 | 2 | 19 | 5 | 1 | 3 | 359 | 27 | 122 |
4 | 2021-06-24 | 2021-06-24 02:35:00 | 17 | 6 | 0 | 4 | 40 | 2 | 1 | 0 | ... | 6 | 4 | 6 | 2 | 4 | 0 | 2 | 175 | 17 | 388 |
5 rows × 29 columns
Normalize or Scale Features¶
- Scale numeric features (victim_age, latitude, longitude, hour, etc.) for models sensitive to feature magnitude (logistic regression, neural nets).
- Tree-based models do not require feature scaling, but it can help with KNN, SVM, etc.
Only scale the columns that are naturally continuous and exclude label-encoded categoricals.
numeric_cols = ['victim_age', 'latitude', 'longitude', 'year', 'month', 'hour', 'day_of_year']
scaler = StandardScaler()
crime_model_df[numeric_cols] = scaler.fit_transform(crime_model_df[numeric_cols])
Model Building¶
Build classification models to predict whether a case will be solved.
- Use at least two models (such as Logistic Regression, Random Forest)
- Perform a proper train-test split
- Apply appropriate preprocessing (e.g., scaling, encoding)
- Tune hyperparameters if needed
Prepare Features and Target¶
Choose your features (exclude non-numeric identifiers, text, or unused columns)
# Drop ID/time columns if not used in modeling
feature_cols = [
'victim_age', 'latitude', 'longitude', 'hour', 'year', 'month', 'day_of_year',
'area', 'area_name', 'part', 'crime_code', 'victim_gender', 'victim_ethnicity',
'premise_code', 'premise_description', 'weapon_code', 'crime_code_1',
'weapon_description', 'crime_description', 'victim_age_group',
'is_weekend', 'season', 'victim_gender_ethnicity', 'area_crime_type'
]
X = crime_model_df[feature_cols].copy()
y = crime_model_df['case_solved'].copy()
# First split: Train+Validation vs. Test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)
# Second split: Train vs. Validation
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)
# (Now: 60% train, 20% validation, 20% test)
print(f"Train X (Independent Features): {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")
print(f"Train y (Dependent Variable): {y_train.shape}, Val: {y_val.shape}, Test: {y_test.shape}")
Train X (Independent Features): (38318, 24), Val: (12773, 24), Test: (12773, 24) Train y (Dependent Variable): (38318,), Val: (12773,), Test: (12773,)
Model 1: Logistic Regression¶
Increased max_iter to 10,000 for Logistic Regression to ensure convergence with a high number of features and categorical variables.
# Main parameters shown: penalty, C, solver, max_iter, random_state
logreg = LogisticRegression(
penalty='l2', # Regularization type
C=1.0, # Inverse of regularization strength
solver='lbfgs', # Optimization algorithm
max_iter=10000, # Max number of iterations
random_state=42 # Seed
)
logreg.fit(X_train, y_train)
print("Logistic Regression fit parameters:")
print(logreg.get_params())
Logistic Regression fit parameters: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 10000, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
Model 2: Random Forest¶
# Main parameters shown: n_estimators, max_depth, min_samples_split, random_state
rf = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=None, # Max tree depth
min_samples_split=2, # Min samples to split an internal node
random_state=42, # Seed
n_jobs=-1 # Use all CPU cores
)
rf.fit(X_train, y_train)
print("Random Forest fit parameters:")
print(rf.get_params())
Random Forest fit parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 100, 'n_jobs': -1, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}
Initial Validation & Test¶
# Validation
y_val_pred_logreg = logreg.predict(X_val)
y_val_pred_rf = rf.predict(X_val)
print("LogReg Validation Accuracy:", accuracy_score(y_val, y_val_pred_logreg))
print("RF Validation Accuracy:", accuracy_score(y_val, y_val_pred_rf))
# Final test can be performed after selecting the best model
LogReg Validation Accuracy: 0.6323494871995615 RF Validation Accuracy: 0.7042198387223049
Evaluate how well your models perform¶
- Use metrics such as accuracy, precision, recall, F1-score, AUC-ROC
- You can include confusion matrices and interpret results
- Discuss the effect of class imbalance and possible mitigation
- Compare model performance meaningfully
Generate All Key Metric¶
# Predict on validation and test sets (use your models: logreg, rf)
y_val_pred_logreg = logreg.predict(X_val)
y_val_pred_rf = rf.predict(X_val)
y_test_pred_logreg = logreg.predict(X_test)
y_test_pred_rf = rf.predict(X_test)
# For AUC, need predicted probabilities (for positive class)
y_val_proba_logreg = logreg.predict_proba(X_val)[:,1]
y_val_proba_rf = rf.predict_proba(X_val)[:,1]
y_test_proba_logreg = logreg.predict_proba(X_test)[:,1]
y_test_proba_rf = rf.predict_proba(X_test)[:,1]
Display Metrics¶
def print_metrics(y_true, y_pred, y_proba, model_name="Model"):
print(f"\n{model_name} Performance:")
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC-ROC :", roc_auc_score(y_true, y_proba))
print(classification_report(y_true, y_pred))
print_metrics(y_val, y_val_pred_logreg, y_val_proba_logreg, "Logistic Regression (Val)")
print_metrics(y_val, y_val_pred_rf, y_val_proba_rf, "Random Forest (Val)")
print_metrics(y_test, y_test_pred_logreg, y_test_proba_logreg, "Logistic Regression (Test)")
print_metrics(y_test, y_test_pred_rf, y_test_proba_rf, "Random Forest (Test)")
Logistic Regression (Val) Performance: Accuracy : 0.6323494871995615 Precision: 0.579849946409432 Recall : 0.3155745673731285 F1-score : 0.4087131704860237 AUC-ROC : 0.6499525115128045 precision recall f1-score support 0 0.65 0.85 0.73 7630 1 0.58 0.32 0.41 5143 accuracy 0.63 12773 macro avg 0.61 0.58 0.57 12773 weighted avg 0.62 0.63 0.60 12773 Random Forest (Val) Performance: Accuracy : 0.7042198387223049 Precision: 0.6690611840475601 Recall : 0.5251798561151079 F1-score : 0.5884531590413943 AUC-ROC : 0.7548723162379027 precision recall f1-score support 0 0.72 0.82 0.77 7630 1 0.67 0.53 0.59 5143 accuracy 0.70 12773 macro avg 0.69 0.68 0.68 12773 weighted avg 0.70 0.70 0.70 12773 Logistic Regression (Test) Performance: Accuracy : 0.6248336334455492 Precision: 0.5600821636425881 Recall : 0.3181022749368073 F1-score : 0.40575396825396826 AUC-ROC : 0.6369829558760982 precision recall f1-score support 0 0.64 0.83 0.73 7630 1 0.56 0.32 0.41 5143 accuracy 0.62 12773 macro avg 0.60 0.57 0.57 12773 weighted avg 0.61 0.62 0.60 12773 Random Forest (Test) Performance: Accuracy : 0.6923980270883896 Precision: 0.650844930417495 Recall : 0.5092358545595955 F1-score : 0.5713974037307734 AUC-ROC : 0.7365395813419047 precision recall f1-score support 0 0.71 0.82 0.76 7630 1 0.65 0.51 0.57 5143 accuracy 0.69 12773 macro avg 0.68 0.66 0.67 12773 weighted avg 0.69 0.69 0.68 12773
Confusion Matrix Plots¶
def plot_confusion(y_true, y_pred, title):
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(4,4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title(title)
plt.show()
plot_confusion(y_val, y_val_pred_logreg, "LogReg Validation Confusion Matrix")
plot_confusion(y_val, y_val_pred_rf, "Random Forest Validation Confusion Matrix")
ROC Curves¶
RocCurveDisplay.from_estimator(logreg, X_val, y_val)
plt.title("Logistic Regression ROC Curve (Validation)")
plt.show()
RocCurveDisplay.from_estimator(rf, X_val, y_val)
plt.title("Random Forest ROC Curve (Validation)")
plt.show()
print(y.value_counts(normalize=True))
case_solved 0 0.597379 1 0.402621 Name: proportion, dtype: float64
Insights, Interpretation, and Reporting¶
Communicate your findings clearly, professionally, and meaningfully.
- Include an executive summary outlining key findings and recommendations
- Clearly explain what factors most influence case solvability
- Highlight any spatial, temporal, or demographic patterns
- Suggest improvements or interventions based on your results
- Reflect on limitations of your analysis and model
- Structure your report into logical, readable sections with explanations
Executive Summary¶
This project aimed to predict the solvability of crime cases using a comprehensive dataset of reported incidents, enriched by extensive feature engineering and machine learning analysis.
Key findings:
- Ethnic Distribution: The dataset reflects a predominance of Hispanic and Black victims, which may be linked to the underlying population distribution, reporting behaviour, or exposure to risk.
- Gender Distribution: Both male and female victims are prevalent, but males are somewhat more common in this dataset.
- Age Distribution: Victims are most commonly young adults, with a steep decline in frequency among older populations.
- Random Forest achieved a validation accuracy of 70.4% and an AUC-ROC of 0.75, outperforming Logistic Regression on all metrics.
- The most influential factors for case solvability included spatial (area), temporal (hour, day of week), demographic (victim age, gender, ethnicity), and crime-type features.
- Targeted interventions are recommended for districts and periods with lower solve rates, and for crime types and demographics with historically lower solvability.
- Limitations include missing values, possible unobserved variables, and moderate class imbalance.
Factors Influencing Case Solvability¶
Analysis of feature importance using Random Forest and SHAP summary plots reveals that:
- Area name and crime description are the top predictors of case solvability.
- Temporal factors such as hours and days of the week have a strong influence, with cases reported during the day and on weekdays being more likely to be solved.
- Victim demographics—especially the interaction of gender and ethnicity play a notable role, indicating the need for demographic-sensitive investigative strategies.
- Weapon and crime type further segment solvability, with certain violent crimes or weapon-involved incidents showing lower solve rates.
Spatial, Temporal, and Demographic Patterns¶
- Spatial Patterns:
Heatmap visualizations show clusters of both solved and unsolved cases. Some areas consistently outperform others in case resolution (see Table 2: Top 10 Areas by Solve Rate).
- Temporal Patterns:
Bar plots illustrate that crimes occurring during certain hours (especially early mornings and weekdays) have higher resolution rates. Seasonal analysis supports increased solvability in spring/summer.
- Demographic Patterns:
The “Solve Rate by Victim Age Group” and detailed bar plots for ethnicity and gender show age and combined demographics as significant, with middle-aged victims and certain demographic profiles more likely to have cases solved.
- Victim Demographics: Ethnicity, Gender, and Age Distribution
Ethnicity As shown in the Distribution of Victim Ethnicity bar plot, the dataset is predominantly composed of victims identified as Hispanic, Black, and White. Hispanic victims represent the largest group, with close to 29,000 cases, followed by Black (about 15,000) and White (approximately 10,000). Other ethnicities—including Asian, Korean, Filipino, Indigenous, Japanese, Chinese, and Vietnamese—are present in smaller numbers, each with fewer than 5,000 reported victims. The "Other" and "Unknown" categories also account for a notable share, highlighting the presence of missing or non-standardized data.
- Gender
The Distribution of Victim Gender plot indicates that the majority of crime victims in the dataset are Male and Female, with males slightly outnumbering females (over 31,000 vs. nearly 29,000 cases, respectively). Non-binary and unknown gender categories are present but represent a much smaller fraction of the records. There is also a minimal number of cases identified as "Homosexual" in the gender variable, which may reflect unique reporting conventions or data entry inconsistencies.
- Age
The Distribution of Victim Age histogram reveals a wide age range among victims, but with pronounced peaks in the 20–35 year age bracket. There is a sharp drop-off in victim counts for ages above 60, and a large spike at age 0, possibly reflecting the way infants or unknown ages are coded. The distribution skews toward younger adults, with the median age likely falling in the late twenties to early thirties.
Recommendations and Suggested Interventions¶
- Resource Allocation: Prioritize investigative resources in low-performing districts and peak hours identified.
- Training & Policy: Use insights from feature importance to train officers on cases more likely to remain unsolved, such as those involving certain crime types, weapons, or demographics.
- Data Collection: Improve data completeness, especially for weapon codes and victim demographic information.
- Analytical Expansion: Incorporate external factors (e.g., weather, special events) in future models to boost prediction accuracy.
- Class Imbalance: Continue exploring balanced class weighting and resampling.
Limitations¶
- Missing and Noisy Data: Several fields had missing or inconsistent entries, requiring removal or imputation.
- Feature Engineering: While new features (e.g., victim_age_group, season, area_crime_type) improved model performance, some potentially predictive attributes may be absent from the current dataset.
- Model Generalizability: The best-performing Random Forest model may overfit specific patterns in this dataset; performance on other cities or years may differ.
- Class Imbalance: Although moderate, it still affected recall for the “solved” class; further SMOTE or class weighting could help.
Conclusions¶
This analysis successfully demonstrated how data-driven approaches can be applied to predict the solvability of crime cases using real-world law enforcement data. By thoroughly cleaning and enriching the dataset, performing detailed exploratory analysis, and evaluating advanced classification models, we uncovered valuable patterns and actionable insights.
Demographic patterns provide important context for model interpretation and for developing prevention or intervention strategies. For example, targeted outreach, resource allocation, or investigative strategies may be more effective if informed by the prevailing victim profiles in the data. Additionally, understanding demographic skews can help identify possible data collection or reporting biases.
Predictive Modelling:
Random Forest models, after careful feature engineering and data preparation, provided the highest accuracy and recall for identifying solved cases, outperforming simpler linear approaches like Logistic Regression. This indicates the presence of complex, nonlinear relationships in the factors affecting case outcomes.
- Key Influences:
The analysis revealed that spatial (area/district), temporal (hour, day of week, season), demographic (victim age, gender, ethnicity), and crime-type features are among the most significant predictors of case solvability. Interaction features further enhanced model performance, confirming that multifaceted relationships exist within the data.
- Pattern Recognition:
Spatial heatmaps, time-based trend analyses, and demographic group bar plots helped reveal which neighbourhoods, periods, and victim profiles are associated with higher or lower case resolution rates.
- Practical Insights:
Results suggest that focusing investigative resources on low-performing districts, peak hours, and certain demographic or crime types may improve overall case solvability. Improved data collection, especially regarding victim and weapon characteristics, can further enhance prediction and operational strategies.
- Model Limitations:
The analysis is subject to limitations such as missing or incomplete data, moderate class imbalance, and the possible exclusion of relevant external factors (e.g., investigative effort, socioeconomic context). Model generalizability to other regions or future datasets may be limited without further validation.
- Future Directions:
Additional improvements could include integrating more diverse data sources, experimenting with more advanced ensemble or deep learning methods, and developing interpretable model explanations (such as SHAP plots) for operational use.
Overall, this notebook provides a strong foundation for using predictive analytics to support decision-making in law enforcement, offering both immediate practical recommendations and a roadmap for continued analytical development.