Predictive Modeling / 2025
LA Crime Type Prediction
SVM with RBF kernel became the strongest deployable baseline with accuracy 0.321 and macro F1 0.303 across 138 modeled classes.
- Python
- Scikit-learn
- Pandas
- SVM
- KNN
- Naive Bayes
Quick read
Skim the case by context, role, result, and evidence.
A multi-class classification case study predicting Los Angeles crime codes from time, location, premise, status, and victim attributes.
Machine learning practitioner
SVM with RBF kernel became the strongest deployable baseline with accuracy 0.321 and macro F1 0.303 across 138 modeled classes.
976k raw crime records
Context
Crime data can reveal patterns across time, geography, premise, case status, and victim context. This academic project framed cleaned Los Angeles crime records from 2020-2024 as a supervised multi-class classification problem.
Problem
The dataset was large, highly multi-class, and imbalanced. The challenge was not only to train a model, but to prepare categorical features, balance target classes, compare model behavior, and avoid overstating performance in a sensitive public-safety domain.
My Role
I worked on data preparation, model comparison, imbalance handling, and evaluation design.
Approach
- Prepared features from time, area, reporting district, premise, case status, day of week, month, day/night category, geographic zone, and victim attributes.
- Removed redundant or unsafe modeling columns, then encoded categorical features for classical machine learning models.
- Removed single-instance target classes and balanced the dataset to 100 samples per modeled class.
- Compared KNN, Naive Bayes, Logistic Regression, SVM, Decision Tree, and BPNN.
- Evaluated results with accuracy, macro precision, macro recall, macro F1, and macro ROC AUC.
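The class filtering, balancing, and encoding steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's code: the column names (`crime_code`, `area`, `hour`), the helper `balance_classes`, and the choice to sample with replacement when a class has fewer than 100 rows are all assumptions, since the write-up does not say how smaller classes were brought up to 100 samples.

```python
import pandas as pd


def balance_classes(df, target, n_per_class=100, seed=42):
    """Drop single-instance classes, then resample each class to n_per_class rows."""
    counts = df[target].value_counts()
    keep = counts[counts > 1].index  # classes with at least two records
    df = df[df[target].isin(keep)]
    parts = []
    for _, group in df.groupby(target):
        # Assumption: oversample (with replacement) only when a class is too small.
        parts.append(
            group.sample(
                n=n_per_class,
                replace=len(group) < n_per_class,
                random_state=seed,
            )
        )
    return pd.concat(parts).reset_index(drop=True)


# Toy frame with hypothetical columns; class "C" has a single record.
toy = pd.DataFrame({
    "crime_code": ["A"] * 150 + ["B"] * 40 + ["C"] * 1,
    "area": ["Central", "Hollywood", "Pacific"] * 63 + ["Central", "Hollywood"],
    "hour": list(range(24)) * 7 + list(range(23)),
})

balanced = balance_classes(toy, "crime_code")

# One-hot encode the categorical feature for classical models.
X = pd.get_dummies(balanced.drop(columns="crime_code"), columns=["area"])
y = balanced["crime_code"]

print(y.value_counts())
```

Oversampling small classes duplicates records, so in a setup like this the train/test split would normally happen before resampling to keep copies of the same record out of the test set.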
Key Decisions
The project compared several model families and metrics instead of relying on a single headline score. I treated SVM with RBF kernel as the strongest deployable baseline because it led on both accuracy and macro F1. The BPNN reported a high AUC value, but that result needs an audit before it can be trusted.
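A comparison of this kind could look like the sketch below. It runs on synthetic data from `make_classification`, not the crime records, and every hyperparameter except the RBF kernel with C = 10 is illustrative. `MLPClassifier` is used here as a stand-in for the BPNN, which is an assumption about how that model might be expressed in scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem standing in for the real feature matrix.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # probability=True is needed so SVC exposes predict_proba for ROC AUC.
    "SVM (RBF, C=10)": SVC(kernel="rbf", C=10, probability=True, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "MLP (BPNN stand-in)": MLPClassifier(
        hidden_layer_sizes=(32,), max_iter=1000, random_state=0
    ),
}

results = {}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)  # columns follow clf.classes_ order
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "macro_f1": f1_score(y_test, pred, average="macro"),
        "macro_roc_auc": roc_auc_score(
            y_test, proba, multi_class="ovr", average="macro"
        ),
    }

# Rank by macro F1 so that no single metric is read in isolation.
for name, scores in sorted(
    results.items(), key=lambda kv: kv[1]["macro_f1"], reverse=True
):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Reporting accuracy, macro F1, and macro ROC AUC side by side makes it easier to spot a model whose AUC looks strong while its hard predictions do not, which is the kind of discrepancy that would prompt an audit.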
Result
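SVM with RBF kernel and C = 10 became the strongest deployable baseline in the notebook, with accuracy 0.321 and macro F1 0.303 on the test set. The score is intentionally presented with context because the target space contains 138 modeled classes: if those classes were equally frequent in the test set, uniform random guessing would reach an accuracy of only about 1/138 ≈ 0.007.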
What I’d Improve
I would audit the BPNN label mapping, add top-k accuracy, include confusion-matrix analysis for frequent crime classes, test boosting models such as XGBoost or LightGBM, and add feature importance or SHAP-style interpretability.
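As an illustration of the top-k accuracy idea, the sketch below uses scikit-learn's `top_k_accuracy_score` on a small hand-made score matrix. The labels and scores are invented for the example and are not project results.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Four samples, four classes; each row holds per-class scores for one sample.
y_true = np.array([0, 1, 2, 3])
scores = np.array([
    [0.50, 0.30, 0.15, 0.05],  # true class 0 is ranked 1st
    [0.40, 0.35, 0.15, 0.10],  # true class 1 is ranked 2nd
    [0.45, 0.30, 0.20, 0.05],  # true class 2 is ranked 3rd
    [0.40, 0.30, 0.20, 0.10],  # true class 3 is ranked 4th
])
labels = [0, 1, 2, 3]  # column order of the score matrix

top1 = top_k_accuracy_score(y_true, scores, k=1, labels=labels)
top3 = top_k_accuracy_score(y_true, scores, k=3, labels=labels)

print(top1)  # 0.25: only the first sample has its true class ranked first
print(top3)  # 0.75: three of four samples have the true class in the top 3
```

With a large label space, top-k accuracy shows how often the true class is among the model's leading candidates even when the single best guess is wrong. The score columns must follow the same class order as `labels`, so a check of that ordering would be a natural part of any label-mapping audit.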
Responsible Use Note
This project should be read as an academic modeling exercise, not as an automated decision-making system for policing or legal action. Crime datasets can reflect reporting bias, location bias, and complex social context that model scores alone cannot resolve.