LA Crime Type Prediction

SVM with RBF kernel became the strongest deployable baseline with accuracy 0.321 and macro F1 0.303 across 138 modeled classes.

  • SVM
  • KNN
  • Naive Bayes

Skim the case by context, role, result, and evidence.

Context

A multi-class classification case study predicting Los Angeles crime codes from time, location, premise, status, and victim attributes.

My role

Machine learning practitioner

Result

SVM with RBF kernel became the strongest deployable baseline with accuracy 0.321 and macro F1 0.303 across 138 modeled classes.

Evidence

976k raw crime records

Context

Crime data can reveal patterns across time, geography, premise, case status, and victim context. This academic project framed cleaned Los Angeles crime records from 2020-2024 as a supervised multi-class classification problem.

Problem

The dataset was large, highly multi-class, and imbalanced. The challenge was not only to train a model, but to prepare categorical features, balance target classes, compare model behavior, and avoid overstating performance in a sensitive public-safety domain.

My Role

I worked on data preparation, model comparison, imbalance handling, and evaluation design.

Evidence

[Figure: Los Angeles crime prediction dashboard showing raw records, modeled classes, balanced records, top crime types, day-night split, zones, and top areas]
Dataset evidence: 976k raw records, 138 modeled classes, and a balanced 13,800-row modeling dataset.

Approach

  • Prepared features from time, area, reporting district, premise, case status, day of week, month, day/night category, geographic zone, and victim attributes.
  • Removed redundant or unsafe modeling columns, then encoded categorical features for classical machine learning models.
  • Removed single-instance target classes and balanced the dataset to 100 samples per modeled class.
  • Compared KNN, Naive Bayes, Logistic Regression, SVM, Decision Tree, and BPNN.
  • Evaluated results with accuracy, macro precision, macro recall, macro F1, and macro-averaged ROC AUC.
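
The class-filtering and balancing step above can be sketched in plain Python. This is an illustrative sketch, not the notebook's code: the function name `balance_by_class`, the field names `crime_code` and `hour`, and the toy records are all hypothetical. The write-up does not say how classes with fewer than 100 records were brought up to 100, so the sketch assumes sampling with replacement for those classes.

```python
import random
from collections import defaultdict


def balance_by_class(rows, label_key, per_class=100, min_count=2, seed=0):
    """Drop rare classes, then resample each remaining class to `per_class` rows."""
    rng = random.Random(seed)

    # Group rows by their target label.
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    balanced = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) < min_count:
            # Single-instance classes are removed from the target space.
            continue
        if len(members) >= per_class:
            # Large classes: downsample without replacement.
            picked = rng.sample(members, per_class)
        else:
            # ASSUMPTION: small classes are oversampled with replacement.
            picked = rng.choices(members, k=per_class)
        balanced.extend(picked)
    return balanced


# Toy records; "crime_code" and "hour" are hypothetical field names.
toy = (
    [{"crime_code": 510, "hour": h % 24} for h in range(250)]
    + [{"crime_code": 624, "hour": h % 24} for h in range(40)]
    + [{"crime_code": 999, "hour": 3}]  # appears once, so it is dropped
)
balanced = balance_by_class(toy, "crime_code", per_class=100)
print(len(balanced))  # 2 remaining classes * 100 rows each
```

With 138 retained classes, the same procedure yields 138 * 100 = 13,800 rows, which matches the size of the modeling dataset reported above.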

Key Decisions

The project compared several model families on multiple metrics rather than relying on a single headline score. I treated SVM with RBF kernel as the strongest deployable baseline because it led on both accuracy and macro F1, while the BPNN result still requires an audit despite its high AUC value.
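
To make the selection metrics concrete, here is a minimal from-scratch sketch of accuracy and the macro-averaged scores. It is not the notebook's code, which may well have used library functions; the helper name `macro_scores` and the toy labels are hypothetical. Macro averaging computes each metric per class and then takes the unweighted mean, so every class counts equally regardless of its size.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []

    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios are scored as 0.0 (a common convention).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example: the frequent class "a" is always right, "b" and "c" never are.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "c", "b"]
acc, macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(round(acc, 3), round(macro_f1, 3))
```

In the toy example accuracy is 4/6 while macro F1 is only 1/3, because two of the three classes are never predicted correctly. That gap is why reporting both numbers side by side is more informative than accuracy alone.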

[Figure: Machine learning workflow diagram covering data preparation, exploratory analysis, model experimentation, evaluation, and model selection]
Workflow evidence: data preparation, EDA, class balancing, model experimentation, evaluation, and selection.

[Figure: Model comparison for LA Crime classification showing accuracy, F1, and AUC across tested models]
Model evidence: SVM RBF led the usable baseline results with accuracy 0.321 and macro F1 0.303.

Result

SVM with RBF kernel and C = 10 became the strongest deployable baseline in the notebook, with accuracy 0.321 and macro F1 0.303 on the test set. The score is presented with context because the target space contains 138 modeled classes: on a class-balanced split, uniform random guessing would reach an accuracy of only about 1/138 ≈ 0.007.
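
A minimal scikit-learn sketch of this kind of baseline follows. Only the RBF kernel and C = 10 come from the write-up. The synthetic 10-class data, the `StandardScaler` step, the 80/20 stratified split, and the random seeds are all assumptions made for illustration, so the printed scores say nothing about the LA crime results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 10 balanced classes, NOT the LA crime records.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_classes=10,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# RBF-kernel SVM with C = 10 as named in the write-up.
# ASSUMPTION: features are standardized first; the notebook's preprocessing may differ.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
chance = 1 / 10  # uniform-guess accuracy for 10 balanced classes
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} chance={chance:.3f}")

# The same reference point for the case study's 138 balanced classes.
print(f"chance for 138 classes: {1 / 138:.4f}")
```

Printing the uniform-guess rate next to the model scores keeps the comparison honest: a score that looks low in isolation can still sit far above chance when the label space is large.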

What I’d Improve

I would audit the BPNN label mapping, add top-k accuracy, include confusion-matrix analysis for frequent crime classes, test boosting models such as XGBoost or LightGBM, and add feature importance or SHAP-style interpretability.
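
As one example of these improvements, top-k accuracy could be sketched as below. This is a hypothetical helper, not part of the notebook; the class names and scores are invented for illustration. scikit-learn also provides `top_k_accuracy_score` for the same purpose. The explicit `class_labels` argument is deliberate: the score columns must line up with the label order, which is the kind of mapping a BPNN audit would need to verify.

```python
def top_k_accuracy(y_true, scores, class_labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for true_label, row in zip(y_true, scores):
        # Rank class indices by score, highest first.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        top_labels = {class_labels[i] for i in ranked[:k]}
        hits += true_label in top_labels
    return hits / len(y_true)


# Hypothetical classes; column i of `scores` belongs to class_labels[i].
class_labels = ["burglary", "theft", "vandalism"]
scores = [
    [0.50, 0.30, 0.20],
    [0.40, 0.35, 0.25],
    [0.10, 0.30, 0.60],
    [0.45, 0.15, 0.40],
]
y_true = ["burglary", "theft", "theft", "vandalism"]

top1 = top_k_accuracy(y_true, scores, class_labels, k=1)
top2 = top_k_accuracy(y_true, scores, class_labels, k=2)
print(top1, top2)  # 0.25 1.0
```

With many fine-grained classes, a model that often ranks the correct code second or third can be useful for exploratory analysis even when its top-1 accuracy is modest, which is what this metric would expose.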

Responsible Use Note

This project should be read as an academic modeling exercise, not as an automated decision-making system for policing or legal action. Crime datasets can reflect reporting bias, location bias, and complex social context that model scores alone cannot resolve.
