Building a Predictive Model

This workshop is designed to give you the "gist" of how a predictive model can be built from historical data and why you might want to take this approach for a given problem.

This workshop will use typical advanced analytics tools. However, it is focused on illustrating concepts, not teaching specific tools, machine learning algorithms, or tactics to maximize performance. I will provide a list of recommended resources for those who wish to go deeper and am happy to help with any questions you have.

Key Concepts

  1. Business problems and predictive modeling solutions
  2. Preparing data for modeling
  3. Finding informative features*
  4. Building a baseline model and evaluating its performance
  5. Iteratively improving a model

*Also known as variables

1. Business Problems and Predictive Modeling Solutions

The first step of predictive modeling is mapping a business problem to a predictive modeling approach that can solve the problem. To be useful, the technology must solve the right business problem. Here is a fictional scenario that we will use for illustration:

The Seattle Police Department is undergoing a digital transformation initiative. The initiative aims to apply new technologies to improve services within the fixed resources allocated by the City Council. You are the first data scientist hired into a new predictive policing unit and are working through the long and tedious process of setting up your computer and obtaining access to the department's secure databases.

Suddenly, the police chief arrives unannounced in your cubicle. The chief just got off the phone with the mayor. The people of Seattle are protesting against the current level of car thefts, and local business owners are threatening to close their brick-and-mortar storefronts if shoplifting rates continue at current levels. The mayor cannot provide additional resources but has directed the chief to "find a way" to address the issues.

You quickly confirm that the existing patrol assignments cannot be changed or increased. However, you learn that officers can conduct their existing patrols in ways that are more aggressive towards mitigating either car thefts or shoplifting.

(Yes, this problem is contrived to give a very clear modeling illustration. More interesting crime pairs can also work, but the problem gets more complicated.)

Q: What is the business problem?

Q: What are potential predictive modeling solutions?

Q: What data do we have? What assumptions/limitations apply to historical, reported crime data?

With this insight, you propose using historical data to build a model that can predict whether a given patrol area at a given time is more likely to experience a car theft or shoplifting incident. You explain that the model can provide officers information for prioritization of car thefts vs. shoplifting in their patrols.

The chief likes your proposal and asks you to come back with a prototype model and an explanation of why the model should be trusted.

2. Preparing Data for Modeling

The second step is preparing data for modeling. Most of the time, predictive modeling requires a dataset formatted as follows:


  • Features are selected that might be relevant to the prediction being made. We generally start with everything that might be useful, and select or "engineer" the most informative features throughout the modeling process.
  • Labels are the values that we are trying to predict using the available features.
  • Samples are records that consist of features and a label.
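To make the three terms concrete, here is a minimal sketch using a tiny table of used car records. The column names and values are made up for illustration and are not part of the workshop data. Each row is a sample, `price` plays the role of the label, and the remaining columns are features.

```python
import pandas as pd

# Toy dataset with invented values: each row is one sample
cars = pd.DataFrame({
    "mileage": [42000, 88000, 15000, 120000],       # feature
    "age_years": [3, 7, 1, 11],                     # feature
    "make": ["Honda", "Ford", "Toyota", "Ford"],    # feature
    "price": [18500, 9200, 24000, 4300],            # label (value to predict)
})

# Separate the feature columns from the label column
features = cars.drop(columns=["price"])
labels = cars["price"]

print(features)
print(labels)
```

The same separation applies to any modeling dataset: the model only sees the feature columns as input and is asked to reproduce the label column.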

We begin by loading some standard packages (software add-ons) to the Python environment. The details of these packages are out of scope for this workshop, but basic descriptions are provided in the comments below.

In [1]:
# Configuration for plotting in this notebook
%matplotlib inline

# General manipulation of tabular data and geographic data
import pandas as pd
import geopandas as gpd

# Numerical computing and linear algebra
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams.update({'font.size': 15})
import seaborn as sns

# Algorithms and helper functions for training and evaluating machine learning models
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Standard package for programmatic access to your local filesystem (to be replaced with Snowflake) 
import os

# Constant variable for the filepath to your data
DATA_PATH = os.path.join('..', 'Data')

# This random seed ensures reproducibility of the modeling
In [2]:
# Sometimes have to re-run this in a second cell
matplotlib.rcParams.update({'font.size': 15})

Next, let's load the modeling dataset, which has been prepared from the data used in earlier workshops. Sometimes we will have to perform data preparation in Python using the pandas package. In this case, we have already prepared the data for you. We have included all reported car theft and shoplifting events starting in 2017.

2.1 Previewing the data

In [3]:
# Load data from a prepared .csv file
df = pd.read_csv('modeling_data.csv', parse_dates=['occurred_datetime', 'reported_datetime'])
df.head() # Look at the first five rows of data
report_number crime_subcategory primary_offense_description precinct sector beat neighborhood occurred_datetime reported_datetime population ... lat lon area_sq_meters avg_temp high_temp min_temp prcp fog heavy_fog smoke
0 20170000100027 theft-shoplift THEFT-SHOPLIFT EAST G G1 FIRST HILL 2017-03-21 20:31:00 2017-03-21 20:31:00 4893.979123 ... 47.604231 -122.319324 5.934704e+05 49 56 46 0.20 0 0 0
1 20170000100105 motor_vehicle_theft VEH-THEFT-RECREATION VEH SOUTHWEST F F3 SOUTH PARK 2017-03-20 15:00:00 2017-03-21 22:32:00 6077.596102 ... 47.526138 -122.336371 5.828516e+06 45 54 38 0.01 0 0 0
2 2017000010013 motor_vehicle_theft VEH-THEFT-AUTO EAST G G2 CENTRAL AREA/SQUIRE PARK 2017-01-08 21:00:00 2017-01-09 02:01:00 5005.383632 ... 47.604906 -122.298730 2.560248e+06 37 45 35 0.45 1 0 0
3 20170000100249 motor_vehicle_theft VEH-THEFT-AUTO NORTH L L2 NORTHGATE 2017-03-21 20:30:00 2017-03-21 22:47:00 5495.430608 ... 47.698570 -122.317705 5.082686e+06 49 56 46 0.20 0 0 0
4 20170000100475 motor_vehicle_theft VEH-THEFT-AUTO NORTH J J3 PHINNEY RIDGE 2017-03-21 20:00:00 2017-03-22 06:51:00 5629.742040 ... 47.681098 -122.338279 4.763904e+06 49 56 46 0.20 0 0 0

5 rows × 28 columns

Notice that the dataset is in a similar format to the used car example described above:

  • Each sample is a reported crime event.
  • "crime_subcategory" is the label, and can either be "theft-shoplift" or "motor_vehicle_theft"
  • All other columns are the raw features.

Our objective is to use the features to predict the crime-type label. A model of this format would allow an officer to input their next patrol time and location and receive a prediction of the more likely crime type.

2.2 Why do we hold out data?

After the dataset is in the correct format to begin modeling, we need to split the dataset and "hold out" a portion for evaluating our models. We use the majority of the historical data to train a model, and then we use the "held out" data to estimate how well we expect the model to work when predicting the labels for new, unseen records.

Q: What problems might we run into if we use the same data to build and then test a predictive model?


Overfitting - Why We Hold Data Out
Source: Andrew Ng's ML course

Important - Overfitting
The avoidance of overfitting is a fundamental aspect of any predictive modeling problem. We seek to learn meaningful patterns that will generalize to new samples, as opposed to simply memorizing a historical dataset.
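The effect can be seen in a small sketch on synthetic data (not the crime dataset). The data, the noise level, and the 800/200 split below are assumptions chosen only for illustration. An unconstrained decision tree memorizes the training samples, so its accuracy on them looks far better than its accuracy on held-out samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic data: only the first of five features is informative,
# and the label is deliberately noisy
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

# Use the first 800 samples for training and hold out the last 200
X_train, y_train = X[:800], y[:800]
X_test, y_test = X[800:], y[800:]

# A tree with no depth limit can fit every training sample exactly
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"Accuracy on training data: {train_acc:.2f}")
print(f"Accuracy on held-out data: {test_acc:.2f}")
```

If we had only measured accuracy on the training data, we would badly overestimate how well this model works on new samples.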

2.3 What part of the data should we hold out?

Imagine you are predicting the price of PACCAR stock. Which hold out strategy should you use to evaluate your model?


Daily Adjusted Close of PCAR with Different Data Held Out

We will sort the records by date and use the first (oldest) 80% for training a model and the last (most recent) 20% for evaluating the model. While this is not strictly a time series problem, the distribution of crime in a given week is likely related to the distribution in the prior and following weeks.

Q: Which records should we hold out for our crime problem?


An Illustration of Time-Based Splitting
In [4]:
# Sort the entire dataset by date and time
df.sort_values('occurred_datetime', inplace=True)

# Store the indices of the samples that correspond to the first 80% (train) and last 20% (test)
# We will use these later whenever we need to access only the train or test portion of the dataset
train_idx = np.arange(int(0.8*df.shape[0]))
test_idx = np.arange(int(0.8*df.shape[0]), df.shape[0])
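# The data is sorted, so min/max give the first and last dates in each split
print(f"First training data date: {df.iloc[train_idx]['occurred_datetime'].min().date()}")
print(f"Last training data date: {df.iloc[train_idx]['occurred_datetime'].max().date()}")
print(f"Last testing data date: {df.iloc[test_idx]['occurred_datetime'].max().date()}")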
First training data date: 2017-01-01
Last training data date: 2018-11-01
Last testing data date: 2019-05-06

3. Finding informative features

The next step of predictive modeling is common to any type of prediction: understanding what information you have and whether it is informative about the event or value that you need to predict. In general, we have two primary types of features for each incident:

  1. When the event happened
  2. Where the event happened (this includes attributes of the location where the event happened)

Important - Using Available Data
When making predictions, we can only use information that will be available when the model is used. In this case, we know the time and location of a future patrol and can use these features. However, if the dataset included a feature for total number of crimes reported per beat on the day of the event, we would not be able to use it for this application (that statistic is not known until the end of the day).
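One way to enforce this rule is to keep an explicit list of columns that are not known at prediction time and exclude them before modeling. The sketch below uses a toy table, and the column `crimes_in_beat_that_day` is hypothetical (it does not exist in the workshop dataset); it stands in for the end-of-day statistic described above.

```python
import pandas as pd

# Toy records; `crimes_in_beat_that_day` is a hypothetical column
records = pd.DataFrame({
    "hour": [8, 17, 22],
    "beat": ["G1", "F3", "L2"],
    "crimes_in_beat_that_day": [12, 7, 9],
    "crime_subcategory": ["theft-shoplift", "motor_vehicle_theft", "motor_vehicle_theft"],
})

LABEL = "crime_subcategory"

# Columns that would not be known when a patrol is being planned
# (assumed for illustration)
UNAVAILABLE_AT_PREDICTION = ["crimes_in_beat_that_day"]

# Keep every column that is neither the label nor unavailable
usable_features = [
    col for col in records.columns
    if col != LABEL and col not in UNAVAILABLE_AT_PREDICTION
]
X = records[usable_features]
y = records[LABEL]

print(usable_features)
```

Keeping the exclusion list in one place makes it easy to review with someone who knows how and when each field is actually recorded.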

3.1 Exploring crime type vs. when event occurs

In [5]:
# Extract year, month, day of week, and hour of day from the date and time of occurrence
# This creates new columns that only contain these specific variables
df['year'] = df['occurred_datetime'].dt.year
df['month'] = df['occurred_datetime'].dt.month
df['dayofweek'] = df['occurred_datetime'].dt.dayofweek
df['hour'] = df['occurred_datetime'].dt.hour

# Look at type of crime vs. year
# Count reports per (year, crime type) using only the training portion of the data
grp_year_category = df.iloc[train_idx].groupby(['year', 'crime_subcategory'], as_index=False).agg({'report_number':len})
fig, ax = plt.subplots(figsize=(8,4))
sns.barplot(x='year', y='report_number', hue='crime_subcategory', data=grp_year_category, ax=ax)
ax.set_xticklabels([2017, 2018])
ax.legend(loc='lower right')

The proportion of car thefts to shoplifting incidents appears fairly consistent from year to year. The comparison of raw counts is slightly skewed, however, because the last two months of 2018 fall in the test set and are missing from this chart.
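Because the years cover different numbers of months, it can be easier to compare shares than raw counts. The sketch below uses a toy table with made-up rows that borrows the workshop's column names; with the real data, the same call could be applied to `df.iloc[train_idx]`.

```python
import pandas as pd

# Toy records with invented values, using the workshop's column names
toy = pd.DataFrame({
    "year": [2017] * 5 + [2018] * 5,
    "crime_subcategory": (
        ["motor_vehicle_theft"] * 3 + ["theft-shoplift"] * 2
        + ["motor_vehicle_theft"] * 3 + ["theft-shoplift"] * 2
    ),
})

# normalize="index" rescales each row (year) so that its shares sum to 1
shares = pd.crosstab(toy["year"], toy["crime_subcategory"], normalize="index")
print(shares)
```

Each row then shows the fraction of that year's reports belonging to each crime type, which is unaffected by how many months of the year are included.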

In [6]:
# Look at type of crime vs. month
grp_month_category = df.iloc[train_idx].groupby(['month', 'crime_subcategory'], as_index=False).agg({'report_number':len})
fig, ax = plt.subplots(figsize=(14,6))
sns.barplot(x='month', y='report_number', hue='crime_subcategory', data=grp_month_category, ax=ax)
ax.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])