Modeling 2: Intro to classifiers with R

Intro and Objectives

In class we’ll spend some time learning to use logistic regression for binary classification problems - i.e. when our response variable has two possible outcomes (e.g. a customer either defaults on a loan or does not). We’ll also explore other simple classification approaches such as k-Nearest Neighbors and basic classification trees. Trees, forests, and their many variants have proved to be some of the most robust and effective techniques for classification problems.

This module will take us 1.5 weeks.

Readings

  • RforE - Sec 20.1 (logistic regression), Sec 23.4 (decision trees), Ch 26 (caret)

  • PDSwR - Ch 6 (kNN), 7.2 (logistic regression), 6.3 & 9.1 (trees and forests)

  • ISLR - Sec 3.5 (kNN), Sec 4.1-4.3 (Classification, logistic regression), Ch 8 (trees)

Downloads and other resources

Activities

We will work through a number of R Markdown and other files as we learn to build basic classifiers using R. Everything is available in the Downloads file above.

Intro to classification problems and the k-Nearest Neighbor technique

In this first part we’ll:

  • get a sense of what classification problems are all about,

  • get our first look at the very famous Iris dataset,

  • use a simple, model-free technique known as k-Nearest Neighbors to try to classify Iris species from a few physical measurements.

You’ll use knn/kNN_notes.Rmd and follow along with this screencast:
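If you’d like a preview of what the notes do, here’s a minimal kNN sketch using the class package and the built-in iris data. The 70/30 split and k = 5 are arbitrary choices for illustration, not necessarily what the notes use:

    # Minimal kNN sketch on the iris data (class package)
    library(class)

    set.seed(123)                                     # arbitrary seed for reproducibility
    train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
    train <- iris[train_idx, ]
    test  <- iris[-train_idx, ]

    # knn() takes the numeric features for train and test plus the training labels
    pred <- knn(train = train[, 1:4], test = test[, 1:4],
                cl = train$Species, k = 5)

    mean(pred == test$Species)                        # simple accuracy check

Note there is no fitted model object here - kNN simply stores the training data and votes among the k nearest neighbors at prediction time, which is why we call it model-free.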

Logistic regression

Logistic regression adapts multiple linear regression to the case where the response variable is binary (two possible outcomes): instead of modeling the response directly, it models the log-odds of one of the outcomes as a linear function of the predictors. It is a commonly used technique for binary classification problems. It’s definitely more “mathy” than kNN. I’ll try to help you develop some intuition and understanding of this technique without getting too deeply into the math/stat itself. See the Explore section at the bottom of this page for some good resources on the underlying math and stat of logistic regression.

You’ll use logistic_regression/IntroLogisticRegression_Loans_notes.Rmd and these screencasts:

We’ll start with a short introduction.

Now, we’ll review the statistical model and compare it to standard linear regression.
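In symbols: if p = P(Y = 1) is the probability of the outcome of interest, the logistic model is

    log(p / (1 - p)) = β0 + β1·x1 + ... + βk·xk

That is, the log-odds of the outcome - not the response itself - is a linear function of the predictors. Solving for p gives p = 1 / (1 + e^-(β0 + β1·x1 + ... + βk·xk)), which always lies between 0 and 1, whereas fitting a standard linear regression to a 0/1 response can produce fitted values outside that range.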

To do logistic regression in R, we use the glm(), or generalized linear model, function with family = binomial.
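A hedged sketch of what that looks like - here loans, default, balance, and income are hypothetical stand-ins for whatever data and variables the notes actually use:

    # Sketch of a logistic regression fit; the data frame and variable
    # names are hypothetical placeholders, not the actual notes data.
    loan_fit <- glm(default ~ balance + income,
                    data = loans,
                    family = binomial)   # binomial family => logistic regression

    summary(loan_fit)       # coefficients are on the log-odds scale
    exp(coef(loan_fit))     # exponentiate to interpret as odds ratios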

Next, we’ll do some model assessment and make predictions.
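Continuing the hypothetical sketch above, prediction looks something like this (loans_test and the 0.5 cutoff are illustrative assumptions):

    # type = "response" returns predicted probabilities rather than log-odds
    pred_prob  <- predict(loan_fit, newdata = loans_test, type = "response")
    pred_class <- ifelse(pred_prob > 0.5, "yes", "no")   # 0.5 is a common default cutoff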

Then we’ll do more model and prediction assessment using caret’s confusionMatrix() function.
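confusionMatrix() takes the predicted classes, the true classes, and (optionally) which level counts as the “positive” class. Continuing the same hypothetical sketch:

    library(caret)

    # Both arguments must be factors with the same levels; the labels here
    # follow the hypothetical example above.
    confusionMatrix(factor(pred_class, levels = c("no", "yes")),
                    factor(loans_test$default, levels = c("no", "yes")),
                    positive = "yes")

Besides the confusion matrix itself, the output includes accuracy, sensitivity, specificity, and several other summary statistics.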

We’ll end with our final model comparisons and attempts at improvement.

Decision trees

Now on to learning about decision trees and variants such as random forests. You’ll use trees/classification_trees_notes.Rmd with these screencasts.

We’ll start with a short introduction.
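As a preview, fitting and plotting a basic classification tree might look like this sketch, which uses the rpart and rpart.plot packages on the iris data (the notes may use different packages or data):

    # Minimal classification tree sketch (rpart package)
    library(rpart)
    library(rpart.plot)   # assumed available; used only for the plot

    tree_fit <- rpart(Species ~ ., data = iris, method = "class")
    rpart.plot(tree_fit)                      # visualize the splits
    head(predict(tree_fit, type = "class"))   # predicted classes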

So, how do decision trees decide how to create their branches? We’ll take a very brief look at this and point you to some resources to go deeper if you want.
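For example, rpart’s default splitting criterion for classification is Gini impurity: a node whose class proportions are p1, ..., pK scores 1 - (p1² + ... + pK²), so a pure node scores 0, and candidate splits are scored by how much they reduce impurity. A tiny illustration:

    # Gini impurity of a vector of class labels: 1 - sum(p_k^2)
    gini <- function(labels) {
      p <- prop.table(table(labels))   # class proportions
      1 - sum(p^2)
    }

    gini(c("a", "a", "a", "a"))   # 0   -- pure node
    gini(c("a", "a", "b", "b"))   # 0.5 -- maximally mixed (two classes)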

We’ll end with our final model comparisons and attempts at improvement.
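One standard improvement over a single tree is a random forest, which combines the votes of many trees grown on bootstrapped samples of the data. A minimal sketch using the randomForest package (the notes may use caret’s train() wrapper or different tuning):

    # Sketch of a random forest fit (randomForest package)
    library(randomForest)

    set.seed(123)
    rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500)
    rf_fit                 # printout includes the out-of-bag (OOB) error estimate
    varImpPlot(rf_fit)     # variable importance (mean decrease in Gini)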

Putting it all together - the Kaggle Titanic challenge (OPTIONAL)

This is the famous Kaggle practice competition that so many people have used as a first introduction to predictive modeling and to Kaggle. A number of very nice tutorials have been developed to help newcomers to Kaggle. So, take a look at the following R Markdown document. In addition to a little bit of EDA and some basic model building, you’ll find some interesting attempts at feature engineering, as well as code for creating output files suitable for submitting to Kaggle to get scored. The Titanic challenge runs perpetually, so feel free to try it out. Just don’t pay much attention to the leaderboard, as people have figured out ways to get 100% predictive accuracy.

  • titanic/Titanic_kaggle.Rmd

Explore (OPTIONAL)