Upcoming Assignments/Quizzes
Assignments | Open Time | Due Time |
---|---|---|
Module 11 Data Quiz | November 9th (1:00 am EST) | November 11th (11:55 pm EST) |
Module 11 Conceptual Quiz | November 9th (1:00 am EST) | November 11th (11:55 pm EST) |
Notes on Simple and Multiple Linear Regression
One of the more common requests/comments we get from students is a desire to see more and different examples. So for this week’s TA notes I’ll go through some of the concepts related to linear regression we’ve learned in the past two modules.
You’ll notice that I may use some different methods for subsetting and maniputlating data and plotting than we’ve covered in the course materials. There is always more than one way to do things in R, but these shouldn’t impact our final results and sometimes it’s helpful to see a different approach.
Load libraries and data
I’m going to use a couple of packages from the “tidyverse” suite of R packages. I like using these packages because I find the code easier to read and plots prettier than base R.
# Run this line if you have not installed these packages:
# install.packages(c("dplyr", "ggplot2"))
library(readr) # For reading in dataframe
library(dplyr) # For dataframe manipulation
library(ggplot2) # For nice plots
For this exercise, we’re going to use simple and multiple linear regression to model IMDB scores for movies from their top 5,00 movies (as of 2016). Here’s is the code to load this data into R and format the columns we need:
imdb <- read_csv("https://query.data.world/s/53o5lmqz56eb6pxsdc3qzh5glfavkv") %>%
select(title = movie_title, score = imdb_score, budget, gross, duration,
release_year = title_year, fb_likes = movie_facebook_likes,
rating = content_rating) %>%
na.omit() %>% # Removes NAs
filter_if(is.numeric, all_vars(. > 0)) # Removes rows with zeros
title | score | budget | gross | duration | release_year | fb_likes | rating |
---|---|---|---|---|---|---|---|
Avatar | 7.9 | 237000000 | 760505847 | 178 | 2009 | 33000 | PG-13 |
Spectre | 6.8 | 245000000 | 200074175 | 148 | 2015 | 85000 | PG-13 |
The Dark Knight Rises | 8.5 | 250000000 | 448130642 | 164 | 2012 | 164000 | PG-13 |
John Carter | 6.6 | 263700000 | 73058679 | 132 | 2012 | 24000 | PG-13 |
Tangled | 7.8 | 260000000 | 200807262 | 100 | 2010 | 29000 | PG |
Avengers: Age of Ultron | 7.5 | 250000000 | 458991599 | 141 | 2015 | 118000 | PG-13 |
Some of this code may look a little strange. The biggest difference is the %>%
, which is called a “pipe”. The pipe is a special operator that comes from the magrittr
package (dplyr
is calling it for us here), and what it does is take the output from whatever comes before the pipe and puts it into the the first argument of the next function.
This is useful when using the dpylr
package, which is set up to string together functions, kind of like verbs in a sentence. This does two things, it means we don’t need to make intermediate objects that won’t be used and it makes the code easier for humans to read. The code above can be “read” as:
First read a CSV from this location, then select these columns, then omit rows with NA, and finally filter out rows with zeros.
Looking at our data, our response variables is going to be imdb$score
, and we have a few different predictors: film budget, gross revenue made, duration, year of release, and even Facebook likes!
Checking correlations
First things first, let’s look at the correlations in our data. This will help us get an idea for which predictor variables are highly associated with our response variable, and tell which covariates are correlated (which may be important for our multiple regressions).
imdb %>%
select(- c(title, rating)) %>% # remove unneeded columns
cor() %>%
round(2)
score | budget | gross | duration | release_year | fb_likes | |
---|---|---|---|---|---|---|
score | 1.00 | 0.07 | 0.28 | 0.40 | -0.05 | 0.38 |
budget | 0.07 | 1.00 | 0.41 | 0.22 | 0.18 | 0.25 |
gross | 0.28 | 0.41 | 1.00 | 0.29 | 0.13 | 0.47 |
duration | 0.40 | 0.22 | 0.29 | 1.00 | -0.08 | 0.31 |
release_year | -0.05 | 0.18 | 0.13 | -0.08 | 1.00 | 0.36 |
fb_likes | 0.38 | 0.25 | 0.47 | 0.31 | 0.36 | 1.00 |
We can visualize these correlations using the ggcorrplot
package:
ggcorrplot::ggcorrplot(
imdb %>%
select(- c(title, rating)) %>%
cor(),
type = "lower",
method = "circle",
lab = T
)
Simple linear regression
One hypothesis could be that movies that make more money should have a hinger rating, so let’s use a simple linear regresion using gross revenue imdb$gross
to predict IMDB score imdb$score
. First, we’ll look at these varibles using a LOESS regression.
ggplot(imdb, aes(gross, score)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "loess", se = FALSE, size = 1.5) +
labs(x = "Gross revenue ($)", y = "IMDb rating",
title = "IMDb rating by gross revenue",
subtitle = "LOESS regression")
Hard to say if this will be a good predictors, there are a lot of low budget films! Let’s make the linear model and check some assumptions:
model <- lm(score ~ gross, data = imdb)
# Check assumptions
# Plot fitted vs. residuals
ggplot() +
geom_point(aes(x = model$fitted.values, y = model$residuals)) +
geom_hline(yintercept = 0, color = "black", linetype = 2) +
labs(title = "Fitted vs. Residuals",
subtitle = "IMDb Score ~ Gross Rev.",
x = "Fitted values",
y = "Residuals")
# QQ-plot
ggplot() +
stat_qq(aes(sample = rstandard(model))) +
stat_qq_line(aes(sample = rstandard(model))) +
labs(title = "Q-Q plot",
subtitle = "IMDb Score ~ Gross Rev.")
Our linear assumptions are not exactly what we would hope, likely due to the skewness of our predictor variable:
ggplot(imdb) +
geom_histogram(aes(gross)) +
labs(title = "Gross revenue is heavily right-skewed",
x = "Gross revenue ($)", y = "")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can try a log transformation to see if this better fits our assumptions?
# Try the log transform
ggplot(imdb, aes(gross, score)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "loess", se = FALSE, size = 1.5) +
scale_x_log10() +
labs(x = "Gross revenue ($)", y = "IMDb rating",
title = "IMDb rating by gross revenue",
subtitle = "LOESS regression")
ggplot(imdb) +
geom_histogram(aes(log(gross))) +
labs(title = "log of Gross revenue is less skewed",
x = "log(Gross revenue ($))", y = "")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Make new lm
log_model <- lm(score ~ log(gross), data = imdb)
# Check assumptions
# Plot fitted vs. residuals
ggplot() +
geom_point(aes(x = log_model$fitted.values, y = log_model$residuals)) +
geom_hline(yintercept = 0, color = "black", linetype = 2) +
labs(title = "Fitted vs. Residuals",
subtitle = "IMDb Score ~ log(Gross Rev.)",
x = "Fitted values",
y = "Residuals")
# QQ-plot
ggplot() +
stat_qq(aes(sample = rstandard(log_model))) +
stat_qq_line(aes(sample = rstandard(log_model))) +
labs(title = "Q-Q plot",
subtitle = "IMDb Score ~ log(Gross Rev.)")
This looks a little better, now let’s interpret some of these results.
anova(log_model)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## log(gross) 1 80.65 80.645 63.351 2.824e-15 ***
## Residuals 2070 2635.10 1.273
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(log_model)
##
## Call:
## lm(formula = score ~ log(gross), data = imdb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0053 -0.6923 0.1203 0.8018 2.7758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.05447 0.17880 28.269 < 2e-16 ***
## log(gross) 0.08565 0.01076 7.959 2.82e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.128 on 2070 degrees of freedom
## Multiple R-squared: 0.0297, Adjusted R-squared: 0.02923
## F-statistic: 63.35 on 1 and 2070 DF, p-value: 2.824e-15
These tables tell us all sorts of useful information (sum of squares, mean square error, etc.). Perhaps most importantly, we see that the slope estimate has an assocaited p-value < 0.05, and that it is positive. This suggests that movies that make more money do have higher IMDb ratings!
However, it is important to note that we’ve log-transformed our predictor variables, which makes it less clear to interpet. Now let’s plot these findings.
m <- summary(log_model)
ggplot(imdb) +
geom_point(aes(log(gross), score), alpha = 0.1) +
geom_abline(intercept = m$coefficients[1,1],
slope = m$coefficients[2,1], color = "#e74c3c") +
# Let's also add the 95% confidence interval
geom_abline(intercept = m$coefficients[1,1] - 1.96*m$coefficients[1,2],
slope = m$coefficients[2,1] - 1.96*m$coefficients[2,2],
color = "#e74c3c", linetype = 2) +
geom_abline(intercept = m$coefficients[1,1] + 1.96*m$coefficients[1,2],
slope = m$coefficients[2,1] + 1.96*m$coefficients[2,2],
color = "#e74c3c", linetype = 2) +
labs(x = "log Gross revenue ($)", y = "IMDb rating",
title = "IMDb rating by gross revenue",
subtitle = "IMDb Score ~ log(Gross Rev.)")