Chapter 3 Simple Linear Regression Models

3.1 Building an SLR Model

lm stands for Linear Model and is the function used for Linear Regression

model <- lm(Y ~ X, data)

Practical 1 Example

data(faithful)
model <- lm(waiting ~ eruptions, faithful)
summary(model)

## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0796  -4.4831   0.2122   3.9246  15.9719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.4744     1.1549   28.98   <2e-16 ***
## eruptions    10.7296     0.3148   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

Useful functions to extract data from model:

summ <- summary(model)

coef(model) gives coefficients
fitted(model) returns the vector of the fitted values, $\hat{y}_i = b_0 + b_1 x_i$
resid(model) (or summ$residuals) returns vector of residuals, $e_i = y_i - \hat{y}_i$
summ$coefficients gives more information on coefficient estimates (standard error, t-statistic, corresponding two-sided p-value)
summ$sigma extracts regression standard error
summ$r.squared returns value of $R^2$

3.2 Plotting an SLR Model

Using Base R

plot(faithful$waiting ~ faithful$eruptions, xlab="Eruption Time (m)",
     ylab="Waiting Time Between Eruptions (m)", pch=16, col="cornflowerblue")
abline(model, col="red")

Using `ggplot`

ggplot(faithful, aes(x=eruptions, y=waiting)) +
  geom_point() +
  geom_smooth(method=lm, formula = y~x, se=FALSE)

geom_smooth Documentaion

3.3 Diagnostic Plots and Residual Analysis

Infant Mortality and GDP Example from MLLN Notes Section 3.3

model1 <- fit<- lm(infantMortality ~ ppgdp, data=newUN)
plot(newUN$infantMortality ~ newUN$ppgdp, xlab="GDP per Capita",
     ylab="Infant mortality (per 1000 births)", pch=16, col="cornflowerblue",
     main="Model 1")
abline(model1,col="red")

model2 <- lm(log(infantMortality) ~ log(ppgdp), data=newUN)
plot(log(newUN$infantMortality) ~ log(newUN$ppgdp), pch=16, col="cornflowerblue",
     main="Model 2")
abline(model2,col="red")

model1 clearly doesn’t fit SLR, we can confirm this be looking at diagnostic plots
model2 which is a transformation fits better and is an example of good diagnostic plots

Residual Plot

plot(model1, which=1, pch=16, col="cornflowerblue", main="Model 1 Residual Plot")

Comparison for Model 1 (poor fit) and Model 2 (good fit)

par(mfrow=c(1,2))
plot(model1, which=1, pch=16, col="cornflowerblue", main="Model 1 Residual Plot")
plot(model2,which=1,pch=16,col="cornflowerblue", main="Model 2 Residual Plot")

Example Using Simpler Methods (From Practical 1)

model is as from Practical 1

par(mfrow=c(1,2))
plot(y = resid(model), x=fitted(model)) #residuals against fitted values
plot(y = resid(model), x=d) #residuals against raw values

Residual Q-Q Plot and Histogram

Residual Q-Q Plot

plot(model1, which=2, pch=16, col="cornflowerblue", main="Model 1 Q-Q Plot")

Residual Histogram

hist(resid(model1), col="cornflowerblue", main="Model 1 Residual Histogram")

Comparison for Model 1 (poor fit) and Model 2 (good fit)

par(mfrow=c(2,2))
# Model 1 (before transformation)
plot(model1, which = 2,pch=16, col="cornflowerblue", main="Model 1 Q-Q")
hist(resid(model1),col="cornflowerblue", main="Model 1 Resid Hist")
# Model 2 (after transformation)
plot(model2, which = 2, pch=16, col="hotpink3", main="Model 2 Q-Q")  
hist(resid(model2),col="hotpink3", main="Model 2 Resid Hist")

Example Using Simpler Methods (From Practical 1)

model is as from Practical 1

par(mfrow=c(1,2))
hist(resid(model))
qqnorm(resid(model))

3.4 Transforming Regression Variables

When a linear regression model doesn’t look like a good fit, it may be appropriate to transform one or both of the variables.

Example Making Log transformation (for Y and X)

transformed_model <- lm(log(Y) ~ log(X), data)

Example Making Polynomial Transformation

transformed_model <- lm(Y ~ I(X^2), data) # I() is a general wrapper

Using ggplot

Change the formula parameter in geom_smooth

3.5 Confidence and Prediction Intervals

3.5.1 Confidence Intervals

Easiest to use confint(). confint(model, level=...) gives a confidence interval for each coefficient.

Example from Notes

carSales<-data.frame(Price=c(85,103,70,82,89,98,66,95,169,70,48),
                     Age=c(5,4,6,5,5,5,6,6,2,7,7))
reg <- lm(Price ~ Age, carSales)
confint(reg, level=0.95) #CI for parameters

##                 2.5 %    97.5 %
## (Intercept) 160.99243 229.94451
## Age         -26.59419 -13.92833

Practical 1 Example

Again, model is as from Practical 1

beta1hat <-  coef(model)[2]
se.beta1 <- summary(model)$coefficients[2,2]
n <- length(w)
  
#The Confidence Interval is
beta1hat + c(-1,1) * qt(0.975, n-2) * se.beta1

## [1] 10.10996 11.34932

# or
confint(model, level=0.95)[2,]

##    2.5 %   97.5 % 
## 10.10996 11.34932

CI at a new point When newpoints is a data frame with same x column name and a new point you can calculate the model CI around that point using predict() with interval = "confidence".

predict(model, newdata = newpoints, interval = "confidence", level = 0.95)

3.5.2 Prediction Intervals

Easiest to use predict(). To get a prediciton interval around a new point, use predict() with interval = "prediction".

predict(model, newdata = newpoints, interval = "prediction", level = 0.95)

3.5.3 Plotting Confidence and Prediction Intervals

Base R

To plot a CI or PI line, use seq(a, b, by = ...) and predict() with newdata = data.frame(X = seq(a, b, by = ...)) and then plot the output using abline(). Using interval = "confidence" or interval = "prediction" depending on if you want a CI or a PI.

ggplot Confidence Interval

Let se = TRUE in geom_smooth.

ggplot Prediction Interval:

temp_var <- predict(model, interval="prediction")
new_df <- cbind(data1, temp_var)

ggplot(new_df, aes(x=X, y=Y))+
    geom_point() +
    geom_line(aes(y=lwr), color = "red", linetype = "dashed")+
    geom_line(aes(y=upr), color = "red", linetype = "dashed")+
    geom_smooth(method=lm, se=TRUE)

3.6 Misc

Pearson Correlation Coefficient ($r$) cor(A, B) measures linear relationship between A and B and $r = \sqrt{R^2}$