Chapter 3 Simple Linear Regression Models
3.1 Building an SLR Model
lm
stands for Linear Model and is the function used for Linear Regression
<- lm(Y ~ X, data) model
Practical 1 Example
data(faithful)
<- lm(waiting ~ eruptions, faithful)
model summary(model)
##
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0796 -4.4831 0.2122 3.9246 15.9719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.4744 1.1549 28.98 <2e-16 ***
## eruptions 10.7296 0.3148 34.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
Useful functions to extract data from model:
summ <- summary(model)
coef(model)
gives coefficientsfitted(model)
returns the vector of the fitted values, \(\hat{y}_i = b_0 + b_1 x_i\)resid(model)
(orsumm$residuals
) returns vector of residuals, \(e_i = y_i - \hat{y}_i\)summ$coefficients
gives more information on coefficient estimates (standard error, t-statistic, corresponding two-sided p-value)summ$sigma
extracts regression standard errorsumm$r.squared
returns value of \(R^2\)
3.2 Plotting an SLR Model
Using Base R
plot(faithful$waiting ~ faithful$eruptions, xlab="Eruption Time (m)",
ylab="Waiting Time Between Eruptions (m)", pch=16, col="cornflowerblue")
abline(model, col="red")
3.3 Diagnostic Plots and Residual Analysis
Infant Mortality and GDP Example from MLLN Notes Section 3.3
<- fit<- lm(infantMortality ~ ppgdp, data=newUN)
model1 plot(newUN$infantMortality ~ newUN$ppgdp, xlab="GDP per Capita",
ylab="Infant mortality (per 1000 births)", pch=16, col="cornflowerblue",
main="Model 1")
abline(model1,col="red")
<- lm(log(infantMortality) ~ log(ppgdp), data=newUN)
model2 plot(log(newUN$infantMortality) ~ log(newUN$ppgdp), pch=16, col="cornflowerblue",
main="Model 2")
abline(model2,col="red")
model1
clearly doesn’t fit SLR, we can confirm this be looking at diagnostic plotsmodel2
which is a transformation fits better and is an example of good diagnostic plots
Residual Plot
plot(model1, which=1, pch=16, col="cornflowerblue", main="Model 1 Residual Plot")
Comparison for Model 1 (poor fit) and Model 2 (good fit)
par(mfrow=c(1,2))
plot(model1, which=1, pch=16, col="cornflowerblue", main="Model 1 Residual Plot")
plot(model2,which=1,pch=16,col="cornflowerblue", main="Model 2 Residual Plot")
Example Using Simpler Methods (From Practical 1)
model
is as from Practical 1
par(mfrow=c(1,2))
plot(y = resid(model), x=fitted(model)) #residuals against fitted values
plot(y = resid(model), x=d) #residuals against raw values
Residual Q-Q Plot and Histogram
Residual Q-Q Plot
plot(model1, which=2, pch=16, col="cornflowerblue", main="Model 1 Q-Q Plot")
Residual Histogram
hist(resid(model1), col="cornflowerblue", main="Model 1 Residual Histogram")
Comparison for Model 1 (poor fit) and Model 2 (good fit)
par(mfrow=c(2,2))
# Model 1 (before transformation)
plot(model1, which = 2,pch=16, col="cornflowerblue", main="Model 1 Q-Q")
hist(resid(model1),col="cornflowerblue", main="Model 1 Resid Hist")
# Model 2 (after transformation)
plot(model2, which = 2, pch=16, col="hotpink3", main="Model 2 Q-Q")
hist(resid(model2),col="hotpink3", main="Model 2 Resid Hist")
Example Using Simpler Methods (From Practical 1)
model
is as from Practical 1
par(mfrow=c(1,2))
hist(resid(model))
qqnorm(resid(model))
3.4 Transforming Regression Variables
When a linear regression model doesn’t look like a good fit, it may be appropriate to transform one or both of the variables.
Example Making Log transformation (for Y and X)
<- lm(log(Y) ~ log(X), data) transformed_model
Example Making Polynomial Transformation
<- lm(Y ~ I(X^2), data) # I() is a general wrapper transformed_model
Using ggplot
Change the formula
parameter in geom_smooth
3.5 Confidence and Prediction Intervals
3.5.1 Confidence Intervals
Easiest to use confint()
. confint(model, level=...)
gives a confidence interval for each coefficient.
Example from Notes
<-data.frame(Price=c(85,103,70,82,89,98,66,95,169,70,48),
carSalesAge=c(5,4,6,5,5,5,6,6,2,7,7))
<- lm(Price ~ Age, carSales)
reg confint(reg, level=0.95) #CI for parameters
## 2.5 % 97.5 %
## (Intercept) 160.99243 229.94451
## Age -26.59419 -13.92833
Practical 1 Example
Again, model
is as from Practical 1
<- coef(model)[2]
beta1hat <- summary(model)$coefficients[2,2]
se.beta1 <- length(w)
n
#The Confidence Interval is
+ c(-1,1) * qt(0.975, n-2) * se.beta1 beta1hat
## [1] 10.10996 11.34932
# or
confint(model, level=0.95)[2,]
## 2.5 % 97.5 %
## 10.10996 11.34932
CI at a new point
When newpoints
is a data frame with same x
column name and a new point you can calculate the model CI around that point using predict()
with interval = "confidence"
.
predict(model, newdata = newpoints, interval = "confidence", level = 0.95)
3.5.2 Prediction Intervals
Easiest to use predict()
. To get a prediciton interval around a new point, use predict()
with interval = "prediction"
.
predict(model, newdata = newpoints, interval = "prediction", level = 0.95)
3.5.3 Plotting Confidence and Prediction Intervals
Base R
To plot a CI or PI line, use seq(a, b, by = ...)
and predict()
with newdata = data.frame(X = seq(a, b, by = ...))
and then plot the output using abline()
. Using interval = "confidence"
or interval = "prediction"
depending on if you want a CI or a PI.
ggplot
Confidence Interval
Let se = TRUE
in geom_smooth
.
ggplot
Prediction Interval:
<- predict(model, interval="prediction")
temp_var <- cbind(data1, temp_var)
new_df
ggplot(new_df, aes(x=X, y=Y))+
geom_point() +
geom_line(aes(y=lwr), color = "red", linetype = "dashed")+
geom_line(aes(y=upr), color = "red", linetype = "dashed")+
geom_smooth(method=lm, se=TRUE)