In this chapter, you will learn about statistical inference at the model-level for regression models. To do so, we will use the keith-gpa.csv data to examine whether time spent on homework is related to GPA. The data contain three attributes collected from a random sample of 100 students.
# Load libraries
library(broom)
library(corrr)
library(dplyr)
library(ggplot2)
library(readr)
# Read in data
keith = read_csv(file = "https://raw.githubusercontent.com/zief0002/modeling/master/data/keith-gpa.csv")

# View data
head(keith)
# A tibble: 6 × 3
gpa homework parent_ed
<dbl> <dbl> <dbl>
1 78 2 13
2 79 6 14
3 79 1 13
4 89 5 13
5 82 3 16
6 77 4 13
In the previous chapter, we looked at how to carry out statistical tests of hypotheses and quantify uncertainty associated with the coefficients in the simple regression model. Sometimes you are interested in the model as a whole, rather than the individual parameters. For example, you may be interested in whether a set of predictors together explains variation in the outcome. Model-level information is displayed using the glance()
output from the broom package. Below we fit a model by regressing GPA on time spent on homework, store those results in an object called lm.a
, and then print the model-level output.
# Fit regression model
lm.a = lm(gpa ~ 1 + homework, data = keith)
# Model-level output
glance(lm.a)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.107 0.0981 7.24 11.8 0.000885 1 -339. 684. 691.
deviance df.residual nobs
<dbl> <int> <int>
1 5136. 98 100
The r.squared
column indicates the proportion of variation in the outcome explained by differences in the predictor in the sample. Here, differences in time spent on homework explain 10.7% of the variation in students' GPAs for the 100 students in the sample. The inferential question at the model-level is: Does the model explain variation in the outcome, in the population? This can formally be expressed in a statistical hypothesis as,

$$H_0: \rho^2 = 0$$

To test this, we need to be able to obtain the sampling distribution of $R^2$ under the assumption that this null hypothesis is true.
Figure 2: Thought experiment for sampling samples of size n from the population to obtain the sampling distribution of R-squared.
Below is a density plot of the sampling distribution for $R^2$ under this thought experiment.
Figure 13: Sampling distribution of $R^2$ based on 1000 simple random samples of size 32 drawn from a population where $\rho^2 = 0$.
Most of the $R^2$ values in this sampling distribution are near zero, and the distribution is right-skewed rather than symmetric. Because of this shape, we do not compare the observed $R^2$ to a t-distribution; instead, model-level inference relies on a different test statistic and reference distribution: the F-distribution.
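To make this thought experiment concrete, here is a small simulation sketch that mirrors the figure: it draws 1000 samples of size 32 from a population in which the predictor and outcome are unrelated (so $\rho^2 = 0$), fits the regression in each sample, and plots the resulting $R^2$ values. The use of rnorm() to generate the population and the object names are illustrative assumptions, not part of the original example.

# Simulate the sampling distribution of R-squared when rho-squared = 0
set.seed(42)

r2_values = replicate(1000, {
  x = rnorm(n = 32)                  # Predictor
  y = rnorm(n = 32)                  # Outcome generated independently of x
  summary(lm(y ~ 1 + x))$r.squared   # R-squared from the fitted model
})

# Density plot of the simulated R-squared values
ggplot(data = data.frame(r2 = r2_values), aes(x = r2)) +
  geom_density() +
  xlab("R-squared") +
  ylab("Density") +
  theme_bw()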
In theoretical statistics, the F-distribution is the ratio of two chi-squared statistics,

$$F = \frac{\chi^2_1 / df_1}{\chi^2_2 / df_2}$$

where $df_1$ and $df_2$ are the degrees of freedom associated with each chi-squared statistic.
In regression analysis, the F-distribution associated with model-level inference is based on the following degrees of freedom:

$$df_1 = p \qquad \text{and} \qquad df_2 = n - p - 1$$

where p is the number of predictors used in the model and n is the sample size.
The F-distribution is the sampling distribution of F-values (not $R^2$ values), so we first need to convert the sample $R^2$ value to an F-value:

$$F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}$$

In our example, $R^2 = 0.107$, $p = 1$, and $n = 100$, which gives $df_1 = 1$ and $df_2 = 98$.
Thus, our observed F-value is:

$$F = \frac{0.107 / 1}{(1 - 0.107) / 98} = \frac{0.107}{0.0091} = 11.74$$
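As a quick check of this arithmetic, we can carry out the conversion in R using the values from the glance() output (the object names below are just for illustration):

# Convert the sample R-squared value to an F-value
r2 = 0.107
p = 1          # Number of predictors
n = 100        # Sample size

(r2 / p) / ((1 - r2) / (n - p - 1))

This returns approximately 11.74.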
Figure 3: Plot of the probability curve for the F(1,98) distribution. The shaded area under the curve represents the p-value for a test evaluating whether the population rho-squared is zero using an observed F-value of 11.74.
This area, which is one-sided since F-values cannot be negative and only large F-values constitute evidence against the null hypothesis, is the p-value: p = 0.0009.
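If you want to compute this area directly rather than read it off the plot, the pf() function returns cumulative probabilities for the F-distribution; a minimal sketch:

# Area under the F(1, 98) curve to the right of F = 11.74 (the p-value)
pf(11.74, df1 = 1, df2 = 98, lower.tail = FALSE)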
In practice, all of this information is provided in the output of the glance()
function.
glance(lm.a)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.107 0.0981 7.24 11.8 0.000885 1 -339. 684. 691.
deviance df.residual nobs
<dbl> <int> <int>
1 5136. 98 100
The observed F-value is given in the statistic
column and the associated degrees of freedom are provided in the df
and df.residual
columns. Lastly, the p-value is given in the p.value
column. When we report results from an F-test, we need to report the values for both degrees of freedom, the F-value, and the p-value.
The model-level test suggested that the empirical data are not consistent with the null hypothesis that the model explains no variation in GPAs; $F(1, 98) = 11.76$, $p = 0.0009$.
We can also get the model-level inferential information from the anova()
output. This gives us the ANOVA decomposition for the model.
anova(lm.a)
Analysis of Variance Table
Response: gpa
Df Sum Sq Mean Sq F value Pr(>F)
homework 1 616.5 616.54 11.763 0.0008854 ***
Residuals 98 5136.4 52.41
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that the two df values for the model-level F-statistic correspond to the df in each row of the ANOVA table. The first df (in this case, 1) is the model degrees-of-freedom, and the second df (in this case, 98) is the residual degrees-of-freedom. Note the p-value is the same as that from the glance()
function.
This ANOVA decomposition also breaks out the sum of squares into the variation explained by the model (616.5) and that which is unexplained by the model (residual variation; 5136.4). Summing these two values gives the total amount of variation, which can be used to compute $R^2$:

$$R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{616.5}{616.5 + 5136.4} = \frac{616.5}{5752.9} = 0.107$$
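To verify this computation in R (using the rounded sums of squares from the ANOVA table; the object names are just for illustration):

# R-squared from the ANOVA decomposition
ss_model = 616.5
ss_residual = 5136.4

ss_model / (ss_model + ss_residual)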
This decomposition also gives us another way to consider the F-statistic. Recall that the F-statistic had a direct relationship to $R^2$:

$$F = \frac{R^2 / df_{\text{Model}}}{(1 - R^2) / df_{\text{Residual}}}$$

Since $R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}}$ and $1 - R^2 = \frac{SS_{\text{Residual}}}{SS_{\text{Total}}}$, we can substitute these ratios into the formula. Using algebra, the $SS_{\text{Total}}$ terms cancel, and the F-statistic can be written as:

$$F = \frac{SS_{\text{Model}} / df_{\text{Model}}}{SS_{\text{Residual}} / df_{\text{Residual}}}$$

This expression of the F-statistic helps us see that the F-statistic is proportional to the ratio of the explained and unexplained variation. So long as the degrees of freedom remain the same, if the model explains more variation, the numerator of the F-statistic gets larger and the denominator gets smaller. Thus, larger F-values are associated with more variation explained by the model. We could also have seen this in the earlier expression of the F-statistic using $R^2$: as $R^2$ increases, $1 - R^2$ decreases.
In statistical theory, a sum of squares divided by its degrees of freedom is referred to as a mean square value, the average amount of variation per degree of freedom. Since the numerator and denominator of the F-statistic are each a sum of squares divided by its degrees of freedom, the F-statistic is a ratio of two mean squares:

$$F = \frac{MS_{\text{Model}}}{MS_{\text{Residual}}}$$

Thus the F-value is the ratio of the average variation explained by the model to the average variation that remains unexplained. In our example,

$$F = \frac{616.54 / 1}{5136.4 / 98} = \frac{616.54}{52.41} = 11.76$$
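We can verify this ratio in R using the sums of squares and degrees of freedom from the ANOVA table (again, the object names are illustrative):

# F-value as the ratio of the model and residual mean squares
ms_model = 616.54 / 1        # SS_Model / df_Model
ms_residual = 5136.4 / 98    # SS_Residual / df_Residual

ms_model / ms_residual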
These values are also printed in the anova()
output.
anova(lm.a)
Analysis of Variance Table
Response: gpa
Df Sum Sq Mean Sq F value Pr(>F)
homework 1 616.5 616.54 11.763 0.0008854 ***
Residuals 98 5136.4 52.41
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The observed F-value of 11.76 indicates that the average explained variation is 11.76 times that of the average unexplained variation. There is an awful lot more explained variation than unexplained variation, on average. Another name for a mean square value is a variance estimate. A variance estimate is literally the average amount of variation (in the squared metric) per degree of freedom. For example, go back to the introductory statistics formula for using sample data to estimate a variance:

$$\hat{\sigma}^2 = \frac{\sum (Y_i - \bar{Y})^2}{n - 1}$$

The numerator is a sum of squares, namely the total sum of squares, and the denominator is the total degrees of freedom ($n - 1$). In other words, the sample variance is itself a mean square: the total mean square.
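For example, we can compute this estimate directly from the data; a quick sketch using the keith data:

# Sample variance of GPA computed from its definition
sum((keith$gpa - mean(keith$gpa))^2) / (nrow(keith) - 1)

This returns the sample variance of GPA, roughly 58.1.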
Note that the total mean square is not printed in the anova() output. However, it can be computed from the values that are printed. The total sum of squares is the sum of the model and residual sums of squares: 616.5 + 5136.4 = 5752.9. The total degrees of freedom is the sum of the model and residual degrees of freedom: 1 + 98 = 99. Then the total mean square is 5752.9 / 99 = 58.11. Since this is an estimate of the outcome variable's variance, we could also have computed the sample variance of the outcome variable, gpa
, using the var()
function.
keith %>%
  summarize(V_gpa = var(gpa))
# A tibble: 1 × 1
V_gpa
<dbl>
1 58.1
The total mean square, or variance estimate, is also the mean square estimate of the residuals from the intercept-only model.
# Fit intercept-only model
lm.0 = lm(gpa ~ 1, data = keith)
# ANOVA decomposition
anova(lm.0)
Analysis of Variance Table
Response: gpa
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 99 5752.9 58.11
Remember that the residuals from the intercept-only model are the deviations of the observations from the marginal mean of GPA, so its sum of squared residuals is $\sum (Y_i - \bar{Y})^2 = 5752.9$. Dividing this by its 99 residual degrees of freedom gives the mean square of 58.11, the same variance estimate we obtained from var().
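We can confirm this with the residuals from lm.0; a minimal sketch:

# Sum of squared residuals from the intercept-only model
sum(resid(lm.0) ^ 2)

# Mean square (variance estimate): sum of squares divided by residual df
sum(resid(lm.0) ^ 2) / df.residual(lm.0)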
Lastly, we point out that in simple regression models (models with only one predictor), the result of the model-level inference (i.e., the p-value) is exactly the same as that for the coefficient-level inference for the slope.
# Model-level inference
glance(lm.a)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.107 0.0981 7.24 11.8 0.000885 1 -339. 684. 691.
deviance df.residual nobs
<dbl> <int> <int>
1 5136. 98 100
# Coefficient-level inference
tidy(lm.a)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 74.3 1.94 38.3 1.01e-60
2 homework 1.21 0.354 3.43 8.85e- 4
That is because the model is composed of a single predictor, so asking whether the model accounts for variation in GPA is the same as asking whether GPA differs, on average, between students whose time spent on homework differs by one hour. Once we have multiple predictors in the model, the model-level results and the predictor-level results will no longer be the same.
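One way to see this equivalence is that, in a simple regression model, the model-level F-statistic is the square of the t-statistic for the slope. A quick check using the output above:

# The slope's t-value squared equals the model-level F-value
tidy(lm.a)$statistic[2] ^ 2    # t-value for homework, squared
glance(lm.a)$statistic         # Model-level F-value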
Re-consider our thought experiment. Again, imagine you have a population that is infinitely large. The observations in this population have two attributes, call them X and Y. Now imagine drawing a random sample of size n from this population, regressing Y on X, and recording the fitted regression line. Then imagine repeating this sampling and fitting process many, many times.
Figure 15: Thought experiment for sampling samples of size n from the population to obtain the fitted regression line.
Now, imagine superimposing all of these lines on the same plot.
Figure 17: Plot showing the fitted regression lines for many, many random samples of size n.
Examining where the sampled lines fall gives a visual interpretation of the uncertainty in the model. This two-dimensional display of uncertainty is referred to as a confidence envelope. In practice we estimate the uncertainty from the sample data and plot it around the fitted line from the sample.
For simple regression models, we can plot this directly by including the geom_smooth()
layer in our plot. This adds a smoother to the plot. To add the fitted simple regression line, we use the argument method="lm"
. This will add the fitted regression line and confidence envelope to the plot based on fitting a linear model to the variables included in the x=
and y=
arguments in the aesthetic mapping defined in aes()
. The color of the fitted line and of the confidence envelope can be set using color=
and fill=
respectively.
# Create plot
ggplot(data = keith, aes(x = homework, y = gpa)) +
geom_smooth(method = "lm", color = "#c62f4b", fill = "#696969") +
xlab("Time spent on homework") +
ylab("GPA (on a 100-pt. scale)") +
theme_bw()
Figure 16: GPA plotted as a function of time spent on homework. The OLS regression line (raspberry) and confidence envelope (grey shaded area) are also displayed.
Note that the figure caption should point out the confidence envelope or otherwise reference the uncertainty displayed in the plot. As we pointed out, the confidence envelope indicates uncertainty by displaying the sampling variation associated with the location of the fitted regression line.
We can also use this plot to make inferences about the mean GPA for a given amount of time spent on homework. For example, the fitted line suggests that the mean GPA for students who spend 6 hours a week on homework is approximately 81.6 (the fitted value at X = 6 is 74.3 + 1.21(6) = 81.6).
However, we also now understand that there is uncertainty associated with estimates obtained from sample data. How much uncertainty is there in that estimate of 81.6? We can use the bounds of the confidence envelope at X = 6 to quantify that uncertainty.
Figure 18: GPA plotted as a function of time spent on homework. The OLS regression line (raspberry) and confidence envelope (grey shaded area) are also displayed. The fitted value at X=6 is displayed as a point and the uncertainty in the estimate is displayed as an error bar.
This uncertainty estimate is technically a 95% confidence interval for the mean GPA for students who spend 6 hours each week on homework. As such, a more formal interpretation is:
With 95% confidence, the mean GPA for students who spend 6 hours each week on homework is between 80 and 83.
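Rather than reading these bounds off the plot, we could also compute the interval directly with predict(); a minimal sketch:

# 95% CI for the mean GPA of students who spend 6 hours a week on homework
predict(lm.a, newdata = data.frame(homework = 6), interval = "confidence")

The fit column gives the estimate of the mean, and the lwr and upr columns give the bounds displayed in the error bar.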
Notice that there is more uncertainty in the mean GPA for some values of X than for others; the confidence envelope is narrowest near the mean amount of time spent on homework and widens as X moves toward the extremes of the predictor.
Note: The confidence envelope can be omitted from the plot by including the argument se=FALSE in geom_smooth().