There are two ways to use the lm(y~x)
command to create a linear model object. As an example we use the mtcars
data set.
fitted_model_syntax1 = lm(mtcars$mpg ~ mtcars$disp) # just vectors
fitted_model_syntax2 = lm(mpg ~ disp, data = mtcars) # preferred way for readability
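As a quick check (a sketch, not part of the original code), the two syntaxes fit the same model, so their coefficient estimates agree:

```r
# Both calls fit the same simple linear regression of mpg on disp.
m1 = lm(mtcars$mpg ~ mtcars$disp)   # vector syntax
m2 = lm(mpg ~ disp, data = mtcars)  # data-frame syntax
all.equal(unname(coef(m1)), unname(coef(m2)))
## [1] TRUE
```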
In order to do statistical inference for linear regression, we need to check the model assumptions. See this website from Boston University for a reminder of the model assumptions and how to check them. Since this isn’t a regression course, let’s just look at two diagnostic plots: residuals versus fitted values, and the QQ plot. Read only the model assumptions and the short discussion of those two plots on that website.
The assumptions for linear regression do not appear to be satisfied for the regression of mpg
onto disp
in the example below, so inferences (standard errors and \(p\)-values) will not be reliable.
plot(fitted_model_syntax2, which = 1:2)
fitted_model = lm(mpg ~ wt + disp + drat, data = mtcars)
summary(fitted_model)
##
## Call:
## lm(formula = mpg ~ wt + disp + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2342 -2.3719 -0.3148 1.6315 6.2820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.043257 7.099792 4.372 0.000154 ***
## wt -3.172482 1.217157 -2.606 0.014495 *
## disp -0.016389 0.009578 -1.711 0.098127 .
## drat 0.843965 1.455051 0.580 0.566537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.951 on 28 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.7603
## F-statistic: 33.78 on 3 and 28 DF, p-value: 1.92e-09
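When the model assumptions do hold, the standard errors above can also be turned into confidence intervals for the coefficients. A brief sketch using base R’s confint() (this function is not used in the original notes):

```r
# Refit the model from above, then compute 95% confidence intervals
# for the intercept and each slope coefficient.
fitted_model = lm(mpg ~ wt + disp + drat, data = mtcars)
confint(fitted_model, level = 0.95)
```

Each row of the output gives the lower and upper endpoints of the interval for one coefficient.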
The predict
function takes a fitted model and a data frame of predictors and returns a vector of predicted values, one for each row of the data frame. What is a prediction? A prediction is the output of the model function evaluated on a row of inputs. See pages 300-301 of Paul Teetor’s R Cookbook for how to use the predict
function.
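For instance, here is a sketch of predict() applied to the model fit above (the new car values below are made up for illustration):

```r
# Refit the model from above.
fitted_model = lm(mpg ~ wt + disp + drat, data = mtcars)
# One hypothetical car per row; the column names must match the predictors.
new_cars = data.frame(wt = c(2.5, 3.5), disp = c(150, 300), drat = c(3.9, 3.1))
predict(fitted_model, newdata = new_cars)  # one predicted mpg per row
```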
ggfortify Package to Plot Density Functions for Common Continuous Random Variables
See the website for an introduction to the ggfortify
package and the ggdistribution
command. The syntax is similar to ggplot, and we use the standard R commands for density names, which begin with d
.
# install.packages("ggfortify")
library(ggfortify)
## Loading required package: ggplot2
ggdistribution(dnorm, seq(-3.5, 3.5, 0.1), mean = 0, sd = 1) +
  ggtitle("Standard Normal Density")
Let’s plot the density functions for several common continuous random variables. We use the function grid.arrange()
from the package gridExtra
in order to make several plots in a grid.
# install.packages("gridExtra")
library(gridExtra)
g1 = ggdistribution(dnorm, seq(-3.5, 3.5, 0.1), mean = 0, sd = 1) +
  ggtitle("Standard Normal Density")
g2 = ggdistribution(dexp, seq(-.5, 4, 0.1), rate = 1) +
  ggtitle("Exponential Density\nwith Rate 1")
g3 = ggdistribution(dunif, seq(-3.5, 3.5, 0.1), min = -.5, max = 1) +
  ggtitle("Uniform Density on [-.5,1]")
g4 = ggdistribution(dchisq, seq(-1, 12, 0.1), df = 4) +
  ggtitle("Chi-Squared Density\nwith 4 Degrees of Freedom")
grid.arrange(g1, g2, g3, g4, nrow = 2)
See pages 186 - 190 of Paul Teetor’s R Cookbook.
plot
Command to Plot Probability Mass Functions for Common Discrete Random VariablesA discrete random variable is a random variable that takes on only finitely many or only countably many values.
Recall that the probability mass function (pmf) of a discrete random variable \(X\) is the function \[f\colon \text{image}(X) \to [0,1]\] defined by the formula \[f(x) = P(X=x).\] By default, the function \(f\) is defined to be zero outside of \(\text{image}(X)\).
Let’s plot the binomial pmf with \(n=2\) and success probability .5.
pmf_inputs = 0:2
pmf_values = dbinom(pmf_inputs, size = 2, prob = .5)
plot(pmf_inputs, pmf_values, type="h", col="blue",
main="Binomial PMF with n=2 and Success Probability .5",
xlab= "x",
ylab= "pmf")
Exercise: Plot the Poisson pmf with mean 2. Limit the inputs to 0 through 8 for readability. The command for the Poisson pmf is dpois
and the parameter for the mean is lambda
.
Recall that the cumulative distribution function (cdf) of a random variable is the function \(F \colon \mathbb{R} \to \mathbb{R}\) defined by \[F(x) = P(X \leq x).\] In R, the prefix p
is used for the cdf. \[\texttt{prootname}(x,...)=P(X \leq x).\]
We can use this to find the probability that \(X\) falls in a half-open interval. \[P(a<X\leq b)=P(X\leq b) - P(X \leq a) = \texttt{prootname}(b,...)-\texttt{prootname}(a,...).\] For continuous random variables the strictness of the inequality doesn’t matter: \(P(a<X\leq b)=P(a\leq X\leq b)\), and similarly on the other side. But for discrete random variables the strictness of the inequality matters: in general \(P(a<X\leq b)\neq P(a\leq X\leq b)\), since the two sides differ whenever \(P(X=a)>0\).
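To make the distinction concrete, here is a small check (a sketch) contrasting the standard normal with the binomial random variable pictured above:

```r
# Continuous case: the standard normal assigns zero probability to any single
# point, so P(0 < X <= 1) = P(0 <= X <= 1).
pnorm(1) - pnorm(0)
## [1] 0.3413447

# Discrete case: for X ~ Binomial(2, .5), P(X = 0) = .25 > 0, so the
# strictness of the inequality changes the answer.
pbinom(1, size = 2, prob = .5) - pbinom(0, size = 2, prob = .5)  # P(0 < X <= 1)
## [1] 0.5
pbinom(1, size = 2, prob = .5)                                   # P(0 <= X <= 1)
## [1] 0.75
```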
For a discrete random variable \(X\), we have \[P(X \leq x) = \sum_{x' \leq x} f(x').\] Example: Suppose \(X\) is a binomial random variable with \(n=2\) and success probability .5. The pmf of \(X\) is pictured above. Find the probability \(P(0<X \leq 2)\) using R, and using the previous sum formula together with the pmf graph above.
pbinom(2,size=2,prob=.5) - pbinom(0, size=2,prob=.5)
## [1] 0.75
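We can check this against the sum formula, \(P(0<X\leq 2)=f(1)+f(2)\), reading the pmf values off the graph above:

```r
# f(1) + f(2) = .5 + .25 for the Binomial(2, .5) pmf
dbinom(1, size = 2, prob = .5) + dbinom(2, size = 2, prob = .5)
## [1] 0.75
```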
Exercise: Suppose \(X\) is a binomial random variable with \(n=20\) and success probability .3. Find the probability \(P(4<X \leq 13)\) using R.
See the function, loop, and histogram on the website of Professor Matthew Stephens of the University of Chicago.
Download the files from Canvas.