There are two ways to use the lm(y~x)
command to create a linear model object. As an example we use the mtcars
data set.
fitted_model_syntax1 = lm(mtcars$mpg ~ mtcars$disp) # just vectors
fitted_model_syntax2 = lm(mpg ~ disp, data = mtcars) # preferred way for readability
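As a quick check (a sketch, not part of the original code), the two syntaxes fit the same model, so their coefficient estimates agree:

```r
# Both calls fit the same simple linear regression of mpg on disp.
m1 = lm(mtcars$mpg ~ mtcars$disp)   # vector syntax
m2 = lm(mpg ~ disp, data = mtcars)  # data-frame syntax
all.equal(unname(coef(m1)), unname(coef(m2)))
## [1] TRUE
```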
In order to do statistical inference for linear regression, we need to check the model assumptions. See this website from Boston University for a reminder of the model assumptions and how to check them. Since this isn’t a regression course, let’s just look at two diagnostic plots: residuals versus fitted values, and the QQ plot. Read only the model assumptions and the short discussion of those two plots on that website.
The assumptions for linear regression do not appear to be satisfied for the regression of mpg
onto disp
in the example below, so inferences (standard errors and \(p\)-values) will not be reliable.
plot(fitted_model_syntax2, which = 1:2)
fitted_model = lm(mpg ~ wt + disp + drat, data = mtcars)
summary(fitted_model)
##
## Call:
## lm(formula = mpg ~ wt + disp + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2342 -2.3719 -0.3148 1.6315 6.2820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.043257 7.099792 4.372 0.000154 ***
## wt -3.172482 1.217157 -2.606 0.014495 *
## disp -0.016389 0.009578 -1.711 0.098127 .
## drat 0.843965 1.455051 0.580 0.566537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.951 on 28 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.7603
## F-statistic: 33.78 on 3 and 28 DF, p-value: 1.92e-09
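When the model assumptions do hold, the standard errors above can also be turned into confidence intervals for the coefficients. A brief sketch using base R’s confint() (this function is not used in the original notes):

```r
# Refit the model from above, then compute 95% confidence intervals
# for the intercept and each slope coefficient.
fitted_model = lm(mpg ~ wt + disp + drat, data = mtcars)
confint(fitted_model, level = 0.95)
```

Each row of the output gives the lower and upper endpoints of the interval for one coefficient.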
The predict
function takes a fitted model and a data frame of predictors and returns a vector of predicted values, one for each row of the data frame. What is a prediction? A prediction is the output of the model function evaluated on a row of inputs. See pages 300-301 of Paul Teetor’s R Cookbook for how to use the predict
function.
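For instance, here is a sketch of predict() applied to the model fit above (the new car values below are made up for illustration):

```r
# Refit the model from above.
fitted_model = lm(mpg ~ wt + disp + drat, data = mtcars)
# One hypothetical car per row; the column names must match the predictors.
new_cars = data.frame(wt = c(2.5, 3.5), disp = c(150, 300), drat = c(3.9, 3.1))
predict(fitted_model, newdata = new_cars)  # one predicted mpg per row
```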
ggfortify Package to Plot Density Functions for Common Continuous Random Variables
See the website for an introduction to the ggfortify
package and the ggdistribution
command. The syntax is similar to ggplot, and we use the standard R commands for density names, which begin with d
.
# install.packages("ggfortify")
library(ggfortify)
## Loading required package: ggplot2
ggdistribution(dnorm, seq(-3.5, 3.5, 0.1), mean = 0, sd = 1) +
  ggtitle("Standard Normal Density")
Let’s plot the density functions for several common continuous random variables. We use the function grid.arrange()
from the package gridExtra
in order to make several plots in a grid.
# install.packages("gridExtra")
library(gridExtra)
g1 = ggdistribution(dnorm, seq(-3.5, 3.5, 0.1), mean = 0, sd = 1) +
  ggtitle("Standard Normal Density")
g2 = ggdistribution(dexp, seq(-.5, 4, 0.1), rate = 1) +
  ggtitle("Exponential Density\nwith Rate 1")
g3 = ggdistribution(dunif, seq(-3.5, 3.5, 0.1), min = -.5, max = 1) +
  ggtitle("Uniform Density on [-.5,1]")
g4 = ggdistribution(dchisq, seq(-1, 12, 0.1), df = 4) +
  ggtitle("Chi-Squared Density\nwith 4 Degrees of Freedom")
grid.arrange(g1, g2, g3, g4, nrow = 2)
See pages 186 - 190 of Paul Teetor’s R Cookbook.
plot
Command to Plot Probability Mass Functions for Common Discrete Random VariablesA discrete random variable is a random variable that takes on only finitely many or only countably many values.
Recall that the probability mass function (pmf) of a discrete random variable \(X\) is the function \[f\colon \text{image}(X) \to [0,1]\] defined by the formula \[f(x) = P(X=x).\] By default, the function \(f\) is defined to be zero outside of \(\text{image}(X)\).
Let’s plot the binomial pmf with \(n=2\) and success probability .5.
pmf_inputs = 0:2
pmf_values = dbinom(pmf_inputs, size = 2, prob = .5)
plot(pmf_inputs, pmf_values, type="h", col="blue",
main="Binomial PMF with n=2 and Success Probability .5",
xlab= "x",
ylab= "pmf")
Exercise: Plot the Poisson pmf with mean 2. Limit the inputs to 0 through 8 for readability. The command for the Poisson pmf is dpois
and the parameter for the mean is lambda
.
Recall that the cumulative distribution function (cdf) of a random variable is the function \(F \colon \mathbb{R} \to \mathbb{R}\) defined by \[F(x) = P(X \leq x).\] In R, the prefix p
is used for the cdf. \[\texttt{prootname}(x,...)=P(X \leq x).\]
We can use this to find the probability that \(X\) falls in a half-open interval. \[P(a<X\leq b)=P(X\leq b) - P(X \leq a) = \texttt{prootname}(b,...)-\texttt{prootname}(a,...).\] For continuous random variables the strictness of the inequality doesn’t matter: \(P(a<X\leq b)=P(a\leq X\leq b)\), and similarly on the other side. But for discrete random variables the strictness of the inequality matters: in general \(P(a<X\leq b)\neq P(a\leq X\leq b)\), since the two sides differ whenever \(P(X=a)>0\).
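To make the distinction concrete, here is a small check (a sketch) contrasting the standard normal with the binomial random variable pictured above:

```r
# Continuous case: the standard normal assigns zero probability to any single
# point, so P(0 < X <= 1) = P(0 <= X <= 1).
pnorm(1) - pnorm(0)
## [1] 0.3413447

# Discrete case: for X ~ Binomial(2, .5), P(X = 0) = .25 > 0, so the
# strictness of the inequality changes the answer.
pbinom(1, size = 2, prob = .5) - pbinom(0, size = 2, prob = .5)  # P(0 < X <= 1)
## [1] 0.5
pbinom(1, size = 2, prob = .5)                                   # P(0 <= X <= 1)
## [1] 0.75
```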
For a discrete random variable \(X\), we have \[P(X \leq x) = \sum_{x' \leq x} f(x').\] Example: Suppose \(X\) is a binomial random variable with \(n=2\) and success probability .5. The pmf of \(X\) is pictured above. Find the probability \(P(0<X \leq 2)\) using R, and using the previous sum formula together with the pmf graph above.
pbinom(2,size=2,prob=.5) - pbinom(0, size=2,prob=.5)
## [1] 0.75
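We can check this against the sum formula, \(P(0<X\leq 2)=f(1)+f(2)\), reading the pmf values off the graph above:

```r
# f(1) + f(2) = .5 + .25 for the Binomial(2, .5) pmf
dbinom(1, size = 2, prob = .5) + dbinom(2, size = 2, prob = .5)
## [1] 0.75
```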
Exercise: Suppose \(X\) is a binomial random variable with \(n=20\) and success probability .3. Find the probability \(P(4<X \leq 13)\) using R.
See the function, loop, and histogram on the website of Professor Matthew Stephens of the University of Chicago.
Download the files from Canvas.