Homework

Matrices

A matrix in R is a vector with a dimension attribute: the number of rows and the number of columns.

Matrix Creation

Thus, to define a matrix, we supply a vector and indicate the number of rows and columns. By default, the vector fills the matrix column by column.

A = matrix(10:15,nrow=3,ncol=2)
A # notice the matrix is filled in column by column!
##      [,1] [,2]
## [1,]   10   13
## [2,]   11   14
## [3,]   12   15
dim(A) # the dimension command finds the number of rows and columns, in that order
## [1] 3 2

R does the arithmetic (vector length) = (number of rows) × (number of columns) for us, so it is sufficient to specify the vector and either the number of rows or the number of columns.

matrix(10:15,nrow=3) # left out ncol=2 and we get same matrix as above
##      [,1] [,2]
## [1,]   10   13
## [2,]   11   14
## [3,]   12   15
matrix(10:15,ncol=2) # left out nrow=3 and we get same matrix as above
##      [,1] [,2]
## [1,]   10   13
## [2,]   11   14
## [3,]   12   15

If we want, we can fill in the matrix row by row (instead of column by column) by supplying the optional argument byrow=TRUE.

B = matrix(10:15, nrow=3, ncol=2, byrow=TRUE)
B # notice this matrix is filled row by row, and is different from the matrix A above!
##      [,1] [,2]
## [1,]   10   11
## [2,]   12   13
## [3,]   14   15

Warning: An R vector is neither an \(n\times 1\) matrix nor a \(1 \times n\) matrix!

is.matrix(c(1,2,3))
## [1] FALSE
dim(c(1,2,3))
## NULL

To use an R vector as an \(n\times 1\) matrix or a \(1 \times n\) matrix, we must first turn the vector into a matrix with matrix() as introduced above. Alternatively, we can set the dimension attribute directly as follows (recall that a matrix in R is a vector with a dimension attribute).

x=c(1,2,3) # first make x a vector
dim(x) = c(1,3) # now we give x a dimension attribute to make it a row vector
x           # we see x is now a row vector
##      [,1] [,2] [,3]
## [1,]    1    2    3
dim(x) = c(3,1) # let's change the dimension attribute to make x a column vector
x # we see x is now a column vector
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3

We can similarly create the matrix A from above by first making a vector and then giving it a dimension attribute.

A_alternative = c(10,11,12,13,14,15) # first it is a vector
dim(A_alternative) = c(3,2) # then it becomes matrix
A_alternative # we see the same as A above
##      [,1] [,2]
## [1,]   10   13
## [2,]   11   14
## [3,]   12   15

Accessing Entries and Submatrices of a Matrix

To index into a matrix, we use a double index.

A[3,2] # this accesses the entry in row 3 column 2
## [1] 15

To obtain an entire row or column as a vector, we leave one index blank. The blank index is the one that “runs.”

A # recall A from before
##      [,1] [,2]
## [1,]   10   13
## [2,]   11   14
## [3,]   12   15
A[ ,1] # obtains the first column as a vector
## [1] 10 11 12
A[1, ] # obtains the first row as a vector
## [1] 10 13
is.matrix(A[ ,1]) # notice the first column is not a matrix!
## [1] FALSE
is.matrix(A[1, ]) # notice the first row is not a matrix!
## [1] FALSE

We can similarly select submatrices using vectors in the double index.

A # Recall A from before
##      [,1] [,2]
## [1,]   10   13
## [2,]   11   14
## [3,]   12   15
A[c(1,3),1:2] # this is the submatrix consisting of the 4 corners
##      [,1] [,2]
## [1,]   10   13
## [2,]   12   15

Warning: When you index into a matrix to obtain something one-dimensional (e.g. an entire row or column, or just part of a row or column), then R will return a vector rather than a submatrix! If you want to keep the extracted part as a submatrix rather than a vector, include the argument drop=FALSE. See pages 80 and 81 of Matloff for more on unwanted dimension reduction.

A[1,1:2] # let's obtain the submatrix of top two corners, confusingly, it is actually a vector
## [1] 10 13
is.matrix(A[1,1:2])
## [1] FALSE
A[1,1:2,drop=FALSE] # include drop=FALSE to make the top 2 corners be a submatrix, we see it prints like a matrix
##      [,1] [,2]
## [1,]   10   13
is.matrix(  A[1,1:2,drop=FALSE]  ) # indeed, it is a matrix
## [1] TRUE

Linear Algebra Commands for Matrices

We can do all the standard linear algebra for numerical matrices in R.

  • Matrix Multiplication: A %*% B

  • Scalar Multiplication: 5*A

  • Matrix Addition: A + B

  • Matrix Transposition: t(A)

  • Determinant of a Square Matrix: det(A)

  • Inverse of an Invertible Square Matrix: solve(A)

  • Eigenvalues and Eigenvectors of a Square Matrix: eigen(A)
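
As a minimal sketch of these commands, consider the following. The 2 x 2 matrices P and Q are hypothetical examples chosen only for illustration, and A and B are redefined as above so that the block runs on its own. Note that A %*% B requires the number of columns of A to equal the number of rows of B, so the 3 x 2 matrices A and B from earlier cannot be multiplied directly, while t(A) %*% B works.

```r
# P and Q are small illustrative matrices (not from the lecture)
P = matrix(c(2, 1, 1, 3), nrow = 2)  # rows are (2, 1) and (1, 3)
Q = matrix(1:4, nrow = 2)            # rows are (1, 3) and (2, 4)

P %*% Q      # matrix product, rows are (4, 10) and (7, 15)
5 * P        # scalar multiplication
P + Q        # entrywise sum
t(Q)         # transpose, rows are (1, 2) and (3, 4)
det(P)       # 2*3 - 1*1 = 5
P_inverse = solve(P)
P_inverse %*% P   # the identity matrix, up to rounding error
eigen(P)     # a list with components values and vectors

# the 3 x 2 matrices from earlier in the lecture
A = matrix(10:15, nrow = 3, ncol = 2)
B = matrix(10:15, nrow = 3, ncol = 2, byrow = TRUE)
# A %*% B would stop with a non-conformable arguments error
dim(t(A) %*% B)   # (2 x 3) times (3 x 2) gives a 2 x 2 matrix
```

Be careful: A * B with a plain asterisk multiplies entry by entry and is not the matrix product.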

The apply Function for Evaluating a Function on the Rows or Columns of a Matrix

Suppose we have a function f() which takes a vector as input and returns a scalar as output. For instance, f() could be mean(), median(), var(), sd(), max(), min(), …

The apply function allows us to apply the function f() to the individual rows of a matrix and concatenate the outputs f(row_i) into a vector.

Similarly, by changing one parameter, the apply function allows us to apply the function f() to the individual columns of a matrix and concatenate the outputs f(column_j) into a vector.

Example: For the following matrix, find the median of each row and column and put them into two vectors. Recall that when we have an odd number of elements arranged in order, the median is the middle number. But if we have an even number of elements, the median is the average of the two middle numbers.

M = matrix(c(1,2,3,5,6,4),nrow=2,ncol=3,byrow=TRUE)
M
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    5    6    4
apply(M, 1, median) # the 1 means apply to the rows
## [1] 2 5
apply(M, 2, median) # the 2 means apply to the columns
## [1] 3.0 4.0 3.5

Notice the output is a vector!

The apply function also works when f() is vector-valued, but then the output of the apply function is a matrix. You need to consider whether you want the resulting matrix or its transpose. In such a situation, try it out and see what you get.
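
Here is a small sketch of the vector-valued case, using sort() as our choice of f() (an illustrative choice, not part of the lecture). Each sorted row comes back as a column of the result, so the output is the transpose of what we might expect.

```r
M = matrix(c(1, 2, 3, 5, 6, 4), nrow = 2, ncol = 3, byrow = TRUE)  # same M as above
sorted_rows = apply(M, 1, sort)  # sort() returns a vector of length 3 for each row
sorted_rows       # column i holds the sorted row i of M
dim(sorted_rows)  # 3 2, although M itself is 2 x 3
t(sorted_rows)    # transpose to get the sorted rows back as rows, a 2 x 3 matrix
```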

Other Useful Matrix Commands

cbind() and rbind()

colSums() and rowSums()

colMeans() and rowMeans()
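
A brief sketch of these commands on the matrix A from above. The names A_wide and A_tall are hypothetical, introduced only for this illustration.

```r
A = matrix(10:15, nrow = 3, ncol = 2)  # the matrix A from above
A_wide = cbind(A, c(1, 2, 3))  # bind a new column on the right, giving 3 x 3
A_tall = rbind(A, c(0, 0))     # bind a new row at the bottom, giving 4 x 2
dim(A_wide)
dim(A_tall)
colSums(A)   # 33 42
rowSums(A)   # 23 25 27
colMeans(A)  # 11 14
rowMeans(A)  # 11.5 12.5 13.5
```

Here colSums(A) gives the same numbers as apply(A, 2, sum), and rowMeans(A) the same as apply(A, 1, mean).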

Download the In-Class Exercise File

Lecture 2 In-Class Exercises

Lists

A list is like a vector, but allows entry types to be mixed, and even allows nested multi-level structure. Let’s create an example of a list.

student = list(firstname='Mary', entrance_year=2019, honors=TRUE)
student # notice it prints differently from a vector, here the name of each entry is indicated $entryname
## $firstname
## [1] "Mary"
## 
## $entrance_year
## [1] 2019
## 
## $honors
## [1] TRUE
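
The list above is flat. To illustrate the nested multi-level structure, here is a sketch in which one entry is itself a list. The name student_nested and the course scores are hypothetical, made up for this illustration.

```r
# student_nested and its entries are hypothetical, for illustration only
student_nested = list(
  firstname = 'Mary',
  honors = TRUE,
  courses = list(math = c(90, 85), physics = 88)  # a list inside a list
)
student_nested  # the inner entries print as $courses$math and $courses$physics
```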

To access an element of a list and return that element with its data type, we have three options:

  1. use the dollar sign $ and the element’s name without quotes, as in listname$entryname

  2. use double brackets [[ ]] and the element’s name with quotes, as in listname[['entryname']]

  3. use double brackets [[ ]] and the element’s index number without quotes, as in listname[[5]].

student$entrance_year # notice no quotes, returns native data type 
## [1] 2019
student[['entrance_year']] # notice quotes, returns native data type
## [1] 2019
student[[2]] # notice no quotes, returns native data type 
## [1] 2019

On the other hand, to return an element as a sublist, use a single bracket [] with the element’s name in quotes, or the index number without quotes.

student['entrance_year'] # this is a sublist!
## $entrance_year
## [1] 2019
student[2] # this is a sublist
## $entrance_year
## [1] 2019

To obtain the vector of names of elements of a list, we use the function names() as we did to obtain the vector of names of a vector. The length() function also works analogously.

names(student)
## [1] "firstname"     "entrance_year" "honors"
length(student)
## [1] 3

We can add a new list element just like we did for vectors: index in with a new index and assign.

student[["graduation_year"]] = 2021
student
## $firstname
## [1] "Mary"
## 
## $entrance_year
## [1] 2019
## 
## $honors
## [1] TRUE
## 
## $graduation_year
## [1] 2021

We can turn a list into a vector using the unlist() function. This will coerce to the most flexible type.

unlist(student) # makes a character vector
##       firstname   entrance_year          honors graduation_year 
##          "Mary"          "2019"          "TRUE"          "2021"

The Function lapply for Evaluating a Function on the Elements of a List to Get Another List, and its Sister sapply to Get a Vector

We learned about the apply function above for evaluating a function f() on the rows or columns of a matrix to get a vector. The function lapply is the analogue for lists. We don’t specify 1 or 2 because there are no rows or columns.

numberlist = list(A=1:4, B=1:5, C=1:6)
numberlist
## $A
## [1] 1 2 3 4
## 
## $B
## [1] 1 2 3 4 5
## 
## $C
## [1] 1 2 3 4 5 6
lapply(numberlist, median) # outputs a list of the medians
## $A
## [1] 2.5
## 
## $B
## [1] 3
## 
## $C
## [1] 3.5
sapply(numberlist, median) # outputs a vector of the medians
##   A   B   C 
## 2.5 3.0 3.5

The s in sapply stands for simplify.

Data Frames

A data frame is a list in which the elements are vectors of equal length. We can think of it like a spreadsheet, the difference being that the column names of a data frame are not cells in the data frame. A data frame is like a matrix in that it has a 2-dimensional structure with rows and columns, but it differs from a matrix because a matrix requires all columns to have the same type, while a data frame does not.

A column name in a data frame is called a variable in the sense of statistics and the entries in a column are observations of that variable, or measurements of that variable. Notice that the statistical usage of the term variable for a column name is different from the usage of the term variable in programming!

Most data sets in R are data frames (or souped-up data frames called tibbles).

For instance, the built-in data set mtcars is a data frame.

typeof(mtcars)
## [1] "list"
class(mtcars)
## [1] "data.frame"
is.data.frame(mtcars)
## [1] TRUE
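
To make the description of a data frame as a list of equal-length vectors concrete, here is a minimal sketch that builds a small data frame by hand with data.frame(). The data frame students and its entries are hypothetical, made up for this illustration.

```r
# students is a hypothetical data frame: three columns of different types
students = data.frame(
  firstname = c('Mary', 'John', 'Ana'),
  entrance_year = c(2019, 2020, 2019),
  honors = c(TRUE, FALSE, TRUE)
)
students
is.list(students)        # TRUE, a data frame is a list of columns
dim(students)            # 3 3, rows first and then columns, as for a matrix
sapply(students, class)  # the columns have different types
# in R 4.0 and later firstname is character; older versions convert it to a factor by default
```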

Useful Functions for Data Frames

# View(mtcars) # View with capital V to view it
names(mtcars) # names() returns the vector of names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
head(mtcars)  # head() prints the first few rows 
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

That’s really difficult to read, so we can use the function knitr::kable() to improve readability in a knitted file.

knitr::kable(  head(mtcars)  ) # head() prints the first few rows in table form
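|                   |  mpg| cyl| disp|  hp| drat|    wt|  qsec| vs| am| gear| carb|
|:------------------|----:|---:|----:|---:|----:|-----:|-----:|--:|--:|----:|----:|
|Mazda RX4          | 21.0|   6|  160| 110| 3.90| 2.620| 16.46|  0|  1|    4|    4|
|Mazda RX4 Wag      | 21.0|   6|  160| 110| 3.90| 2.875| 17.02|  0|  1|    4|    4|
|Datsun 710         | 22.8|   4|  108|  93| 3.85| 2.320| 18.61|  1|  1|    4|    1|
|Hornet 4 Drive     | 21.4|   6|  258| 110| 3.08| 3.215| 19.44|  1|  0|    3|    1|
|Hornet Sportabout  | 18.7|   8|  360| 175| 3.15| 3.440| 17.02|  0|  0|    3|    2|
|Valiant            | 18.1|   6|  225| 105| 2.76| 3.460| 20.22|  1|  0|    3|    1|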
knitr::kable(  tail(mtcars)  )  # tail() prints the last few rows in table form
|               |  mpg| cyl|  disp|  hp| drat|    wt| qsec| vs| am| gear| carb|
|:--------------|----:|---:|-----:|---:|----:|-----:|----:|--:|--:|----:|----:|
|Porsche 914-2  | 26.0|   4| 120.3|  91| 4.43| 2.140| 16.7|  0|  1|    5|    2|
|Lotus Europa   | 30.4|   4|  95.1| 113| 3.77| 1.513| 16.9|  1|  1|    5|    2|
|Ford Pantera L | 15.8|   8| 351.0| 264| 4.22| 3.170| 14.5|  0|  1|    5|    4|
|Ferrari Dino   | 19.7|   6| 145.0| 175| 3.62| 2.770| 15.5|  0|  1|    5|    6|
|Maserati Bora  | 15.0|   8| 301.0| 335| 3.54| 3.570| 14.6|  0|  1|    5|    8|
|Volvo 142E     | 21.4|   4| 121.0| 109| 4.11| 2.780| 18.6|  1|  1|    4|    2|
str(mtcars) # str() tells us about the structure, and lists the first measurements, the command str() applies to many other kinds of R objects as well
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars) # summary() gives us basic descriptive statistics, and helps us discover if a variable is incorrectly coded as continuous/factor variable, or if someone erroneously typed 0 for NA, or other problems
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

We see that the number of cylinders should probably be coded as a factor, rather than as a continuous variable, because fractional cylinders don’t make physical sense. Let’s recode it as a factor and see how the summary changes.

mtcars_new = mtcars # first define new as old
mtcars_new$cyl = as.factor(mtcars_new$cyl) # change the cylinders column into a factor variable
summary(mtcars_new) # notice the summary for cyl now tells us how many cars have 4, 6, 8 cylinders in the data set 
##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec             vs               am        
##  Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :3.325   Median :17.71   Median :0.0000   Median :0.0000  
##  Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062  
##  3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000  
##       gear            carb      
##  Min.   :3.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000  
##  Median :4.000   Median :2.000  
##  Mean   :3.688   Mean   :2.812  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :8.000

We fixed cylinders. Do you see any other variables that should be recoded?

Two functions are useful for an initial look at the data: pairs and boxplot.

pairs(mtcars) # pairs() makes pairwise scatterplots of all variables

What can you say about mpg versus weight? Scatterplots involving factor variables are not informative, so really we should only apply pairs() to the continuous variables.
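
One way to restrict pairs() to the continuous variables is sketched below. Which columns count as continuous is our own judgment call here (an assumption, not a fixed rule), and the name continuous_vars is hypothetical. Single-bracket indexing with a vector of names returns a sublist, which for a data frame is again a data frame.

```r
# our assumed choice of continuous variables in mtcars
continuous_vars = c('mpg', 'disp', 'hp', 'drat', 'wt', 'qsec')
mtcars_continuous = mtcars[continuous_vars]  # sub data frame with only these columns
pairs(mtcars_continuous)  # pairwise scatterplots of the continuous variables only
```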

For plotting a continuous variable against a categorical one, box plots are more informative.

boxplot( mtcars$mpg ~ mtcars$cyl,
         main = "Boxplots of Fuel Economy versus Cylinders",
         xlab = "Number of Cylinders",
         ylab = "Fuel Economy in mpg")

This completes a short survey of useful commands for a data frame. A proper initial exploration of a data frame involves much more, under the heading of exploratory data analysis. We’ll return to that later.

Extracting an Individual Column as a Vector

Since a data frame is a list of vectors with the same length, and these vectors are the columns of the data frame, we can extract a column of a data frame in the same ways we extract an element of a list.

To access a column of a data frame as a vector, we have three options (and more):

  1. use the dollar sign $ and the column’s name without quotes, as in dataframename$columnname

  2. use double brackets [[ ]] and the column’s name with quotes, as in dataframename[['columnname']]

  3. use double brackets [[ ]] and the column’s index number without quotes, as in dataframename[[5]].

Additionally, we can use matrix-like double indices with single brackets to access a column of a data frame as a vector.

  4. dataframename[ ,'columnname']

  5. dataframename[ ,5]
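
As a quick check, the sketch below extracts the hp column of mtcars in all five ways listed above and confirms that the results agree. The variable names are hypothetical, chosen for this illustration.

```r
by_dollar       = mtcars$hp        # dollar sign, name without quotes
by_name         = mtcars[['hp']]   # double brackets, name with quotes
by_position     = mtcars[[4]]      # double brackets, hp is the 4th column of mtcars
by_matrix_name  = mtcars[ , 'hp']  # matrix-like double index with the name
by_matrix_index = mtcars[ , 4]     # matrix-like double index with the column number

identical(by_dollar, by_name)
identical(by_dollar, by_position)
identical(by_dollar, by_matrix_name)
identical(by_dollar, by_matrix_index)
is.data.frame(by_dollar)  # FALSE, each of these is a plain vector
```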

Extracting a 2-Dimensional Sub Data Frame as a Data Frame using Matrix Notation

We can use the matrix notation in 4. and 5. above to extract a 2-dimensional subarray, which is automatically a data frame.

mtcars[10:12,c('mpg','hp')]
##             mpg  hp
## Merc 280   19.2 123
## Merc 280C  17.8 123
## Merc 450SE 16.4 180
is.data.frame( mtcars[10:12,c('mpg','hp')] )
## [1] TRUE

Extracting a 1-dimensional Sub Data Frame with Matrix Notation and drop=FALSE to Make a Data Frame

Earlier we discussed 5 ways to obtain a column (or part of a column) as a vector. But how do we obtain it as a data frame? Use the option drop=FALSE.

mtcars[1:5,'hp'] # this is the vector of first 5 entries in the hp column
## [1] 110 110  93 110 175
is.data.frame( mtcars[1:5,'hp'] ) # it is not a data frame
## [1] FALSE
mtcars[1:5,'hp',drop=FALSE] # including drop=FALSE makes a data frame
##                    hp
## Mazda RX4         110
## Mazda RX4 Wag     110
## Datsun 710         93
## Hornet 4 Drive    110
## Hornet Sportabout 175
is.data.frame( mtcars[1:5,'hp', drop=FALSE] )
## [1] TRUE

How to Obtain a Sub Data Frame by Filtering the Rows According to a Condition on a Column

We can extract a sub data frame by putting a column condition into the first part of the double index. Remember to use the dollar sign $!

Let’s find the sub data frame of mtcars that has fuel economy better than 25 mpg.

mtcars[mtcars$mpg>25, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
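
Conditions can also be combined, and the column part of the double index can be used at the same time. The sketch below uses the elementwise "and" operator & of base R, which has not been introduced in this lecture. The name efficient_and_strong is hypothetical.

```r
# keep only cars with mpg above 25 AND hp above 90, and only the columns mpg and hp
efficient_and_strong = mtcars[mtcars$mpg > 25 & mtcars$hp > 90, c('mpg', 'hp')]
efficient_and_strong  # Porsche 914-2 and Lotus Europa, as can be read off the table above
```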

Use $ to Add a Column to a Data Frame with a Formula in Terms of Other Columns

Let’s create a new data frame mtcars01 which has an additional column mpg01 which is 1 if a car has above average fuel economy, otherwise 0.

mtcars01 = mtcars # first define the new df to be the old df
mtcars01$mpg01 = numeric( length(mtcars$mpg) ) # append a column of zeros named mpg01
knitr::kable( head(mtcars01) ) # confirm new column by looking at a few rows
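|                   |  mpg| cyl| disp|  hp| drat|    wt|  qsec| vs| am| gear| carb| mpg01|
|:------------------|----:|---:|----:|---:|----:|-----:|-----:|--:|--:|----:|----:|-----:|
|Mazda RX4          | 21.0|   6|  160| 110| 3.90| 2.620| 16.46|  0|  1|    4|    4|     0|
|Mazda RX4 Wag      | 21.0|   6|  160| 110| 3.90| 2.875| 17.02|  0|  1|    4|    4|     0|
|Datsun 710         | 22.8|   4|  108|  93| 3.85| 2.320| 18.61|  1|  1|    4|    1|     0|
|Hornet 4 Drive     | 21.4|   6|  258| 110| 3.08| 3.215| 19.44|  1|  0|    3|    1|     0|
|Hornet Sportabout  | 18.7|   8|  360| 175| 3.15| 3.440| 17.02|  0|  0|    3|    2|     0|
|Valiant            | 18.1|   6|  225| 105| 2.76| 3.460| 20.22|  1|  0|    3|    1|     0|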
mean(mtcars$mpg) # find the average mpg for reference
## [1] 20.09062
mtcars01[ mtcars$mpg > mean(mtcars$mpg), 'mpg01' ] = 1 # change the new 0s to 1s as appropriate 
knitr::kable( head(mtcars01) ) # confirm change worked by looking at a few rows   
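|                   |  mpg| cyl| disp|  hp| drat|    wt|  qsec| vs| am| gear| carb| mpg01|
|:------------------|----:|---:|----:|---:|----:|-----:|-----:|--:|--:|----:|----:|-----:|
|Mazda RX4          | 21.0|   6|  160| 110| 3.90| 2.620| 16.46|  0|  1|    4|    4|     1|
|Mazda RX4 Wag      | 21.0|   6|  160| 110| 3.90| 2.875| 17.02|  0|  1|    4|    4|     1|
|Datsun 710         | 22.8|   4|  108|  93| 3.85| 2.320| 18.61|  1|  1|    4|    1|     1|
|Hornet 4 Drive     | 21.4|   6|  258| 110| 3.08| 3.215| 19.44|  1|  0|    3|    1|     1|
|Hornet Sportabout  | 18.7|   8|  360| 175| 3.15| 3.440| 17.02|  0|  0|    3|    2|     0|
|Valiant            | 18.1|   6|  225| 105| 2.76| 3.460| 20.22|  1|  0|    3|    1|     0|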

We have created a new data frame mtcars01 which has an additional column mpg01 which is 1 if a car has above average fuel economy, otherwise 0.
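
The same column can also be built in a single assignment, as sketched below. This is an alternative to the two-step approach above, not a replacement for it, and the name mtcars01_alt is hypothetical. The comparison produces a logical vector, which as.numeric() converts to 1 for TRUE and 0 for FALSE.

```r
mtcars01_alt = mtcars  # start from a copy of the original data frame
mtcars01_alt$mpg01 = as.numeric(mtcars$mpg > mean(mtcars$mpg))  # 1 if above average, else 0
head(mtcars01_alt$mpg01)  # 1 1 1 1 0 0, matching the first rows of the table above
```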

lapply and sapply also Apply to Data Frames, Since Each Data Frame is a List

Tip: remember lapply outputs a list, so if you wanted a data frame, convert it using as.data.frame().
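
A minimal sketch of both functions on mtcars follows. The anonymous function that doubles each column, and the names column_means, doubled, and doubled_df, are hypothetical choices for this illustration.

```r
column_means = sapply(mtcars, mean)  # a named numeric vector, one mean per column
column_means

# lapply applies the function to each column and returns a plain list
doubled = lapply(mtcars[c('mpg', 'hp')], function(column) 2 * column)
is.data.frame(doubled)  # FALSE, it is only a list

doubled_df = as.data.frame(doubled)  # convert the list back into a data frame
is.data.frame(doubled_df)  # TRUE
head(doubled_df)  # note that the car names (row names) are not carried along
```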