Read Chapters 3, 4, 5, and 6 of Matloff’s book, skipping the Extended Examples. Don’t worry about understanding everything; just focus on the essentials.
Finish up the first and second DataCamp courses.
A matrix in R is a vector with a dimension attribute: the number of rows and the number of columns.
Thus, to define a matrix, we give a vector, and we indicate the number of rows and columns. The vector then fills in the empty matrix shape column by column by default.
A = matrix(10:15,nrow=3,ncol=2)
A # notice the matrix is filled in column by column!
## [,1] [,2]
## [1,] 10 13
## [2,] 11 14
## [3,] 12 15
dim(A) # the dimension command finds the number of rows and columns, in that order
## [1] 3 2
R does the arithmetic \(\text{vector length} = (\text{number of rows}) \times (\text{number of columns})\) for us, so it is sufficient to specify the vector and either the number of rows or the number of columns.
matrix(10:15,nrow=3) # left out ncol=2 and we get same matrix as above
## [,1] [,2]
## [1,] 10 13
## [2,] 11 14
## [3,] 12 15
matrix(10:15,ncol=2) # left out nrow=3 and we get same matrix as above
## [,1] [,2]
## [1,] 10 13
## [2,] 11 14
## [3,] 12 15
If we want, we can fill in the matrix row by row (instead of column by column) by supplying the optional argument byrow=TRUE.
B = matrix(10:15, nrow=3, ncol=2, byrow=TRUE)
B # notice this matrix is filled row by row, and is different from the matrix A above!
## [,1] [,2]
## [1,] 10 11
## [2,] 12 13
## [3,] 14 15
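One way to see what byrow=TRUE does is to compare it with a transpose. The sketch below rebuilds B and compares it with a 2 x 3 matrix that is filled column by column and then transposed with t() (covered later in these notes). The name B_check is made up for this illustration.

```r
# B as defined above: filled row by row
B = matrix(10:15, nrow=3, ncol=2, byrow=TRUE)

# fill a 2 x 3 matrix column by column, then transpose it
B_check = t( matrix(10:15, nrow=2, ncol=3) )

all(B == B_check)   # every entry agrees
dim(B_check)        # 3 rows and 2 columns, like B
```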
Warning: An R vector is neither an \(n\times 1\) matrix nor a \(1 \times n\) matrix!
is.matrix(c(1,2,3))
## [1] FALSE
dim(c(1,2,3))
## NULL
If you want to use an R vector as an \(n\times 1\) matrix or a \(1 \times n\) matrix, you must first turn the vector into a matrix as introduced above. Or we can set the dimension attribute as follows (recall a matrix in R is a vector with a dimension attribute).
x=c(1,2,3) # first make x a vector
dim(x) = c(1,3) # now we give x a dimension attribute to make it a row vector
x # we see x is now a row vector
## [,1] [,2] [,3]
## [1,] 1 2 3
dim(x) = c(3,1) # let's change the dimension attribute to make x a column vector
x # we see x is now a column vector
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
We can similarly create the matrix A from above by first making a vector and then giving it a dimension attribute.
A_alternative = c(10,11,12,13,14,15) # first it is a vector
dim(A_alternative) = c(3,2) # then it becomes matrix
A_alternative # we see the same as A above
## [,1] [,2]
## [1,] 10 13
## [2,] 11 14
## [3,] 12 15
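Base R also has shortcuts for turning a plain vector into a one-column or one-row matrix. As a minimal sketch (the names v, col_v, and row_v are made up for illustration): as.matrix() applied to a plain vector gives a one-column matrix, and t() applied to a plain vector gives a one-row matrix.

```r
v = c(1, 2, 3)          # a plain vector, no dimension attribute

col_v = as.matrix(v)    # 3 x 1 matrix (column vector)
row_v = t(v)            # 1 x 3 matrix (row vector)

dim(col_v)
dim(row_v)
is.matrix(col_v)
is.matrix(row_v)
```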
To index into a matrix, we use a double index.
A[3,2] # this accesses the entry in row 3 column 2
## [1] 15
To obtain an entire row or column as a vector, we leave one index blank. The blank index is the one that “runs.”
A # recall A from before
## [,1] [,2]
## [1,] 10 13
## [2,] 11 14
## [3,] 12 15
A[ ,1] # obtains the first column as a vector
## [1] 10 11 12
A[1, ] # obtains the first row as a vector
## [1] 10 13
is.matrix(A[ ,1]) # notice the first column is not a matrix!
## [1] FALSE
is.matrix(A[1, ]) # notice the first row is not a matrix!
## [1] FALSE
We can similarly select submatrices using vectors in the double index.
A # Recall A from before
## [,1] [,2]
## [1,] 10 13
## [2,] 11 14
## [3,] 12 15
A[c(1,3),1:2] # this is the submatrix consisting of the 4 corners
## [,1] [,2]
## [1,] 10 13
## [2,] 12 15
Warning: When you index into a matrix to obtain something one-dimensional (e.g. an entire row or column, or just part of a row or column), then R will return a vector rather than a submatrix! If you want to keep the extracted part as a submatrix rather than a vector, include the argument drop=FALSE. See pages 80 and 81 of Matloff for more on unwanted dimension reduction.
A[1,1:2] # let's obtain the submatrix of top two corners, confusingly, it is actually a vector
## [1] 10 13
is.matrix(A[1,1:2])
## [1] FALSE
A[1,1:2,drop=FALSE] # include drop=FALSE to make the top 2 corners be a submatrix, we see it prints like a matrix
## [,1] [,2]
## [1,] 10 13
is.matrix( A[1,1:2,drop=FALSE] ) # indeed, it is a matrix
## [1] TRUE
We can do all the standard linear algebra for numerical matrices in R.
Matrix Multiplication: A %*% B
Scalar Multiplication: 5*A
Matrix Addition: A + B
Matrix Transposition: t(A)
Determinant of a Square Matrix: det(A)
Inverse of an Invertible Square Matrix: solve(A)
Eigenvalues and Eigenvectors of a Square Matrix: eigen(A)
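These operations only work when the dimensions are compatible. The matrices A and B defined earlier are both 3 x 2, so A %*% B itself would stop with a "non-conformable arguments" error, and det(), solve(), and eigen() need a square matrix. The sketch below reuses A and B and adds a small square matrix S that is made up for this illustration.

```r
A = matrix(10:15, nrow=3, ncol=2)               # from above
B = matrix(10:15, nrow=3, ncol=2, byrow=TRUE)   # from above
S = matrix(c(2, 1, 1, 3), nrow=2)               # hypothetical 2 x 2 example

A + B                 # entrywise sum, needs equal dimensions
5 * A                 # scalar multiplication
dim( A %*% t(B) )     # (3 x 2) times (2 x 3) gives 3 x 3
dim( t(A) %*% B )     # (2 x 3) times (3 x 2) gives 2 x 2

det(S)                # 2*3 - 1*1 = 5
S_inv = solve(S)      # inverse of S
S_inv %*% S           # the 2 x 2 identity, up to rounding error
eigen(S)$values       # eigenvalues; eigen(S)$vectors holds the eigenvectors
```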
apply Function for Evaluating a Function on the Rows or Columns of a Matrix
Suppose we have a function f() which takes a vector as input and returns a scalar as output. For instance, f() could be mean(), median(), var(), sd(), max(), min(), …
The apply function allows us to apply the function f() to the individual rows of a matrix and concatenate the outputs f(row_i) into a vector.
Similarly, by changing one parameter, the apply function allows us to apply the function f() to the individual columns of a matrix and concatenate the outputs f(column_j) into a vector.
Example: For the following matrix, find the median of each row and column and put them into two vectors. Recall that when we have an odd number of elements arranged in order, the median is the middle number. But if we have an even number of elements, the median is the average of the two middle numbers.
M = matrix(c(1,2,3,5,6,4),nrow=2,ncol=3,byrow=TRUE)
M
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 5 6 4
apply(M, 1, median) # the 1 means apply to the rows
## [1] 2 5
apply(M, 2, median) # the 2 means apply to the columns
## [1] 3.0 4.0 3.5
Notice the output is a vector!
The apply function also works when f() is vector valued, but the output of the apply function will be a matrix. You need to consider whether you want the resulting matrix or its transpose. In such a situation, try it out and see what you get.
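For instance, range() returns a vector of length two (the minimum and the maximum). In the sketch below, which reuses M from the example above, each call returns a matrix whose columns hold the individual results, so the row-wise call comes back "sideways" and t() puts one result per row. The names row_ranges and col_ranges are made up for illustration.

```r
M = matrix(c(1,2,3,5,6,4), nrow=2, ncol=3, byrow=TRUE)

row_ranges = apply(M, 1, range)   # one column of output per row of M
row_ranges
dim(row_ranges)                   # 2 x 2: (min, max) for each of the 2 rows

t(row_ranges)                     # transpose so each row of M gets its own row

col_ranges = apply(M, 2, range)   # one column of output per column of M
dim(col_ranges)                   # 2 x 3
```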
cbind() and rbind()
colSums() and rowSums()
colMeans() and rowMeans()
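As a brief sketch of the functions named above, reusing M from the apply example: cbind() and rbind() glue on columns and rows, while the sum and mean functions are shortcuts for the corresponding apply() calls. The vectors being added are arbitrary values chosen for illustration.

```r
M = matrix(c(1,2,3,5,6,4), nrow=2, ncol=3, byrow=TRUE)

cbind(M, c(7, 8))        # add a fourth column; result is 2 x 4
rbind(M, c(7, 8, 9))     # add a third row; result is 3 x 3

rowSums(M)               # 6 15
colSums(M)               # 6 8 7
rowMeans(M)              # 2 5
colMeans(M)              # 3.0 4.0 3.5

# the shortcuts agree with apply()
all.equal( rowSums(M), apply(M, 1, sum) )
all.equal( colMeans(M), apply(M, 2, mean) )
```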
A list is like a vector, but allows entry types to be mixed, and even allows nested multi-level structure. Let’s create an example of a list.
student = list(firstname='Mary', entrance_year=2019, honors=TRUE)
student # notice it prints differently from a vector, here the name of each entry is indicated $entryname
## $firstname
## [1] "Mary"
##
## $entrance_year
## [1] 2019
##
## $honors
## [1] TRUE
To access an element of a list and return that element with its data type, we have three options:
1. use the dollar sign $ and the element’s name without quotes, as in listname$entryname
2. use double brackets [[ ]] and the element’s name with quotes, as in listname[['entryname']]
3. use double brackets [[ ]] and the element’s index number without quotes, as in listname[[5]].
student$entrance_year # notice no quotes, returns native data type
## [1] 2019
student[['entrance_year']] # notice quotes, returns native data type
## [1] 2019
student[[2]] # notice no quotes, returns native data type
## [1] 2019
On the other hand, to return an element as a sublist, use a single bracket [] with the element’s name in quotes, or the index number without quotes.
student['entrance_year'] # this is a sublist!
## $entrance_year
## [1] 2019
student[2] # this is a sublist
## $entrance_year
## [1] 2019
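The difference matters as soon as we compute with the result. A small sketch, reusing the student list as first defined above: the double-bracket result is a number we can do arithmetic on, while the single-bracket result is still a list.

```r
student = list(firstname='Mary', entrance_year=2019, honors=TRUE)

is.numeric( student[['entrance_year']] )   # TRUE: the element itself
is.list( student['entrance_year'] )        # TRUE: a list of length 1

student[['entrance_year']] + 4             # arithmetic works on the element
# student['entrance_year'] + 4             # would give an error: a list is not a number

student[c('firstname', 'honors')]          # single brackets can select several elements
```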
To obtain the vector of names of elements of a list, we use the function names() as we did to obtain the vector of names of a vector. The length() function also works analogously.
names(student)
## [1] "firstname" "entrance_year" "honors"
length(student)
## [1] 3
We can add a new list element just like we did for vectors: index in with a new index and assign.
student[["graduation_year"]] = 2021
student
## $firstname
## [1] "Mary"
##
## $entrance_year
## [1] 2019
##
## $honors
## [1] TRUE
##
## $graduation_year
## [1] 2021
We can turn a list into a vector using the unlist() function. This will coerce to the most flexible type.
unlist(student) # makes a character vector
## firstname entrance_year honors graduation_year
## "Mary" "2019" "TRUE" "2021"
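Which type counts as "most flexible" depends on what the list contains: logical values can be represented as numbers, and both can be represented as character strings. A short sketch with two made-up lists (the names num_vec and chr_vec are only for illustration):

```r
num_vec = unlist( list(a = 1, b = TRUE) )            # numeric: TRUE becomes 1
chr_vec = unlist( list(a = 1, b = TRUE, c = 'x') )   # character: everything becomes a string

num_vec
class(num_vec)
chr_vec
class(chr_vec)
```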
lapply for Evaluating a Function on the Elements of a List to Get Another List, and its Sister sapply to Get a Vector
We learned about the apply function above for evaluating a function f() on the rows or columns of a matrix to get a vector. The function lapply is the analogue for lists. We don’t specify 1 or 2 because there are no rows or columns.
numberlist = list(A=1:4, B=1:5, C=1:6)
numberlist
## $A
## [1] 1 2 3 4
##
## $B
## [1] 1 2 3 4 5
##
## $C
## [1] 1 2 3 4 5 6
lapply(numberlist, median) # outputs a list of the medians
## $A
## [1] 2.5
##
## $B
## [1] 3
##
## $C
## [1] 3.5
sapply(numberlist, median) # outputs a vector of the medians
## A B C
## 2.5 3.0 3.5
The s in sapply stands for simplify.
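Concretely, sapply does the work of lapply and then tries to simplify the result: a list of single numbers becomes a vector, and a list of equal-length vectors becomes a matrix. A sketch reusing numberlist from above (the name r is made up):

```r
numberlist = list(A=1:4, B=1:5, C=1:6)

# simplifying the lapply output by hand gives the same named vector
all.equal( unlist(lapply(numberlist, median)), sapply(numberlist, median) )

# with a vector-valued function such as range(), sapply returns a matrix
r = sapply(numberlist, range)
r
dim(r)         # 2 x 3, one column per list element
colnames(r)    # "A" "B" "C"
```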
A data frame is a list in which the elements are vectors of equal length. We can think of it like a spreadsheet, the difference being that the column names of a data frame are not cells in the data frame. A data frame is like a matrix in that it has a 2-dimensional structure with rows and columns, but it differs from a matrix because a matrix requires all columns to have the same type, while a data frame does not.
A column name in a data frame is called a variable in the sense of statistics and the entries in a column are observations of that variable, or measurements of that variable. Notice that the statistical usage of the term variable for a column name is different from the usage of the term variable in programming!
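To make the definition concrete, here is a minimal sketch that builds a small data frame from vectors of equal length but different types. The data frame students and its values are invented for this illustration.

```r
students = data.frame(
  firstname     = c('Mary', 'John', 'Ana'),
  entrance_year = c(2019, 2020, 2020),
  honors        = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep firstname as character in older R versions too
)

students
is.list(students)        # a data frame is a list ...
length(students)         # ... whose elements are the 3 columns
dim(students)            # 3 rows and 3 columns
sapply(students, class)  # each column keeps its own type
```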
Most data sets in R are data frames (or souped-up data frames called tibbles).
For instance, the built-in data set mtcars is a data frame.
typeof(mtcars)
## [1] "list"
class(mtcars)
## [1] "data.frame"
is.data.frame(mtcars)
## [1] TRUE
# View(mtcars) # View with capital V to view it
names(mtcars) # names() returns the vector of names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
head(mtcars) # head() prints the first few rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
That’s really difficult to read, so we can use the function knitr::kable() to improve readability in a knitted file.
knitr::kable( head(mtcars) ) # head() prints the first few rows in table form
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
knitr::kable( tail(mtcars) ) # tail() prints the last few rows in table form
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.7 | 0 | 1 | 5 | 2 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |
| Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.5 | 0 | 1 | 5 | 4 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.5 | 0 | 1 | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.6 | 0 | 1 | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.6 | 1 | 1 | 4 | 2 |
str(mtcars) # str() tells us about the structure, and lists the first measurements, the command str() applies to many other kinds of R objects as well
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars) # summary() gives us basic descriptive statistics, and helps us discover if a variable is incorrectly coded as continuous/factor variable, or if someone erroneously typed 0 for NA, or other problems
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
We see that the number of cylinders should probably be coded as a factor rather than as a continuous variable, because fractional cylinders don’t make physical sense. Let’s recode it as a factor and see how the summary changes.
mtcars_new = mtcars # first define new as old
mtcars_new$cyl = as.factor(mtcars_new$cyl) # change the cylinders column into a factor variable
summary(mtcars_new) # notice the summary for cyl now tells us how many cars have 4, 6, 8 cylinders in the data set
## mpg cyl disp hp drat
## Min. :10.40 4:11 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
## Median :19.20 8:14 Median :196.3 Median :123.0 Median :3.695
## Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.597
## 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :33.90 Max. :472.0 Max. :335.0 Max. :4.930
## wt qsec vs am
## Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000
## Median :3.325 Median :17.71 Median :0.0000 Median :0.0000
## Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062
## 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000
## gear carb
## Min. :3.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :2.000
## Mean :3.688 Mean :2.812
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :8.000
We fixed cylinders. Do you see any other variables that should be recoded?
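One possible answer, offered as a suggestion rather than the only reasonable choice: according to the help page ?mtcars, vs records the engine shape (0 = V-shaped, 1 = straight) and am records the transmission (0 = automatic, 1 = manual), so both are categorical. The sketch below recodes them with descriptive labels; the name mtcars_new2 is made up for illustration.

```r
mtcars_new2 = mtcars                                   # start from the original data
mtcars_new2$cyl = as.factor(mtcars_new2$cyl)           # as before
mtcars_new2$vs  = factor(mtcars_new2$vs, levels = c(0, 1),
                         labels = c('V-shaped', 'straight'))
mtcars_new2$am  = factor(mtcars_new2$am, levels = c(0, 1),
                         labels = c('automatic', 'manual'))

summary( mtcars_new2[ , c('cyl', 'vs', 'am')] )        # counts instead of quartiles
table(mtcars_new2$am)                                  # cars per transmission type
```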
Two functions are useful for an initial look at the data: pairs and boxplot.
pairs(mtcars) # pairs() makes pairwise scatterplots of all variables
What can you say about mpg versus weight (the wt column)? Scatterplots involving factor variables are not informative, so really we should only apply pairs() to the continuous variables.
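As a sketch of that advice, we can pass pairs() only the columns that are measured on a continuous scale. Which columns count as continuous is a judgment call; the selection below is one reasonable choice, and cont_vars is a made-up name. The correlation is a numerical companion to the mpg versus wt scatterplot.

```r
cont_vars = c('mpg', 'disp', 'hp', 'drat', 'wt', 'qsec')
pairs( mtcars[ , cont_vars] )    # scatterplots of the continuous variables only

cor(mtcars$mpg, mtcars$wt)       # negative: heavier cars tend to have lower mpg
```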
For plotting a continuous variable versus a categorical variable, it is more informative to look at box plots.
boxplot( mtcars$mpg ~ mtcars$cyl,
main = "Boxplots of Fuel Economy versus Cylinders",
xlab = "Number of Cylinders",
ylab = "Fuel Economy in mpg")
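The thick line inside each box is the group median. As a numerical companion to the plot, the sketch below computes those medians with tapply(), a base R function not otherwise covered in these notes, and redraws the same boxplot using the data argument so that the column names can be written without the mtcars$ prefix. The name med_by_cyl is made up.

```r
# median mpg within each cylinder group
med_by_cyl = tapply(mtcars$mpg, mtcars$cyl, median)
med_by_cyl

# same plot as above, written with data=
boxplot( mpg ~ cyl, data = mtcars,
         main = "Boxplots of Fuel Economy versus Cylinders",
         xlab = "Number of Cylinders",
         ylab = "Fuel Economy in mpg")
```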
This completes a short survey of useful commands for a data frame. To initially explore a data frame, much more should be done under the heading of exploratory data analysis. We’ll return to that later.
Since a data frame is a list of vectors with the same length, and these vectors are the columns of the data frame, we can extract a column of a data frame in the same ways we extract an element of a list.
To access a column of a data frame as a vector, we have three options (and more):
1. use the dollar sign $ and the column’s name without quotes, as in dataframename$columnname
2. use double brackets [[ ]] and the column’s name with quotes, as in dataframename[['columnname']]
3. use double brackets [[ ]] and the column’s index number without quotes, as in dataframename[[5]].
Additionally, we can use matrix-like double indices with single brackets to access a column of a data frame as a vector.
4. dataframename[ ,'columnname']
5. dataframename[ ,5]
We can use the matrix notation in 4. and 5. above to extract a 2-dimensional subarray, which is automatically a data frame.
mtcars[10:12,c('mpg','hp')]
## mpg hp
## Merc 280 19.2 123
## Merc 280C 17.8 123
## Merc 450SE 16.4 180
is.data.frame( mtcars[10:12,c('mpg','hp')] )
## [1] TRUE
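A quick check, using the hp column of mtcars (column number 4), that the five ways of extracting a column listed above all return the same vector. The names hp1 through hp5 are made up for illustration.

```r
hp1 = mtcars$hp          # dollar sign
hp2 = mtcars[['hp']]     # double brackets with name
hp3 = mtcars[[4]]        # double brackets with index
hp4 = mtcars[ , 'hp']    # matrix-like, by name
hp5 = mtcars[ , 4]       # matrix-like, by index

identical(hp1, hp2)
identical(hp1, hp3)
identical(hp1, hp4)
identical(hp1, hp5)
is.data.frame(hp4)       # FALSE: a single column comes back as a vector
```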
drop=FALSE to Make a Data Frame
Earlier we discussed 5 ways to obtain a column (or part of a column) as a vector. But how do we obtain it as a data frame? Use the option drop=FALSE in the matrix-like single-bracket notation.
mtcars[1:5,'hp'] # this is the vector of first 5 entries in the hp column
## [1] 110 110 93 110 175
is.data.frame( mtcars[1:5,'hp'] ) # it is not a data frame
## [1] FALSE
mtcars[1:5,'hp',drop=FALSE] # including drop=FALSE makes a data frame
## hp
## Mazda RX4 110
## Mazda RX4 Wag 110
## Datsun 710 93
## Hornet 4 Drive 110
## Hornet Sportabout 175
is.data.frame( mtcars[1:5,'hp', drop=FALSE] )
## [1] TRUE
We can extract a sub data frame by putting a column condition into the first part of the double index. Remember to use the dollar sign $!
Let’s find the sub data frame of mtcars that has fuel economy better than 25 mpg.
mtcars[mtcars$mpg>25, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
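Conditions can be combined with & (and) and | (or). The sketch below narrows the previous result to cars that also have 5 gears, and shows the base R function subset() as an alternative that lets us drop the mtcars$ prefix. The names efficient_5 and efficient_5b are made up for illustration.

```r
# rows with mpg above 25 AND exactly 5 gears
efficient_5 = mtcars[ mtcars$mpg > 25 & mtcars$gear == 5, ]
efficient_5

# the same selection written with subset()
efficient_5b = subset(mtcars, mpg > 25 & gear == 5)
all.equal(efficient_5, efficient_5b)
```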
$ to Add a Column to a Data Frame with a Formula in Terms of Other Columns
Let’s create a new data frame mtcars01 which has an additional column mpg01 which is 1 if a car has above average fuel economy, otherwise 0.
mtcars01 = mtcars # first define the new df to be the old df
mtcars01$mpg01 = numeric( length(mtcars$mpg) ) # append a column of zeros named mpg01
knitr::kable( head(mtcars01) ) # confirm new column by looking at a few rows
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | 0 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | 0 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 0 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 0 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 | 0 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 | 0 |
mean(mtcars$mpg) # find the average mpg for reference
## [1] 20.09062
mtcars01[ mtcars$mpg > mean(mtcars$mpg), 'mpg01' ] = 1 # change the new 0s to 1s as appropriate
knitr::kable( head(mtcars01) ) # confirm change worked by looking at a few rows
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | 1 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | 1 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 | 0 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 | 0 |
We have created a new data frame mtcars01 which has an additional column mpg01 which is 1 if a car has above average fuel economy, otherwise 0.
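The same column can also be built in one step, because a comparison returns a logical vector and as.numeric() converts TRUE to 1 and FALSE to 0. This is a sketch of an alternative, not a replacement for the two-step approach above; the names mtcars01_alt and mpg01_ifelse are made up.

```r
mtcars01_alt = mtcars
mtcars01_alt$mpg01 = as.numeric( mtcars$mpg > mean(mtcars$mpg) )

# ifelse() is another common way to write the same rule
mpg01_ifelse = ifelse( mtcars$mpg > mean(mtcars$mpg), 1, 0 )

all(mtcars01_alt$mpg01 == mpg01_ifelse)   # both versions agree
table(mtcars01_alt$mpg01)                 # how many cars are at or below (0) and above (1) average
```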
lapply and sapply also Apply to Data Frames, Since Each Data Frame is a List
Tip: remember lapply outputs a list, so if you wanted a data frame, convert it using as.data.frame().
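A short sketch of both functions on mtcars: sapply gives one number per column, and lapply followed by as.data.frame() gives a transformed data frame. Dividing each column by its maximum is an arbitrary transformation chosen for illustration, and the names scaled_list and scaled are made up.

```r
sapply(mtcars, mean)                  # named vector of column means

scaled_list = lapply(mtcars, function(column) column / max(column))
is.data.frame(scaled_list)            # FALSE: lapply returns a plain list

scaled = as.data.frame(scaled_list)   # convert back to a data frame
rownames(scaled) = rownames(mtcars)   # the car names are not carried through the list
is.data.frame(scaled)                 # TRUE
sapply(scaled, max)                   # every column now has maximum 1
```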