Install R and RStudio following the instructions in this document.
Watch the videos about RStudio and R Markdown in the aforementioned document.
Read Sections 27.1 - 27.4.3 of Wickham’s book R for Data Science.
Download and save the R Markdown Cheatsheet for future reference. Read only the left column of Page 2 called Pandoc’s Markdown. Don’t worry about the rest.
Do HW 1 on Canvas.
Complete the DataCamp course Introduction to R through our free class subscription.
Read lightly the Introduction, Chapter 1, and Chapter 2 of Matloff’s book. Don’t worry about understanding everything, just read it lightly. Focus on understanding Sections 1.2, 1.4, 1.7.1, 1.7.2, and all of Chapter 2 except the Extended Examples.
In this course, you will learn how to use R Markdown in RStudio in order to produce a complete record of your data analyses. Such a record would be a technical appendix to a report you are writing for your job or a scientific investigation. The audience for your technical appendix is yourself and anyone else interested in how you obtained the results!
Here is an example of a reproducibility appendix we discussed in class.
We will primarily interact with R through the interactive development environment called RStudio.
RStudio has multiple slidable windows that interact with each other. Here are a few descriptions.
In the console we work interactively and type commands.
In the file editor window we can save our work to R Markdown files or to scripts, or other kinds of files. This is where we will do our homework. We can run a line to the console using CTRL+Enter. We can run a code block to the console with CTRL+SHIFT+Enter.
environment window
history window
help window (you can search in the help window, or you can type ?word in the console and the answer will appear in the help window)
plots window.
Try it out!
In the console type 2+2.
In the console type print("Hello world!").
In the console type mean(c(1,2,3)).
In the console type x=5.
In the console type ?mean.
In the console type plot(1,2).
Then look at the plot window, environment window, history window, and help window.
This is how we will do non-DataCamp homework, and it is an important tool for reproducible research. With R Markdown we can provide a complete record of our code in html, pdf, or Word format, together with explanatory text. To do this, one writes a plain text file with .Rmd extension, and then one “knits” it with the knit button.
See Sections 27.1 - 27.4.3 of Wickham’s book R for Data Science.
Important points:
To open a new Rmd file in RStudio, use the File button.
Delete the generic statements.
Modify the YAML header to your name, etc.
To make a new section heading use
## Descriptive Heading Name.
Notice the space after ## and notice this is outside of codeblocks!
Include an Overview heading and External Requirements heading.
Code is written in codeblocks. Create a codeblock with the green insert button, or CTRL+ALT+I.
In a codeblock, # does not create a heading. Instead it creates a comment that is ignored by the computer.
To run the current line with the cursor from a codeblock to the console, use CTRL+Enter.
To run the current code block with the cursor to the console, use CTRL+SHIFT+Enter.
Outside of codeblocks, you can use LaTeX code between $ $ to make math formulas. For instance \(E=mc^2\) is typeset with $E=mc^2$.
Outside of codeblocks, backticks make a verbatim environment. Here verbatim was typeset with `verbatim`.
Knitting to PDF won’t work for you unless you already have a LaTeX installation that properly communicates with RStudio. So instead, to create a pdf file, knit to html or knit to Word, and pdf print that file from your browser or Word. If your computer does not yet have the capability to print to pdf, then install CutePDF or some similar free product.
If your Rmd file has a single error, it won’t knit. So I recommend writing a small piece and then knitting it, in order to find your errors and debug them. To debug a codeblock, run it on its own in the console rather than knitting the whole file.
Look again at the example of a reproducibility appendix and read it (skip the long code for the graphs and the hypothesis test).
Code the pdf document passed out in class in RMarkdown, knit it to html and Word, pdf print it, and upload both to Canvas HW1.
Various operations and commands we will want to do for data analysis in R only work for certain kinds of objects. A common source of bugs is to attempt to apply a command to an object of the wrong type. So let’s understand the most elementary kinds of objects with data types: scalars.
In R, a scalar is just a vector of length 1. There are 4 data types of scalars (and vectors) we will deal with. These data types are also sometimes called modes.
Logical (aka Boolean): TRUE and FALSE, or equivalently T and F
Integer: whole numbers such as -100, -1, 0, 1, 55, but with an L appended to indicate integer, so -100L, -1L, 0L, 1L, 55L
Double (aka Float): real numbers, such as -100, -1, 0, 1, 1.0, pi, .78, 34/59
Character (aka String): any sequence of symbols in single or equivalently double quotes, such as "Hello" or equivalently 'Hello', or such as '12 abC'. Notice: a space is a string character, and case matters.
To determine the type of an object, use the command typeof().
typeof(TRUE)
## [1] "logical"
typeof(-100L)
## [1] "integer"
typeof(-100)
## [1] "double"
typeof("Hello world!")
## [1] "character"
typeof("-100")
## [1] "character"
To ask if an object is a specific one of these types, use the commands is.logical(), is.integer(), is.double(), is.character().
is.logical(-100)
## [1] FALSE
is.integer(-100)
## [1] FALSE
is.double(-100)
## [1] TRUE
is.character(-100)
## [1] FALSE
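A point that the checks above hint at is that a whole number typed without the L suffix is stored as a double. The following is a small sketch of that behavior; the variable names a and b are only for illustration, and is.numeric() is a base R function not otherwise used in these notes.

```r
# A whole-number literal is a double unless it carries the L suffix.
a = 5
b = 5L

typeof(a)       # "double"
typeof(b)       # "integer"

is.integer(a)   # FALSE, even though 5 is a whole number
is.integer(b)   # TRUE

# is.numeric() is TRUE for both integers and doubles.
is.numeric(a)   # TRUE
is.numeric(b)   # TRUE
```

So if a command insists on an integer, check the type with typeof() before assuming a whole number qualifies.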
The comparison operators include ==, <=, >=, and !=. These also work for comparing vectors, as we shall soon see.
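Here is a short sketch of the comparison operators applied to scalars. Each comparison returns a logical value. The last two lines are an addition not covered above: comparing doubles with == can surprise you because of rounding, and base R's all.equal() is one way to compare with a tolerance. The variable names are only for illustration.

```r
# Each comparison returns a single logical value.
same      = (3 == 3)       # TRUE
different = (3 != 3)       # FALSE
at_most   = (2 <= 5)       # TRUE
at_least  = (2 >= 5)       # FALSE
case_test = ("a" == "A")   # FALSE, because case matters for strings

# Doubles are stored with rounding, so exact equality can fail.
exact  = (0.1 + 0.2 == 0.3)                 # FALSE
approx = isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE
```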
Variables do not need to be declared in advance in R. x=10 or x<-10 both save 10 to x. The symbol = is faster to type, so I prefer that.
Notice that when we save 10 to x in the console, nothing is returned.
Variables can be reassigned at will, even with different types. For instance we could subsequently type x="Cool!" and now x has this value.
R uses “pass by value”, which makes it easier to reason about code. For instance, in the following, changing x later does not change y.
x=10
y=x
x=20
y
## [1] 10
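The same behavior holds for vectors and for values handed to a function. The sketch below assumes a small helper, set_first_to_zero, which is hypothetical and written only for this illustration.

```r
# Copying a vector and then changing the original leaves the copy alone.
x = c(1, 2, 3)
y = x
x[1] = 100
x   # 100 2 3
y   # 1 2 3

# Hypothetical helper: it changes its own copy of v, not the caller's vector.
set_first_to_zero = function(v) {
  v[1] = 0
  v
}

z = c(7, 8, 9)
w = set_first_to_zero(z)
z   # 7 8 9, unchanged
w   # 0 8 9
```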
All elements of a vector must have the same data type (mode). Here are a few ways to create vectors.
c() applied to elements “concatenates” them
1:5 creates a vector of integers
5:1 creates a vector of integers in the opposite order
c() applied to vectors concatenates them in a flat way; it does not create a 2-level structure.
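The ways of creating vectors listed above can be tried as follows. This is a minimal sketch, and the variable names are only for illustration.

```r
v    = c(2, 4, 6)             # concatenate three elements
up   = 1:5                    # the integers 1 2 3 4 5
down = 5:1                    # the integers 5 4 3 2 1
flat = c(c(1, 2), c(3, 4))    # one flat vector 1 2 3 4, not a vector of vectors

length(flat)   # 4
typeof(up)     # "integer"
typeof(v)      # "double", since the entries were typed without L
```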
We can store a vector in a variable x using x = . We can give the entries names with
names(x) = c(a vector of names of the same length)
x=5:8
names(x)=c("p","q","r","s")
x[3]
## r
## 7
x["r"]
## r
## 7
x[2:4]
## q r s
## 6 7 8
x[c(F,T,T,T)]
## q r s
## 6 7 8
c(1,2,3) + c(10,11,12) # add corresponding entries
## [1] 11 13 15
c(1,2,3) * c(10,11,12) # multiply corresponding entries
## [1] 10 22 36
c(1,2,3) / c(10,20,30) # divide corresponding entries
## [1] 0.1 0.1 0.1
5*c(1,2,3) # scalar multiplication, in other words multiply each entry by scalar
## [1] 5 10 15
c(1,2,3)*5 # also scalar multiplication
## [1] 5 10 15
c(1,2,3) +5 # add scalar to each entry
## [1] 6 7 8
c(1,2,3,4) >2 # do the logical comparison for each entry
## [1] FALSE FALSE TRUE TRUE
c(1,2,1,1,5) == 1 # do the logical comparison for each entry
## [1] TRUE FALSE TRUE TRUE FALSE
# to find all locations of the repeated 1's
c(1,2,3,4,5) == c(8,2,3,4,8) # do the logical comparison of corresponding entries
## [1] FALSE TRUE TRUE TRUE FALSE
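The comment above about finding the locations of the repeated 1's can be carried one step further. A logical vector says TRUE or FALSE for each position; to turn it into the positions themselves, one option is base R's which(), a function not otherwise used in these notes. A small sketch:

```r
v = c(1, 2, 1, 1, 5)
is_one = (v == 1)            # TRUE FALSE TRUE TRUE FALSE
positions = which(is_one)    # the indices where is_one is TRUE
positions                    # 1 3 4
```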
Another aspect of vectorization is that many functions apply elementwise to vectors: squaring, square root, trig functions, exponentials, logarithms, etc.
c(1,2,3)^2
## [1] 1 4 9
sqrt(c(1,4,9))
## [1] 1 2 3
sin(c(0,pi/2,pi))
## [1] 0.000000e+00 1.000000e+00 1.224606e-16
exp(c(1,2,3))
## [1] 2.718282 7.389056 20.085537
log(c(exp(1),exp(2),exp(3))) # Aha! log means natural log, not base 10!
## [1] 1 2 3
log10(c(10^1,10^2,10^3))
## [1] 1 2 3
Recall that we have seen how to obtain a subvector from a vector using a logical vector: only the elements indexed by TRUEs are included in the subvector.
c('abc','mno','xyz')[c(TRUE,FALSE,TRUE)]
## [1] "abc" "xyz"
Recall also that we saw how to obtain a logical vector from a condition on a vector.
c(-10,10,0,40,-5) >0
## [1] FALSE TRUE FALSE TRUE FALSE
The combination of these two allows us to elegantly filter a vector according to a condition! Let’s find the subvector of the following vector consisting only of positive entries, and the subvector consisting only of negative entries, both using filtering.
x=c(-10,10,0,40,-5)
x[x>0]
## [1] 10 40
x[x<0]
## [1] -10 -5
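Filtering also works with more than one condition. The sketch below uses the elementwise "and" operator &, which has not been introduced in these notes so far, so treat it as a preview. It also shows what comes back when no entry satisfies the condition.

```r
x = c(-10, 10, 0, 40, -5)

small_positive = x[x > 0 & x < 20]   # entries that are positive AND below 20
small_positive                       # 10

nonzero = x[x != 0]                  # drop the zero entry
nonzero                              # -10 10 40 -5

none = x[x > 100]                    # no entry qualifies
length(none)                         # 0, an empty vector
```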
Recall that we said a vector in R has all of its entries of the same data type (mode). They must all be logical, or all integer, or all double, or all character. But what happens if we try to use c() to create a mixed vector?
x=c(TRUE,"Blue")
x
## [1] "TRUE" "Blue"
Wait a minute! Did the logical TRUE just become the string "TRUE"? Yes! When we try to create a mixed vector using c(), it coerces all the entries to switch to the most flexible entry type in the vector. The order from least flexible to most flexible is: logical, integer, double, character. In going from logical to integer, TRUE becomes 1 and FALSE becomes 0.
x=c(TRUE,5)
x
## [1] 1 5
To determine other behaviors experimentally, use the functions as.logical(), as.integer(), as.double(), as.character().
as.double(FALSE)
## [1] 0
as.character(FALSE)
## [1] "FALSE"
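A few more experiments in the same spirit are sketched below. The variable names are only for illustration. The last conversion cannot succeed, so R returns NA and issues a warning; suppressWarnings() is used here only to keep the output quiet.

```r
from_string = as.integer("12")    # 12L: a string of digits becomes an integer
truncated   = as.integer(3.9)     # 3L: the fractional part is dropped, not rounded
zero_logic  = as.logical(0)       # FALSE
five_logic  = as.logical(5)       # TRUE: any nonzero number becomes TRUE
as_text     = as.character(3.5)   # "3.5"

# "abc" is not a number, so the result is NA (a missing value).
failed = suppressWarnings(as.integer("abc"))
is.na(failed)                     # TRUE
```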
Coercion also happens when we apply a mathematical function to a non-numeric type. For instance, plus:
TRUE+TRUE
## [1] 2
sum(c(T,F,F,T,T)) # Finds for us the TOTAL NUMBER of TRUEs!
## [1] 3
mean(c(T,F,F,T,T)) # Finds for us the PROPORTION of TRUEs!
## [1] 0.6
Using the foregoing, coercion can be quite useful. Let’s determine the number of cars in the mtcars data set with mpg greater than 15, and the proportion of cars in the mtcars data set with mpg greater than 15.
# ?mtcars # to read about it
# View(mtcars) # to view it
mtcars$mpg > 15
## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
sum(mtcars$mpg > 15) # this is number of cars with mpg > 15
## [1] 26
mean(mtcars$mpg > 15) # this is proportion of cars with mpg > 15
## [1] 0.8125
dim(mtcars) # let's find the number of rows and columns in the data set
## [1] 32 11
mean(mtcars$mpg > 15) == 26/32 # check if it is as expected
## [1] TRUE
Another application of coercion to data analysis is the computation of the error rate of a machine learning technique. Suppose the true observations of a binary classification on 5 individuals are 1,0,0,1,1 and suppose a machine learning technique predicts on the 5 individuals the classification 1,1,1,1,1.
Use coercion to compute the error rate. The error rate is the proportion of incorrect predictions.
actual = c(1,0,0,1,1)
prediction = c(1,1,1,1,1)
mean(prediction != actual)
## [1] 0.4
mean(prediction != actual) == 2/5 # check if it is what we expect
## [1] TRUE
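The same logical vector answers a few related questions. The sketch below reuses the data from the example; the names wrong, n_errors, accuracy, and wrong_positions are only for illustration, "accuracy" here simply means the proportion of correct predictions, and which() is a base R function not otherwise used in these notes.

```r
actual     = c(1, 0, 0, 1, 1)
prediction = c(1, 1, 1, 1, 1)

wrong = (prediction != actual)     # FALSE TRUE TRUE FALSE FALSE

n_errors   = sum(wrong)            # 2 incorrect predictions
error_rate = mean(wrong)           # 0.4
accuracy   = mean(prediction == actual)   # 0.6, the proportion correct

wrong_positions = which(wrong)     # 2 3: the individuals that were misclassified
```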
Indexing can also be used to reassign an entry.
x=c(10,20)
x[2] = -55
x
## [1] 10 -55
Indexing can also be used to append entries. We can also append using c() in combination with reassignment.
x=c(10,20)
x[3]=30 # append a third entry 30 by indexing in to the not-yet-existing third index, it gets created in the process!
x
## [1] 10 20 30
x=c(x,40) # append a fourth entry 40 using c()
x
## [1] 10 20 30 40
The command length() tells us how many entries a vector has, which is useful when we don’t already know the length of a vector.
length(c(1,2,3,4))
## [1] 4
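One place where length() helps is appending to a vector whose length we have not counted by hand. A minimal sketch, combining length() with indexing:

```r
x = c(10, 20, 30)

# The next free index is length(x) + 1, whatever the current length is.
x[length(x) + 1] = 40
x                # 10 20 30 40

# length(x) is also the index of the last entry.
last = x[length(x)]
last             # 40
```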
To quickly create vectors of a given type with default entries, use logical(), numeric(), character().
logical(5)
## [1] FALSE FALSE FALSE FALSE FALSE
!logical(5) # the exclamation point means "not" and its application here is vectorized
## [1] TRUE TRUE TRUE TRUE TRUE
numeric(5)
## [1] 0 0 0 0 0
character(5)
## [1] "" "" "" "" ""
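A typical use of these default vectors is to create a container of the right size first and fill it in afterwards. The sketch below uses a for loop, which has not been introduced in these notes so far, so treat it as a preview; the name results is only for illustration.

```r
# Start with five zeros, then overwrite each entry in turn.
results = numeric(5)
for (i in 1:5) {
  results[i] = i^2
}
results   # 1 4 9 16 25
```

For this particular computation the vectorized form (1:5)^2 gives the same values in one step.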
We can repeat a sequence as well.
rep(c(1,2,3), times=4)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
We can also create a sequence that jumps by a fixed step.
seq(from = 1, to = 100, by = 2)
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
seq(from = 1, by = 2, length.out = 10)
## [1] 1 3 5 7 9 11 13 15 17 19
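A few variations on rep() and seq() are sketched below. The arguments each, to, by, and length.out are standard arguments of these base R functions; the variable names are only for illustration.

```r
# each = 2 repeats every entry twice before moving on to the next one.
doubled = rep(c(1, 2, 3), each = 2)
doubled     # 1 1 2 2 3 3

# Give the endpoints and the number of points; R works out the step.
grid = seq(from = 0, to = 1, length.out = 5)
grid        # 0.00 0.25 0.50 0.75 1.00

# A negative step counts downward.
countdown = seq(from = 10, to = 1, by = -3)
countdown   # 10 7 4 1
```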
By now you should be nearly finished with the first DataCamp course, and starting the second one.