Stats 535 Lecture 1: Reproducibility Appendix to Showcase Course Objectives, RStudio, R Markdown, and Vectors

Homework

Install R and RStudio following the instructions in this document.
Watch the videos about RStudio and R Markdown in the aforementioned document.
Read Sections 27.1 - 27.4.3 of Wickham’s book R for Data Science.
Download and save the R Markdown Cheatsheet for future reference. Read only the left column of Page 2 called Pandoc’s Markdown. Don’t worry about the rest.
Do HW 1 on Canvas.
Do the DataCamp course Introduction to R through our free class subscription.
Read lightly the Introduction, Chapter 1, and Chapter 2 of Matloff’s book. Don’t worry about understanding everything, just read it lightly. Focus on understanding Sections 1.2, 1.4, 1.7.1, 1.7.2, and all of Chapter 2 except the Extended Examples.

Motivational Case Study of Reproducible Research: A Reproducibility Appendix about Life Expectancy in Developing Countries

In this course, you will learn how to use R Markdown in RStudio in order to produce a complete record of your data analyses. Such a record would be a technical appendix to a report you are writing for your job or a scientific investigation. The audience for your technical appendix is yourself and anyone else interested in how you obtained the results!

Here is an example of a reproducibility appendix we discussed in class.

Introduction to RStudio

We will primarily interact with R through the integrated development environment (IDE) called RStudio.

RStudio has multiple slidable windows that interact with each other. Here are a few descriptions.

In the console we work interactively and type commands.
In the file editor window we can save our work to R Markdown files or to scripts, or other kinds of files. This is where we will do our homework. We can run a line to the console using CTRL+Enter. We can run a code block to the console with CTRL+SHIFT+Enter.
environment window
history window
help window (you can search in the help window, or you can type ?word in the console and the answer will appear in the help window)
plots window.

Try it out!

In the console type 2+2.

In the console type print("Hello world!").

In the console type mean(c(1,2,3)).

In the console type x=5.

In the console type ?mean.

In the console type plot(1,2).

Then look at the plot window, environment window, history window, and help window.

Introduction to R Markdown and Knitting to Html, Word, or PDF

This is how we will do non-DataCamp homework, and it is an important tool for reproducible research. With R Markdown we can provide a complete record of our code in html, pdf, or Word format, together with explanatory text. To do this, one writes a plain text file with .Rmd extension, and then one “knits” it with the knit button.

See Sections 27.1 - 27.4.3 of Wickham’s book R for Data Science.

Important points:

To open a new Rmd file in RStudio, use the File button.
Delete the generic statements.
Modify the YAML header to your name, etc.
To make a new section heading use
## Descriptive Heading Name.
Notice the space after ## and notice this is outside of codeblocks!
Include an Overview heading and External Requirements heading.
Code is written in codeblocks. Create a codeblock with the green insert button, or CTRL+ALT+I.
In a codeblock, # does not create a heading. Instead it creates a comment that is ignored by the computer.
To run the current line with the cursor from a codeblock to the console, use CTRL+Enter.
To run the current code block with the cursor to the console, use CTRL+SHIFT+Enter.
Outside of codeblocks, you can use LaTeX code between $ $ to make math formulas. For instance, the formula E = mc² is typeset with $E=mc^2$.
Outside of codeblocks, backticks make a verbatim environment. Here verbatim was typeset with `verbatim`.
Knitting to PDF won’t work for you unless you already have a LaTeX installation that properly communicates with RStudio. So instead, to create a pdf file, knit to html or knit to Word, and pdf print that file from your browser or Word. If your computer does not yet have the capability to print to pdf, then install CutePDF or some similar free product.
If your Rmd file has a single error, it won’t knit. So, I recommend writing a small piece and then knitting it, in order to find your errors and debug them. To debug a codeblock, run it individually to the console rather than knitting the whole file.

Look again at the example of a reproducibility appendix and read it (skip the long code for the graphs and the hypothesis test).

Homework 1

Code the pdf document passed out in class in R Markdown, knit it to html and Word, pdf print it, and upload both to Canvas HW1.

Data Types of Scalars (=Length-One Vectors)

Various operations and commands we will want to do for data analysis in R only work for certain kinds of objects. A common source of bugs is to attempt to apply a command to an object of the wrong type. So let’s understand the most elementary kinds of objects with data types: scalars.

In R, a scalar is just a vector of length 1. There are 4 data types of scalars (and vectors) we will deal with. These data types are also sometimes called modes.

Logical (aka Boolean): TRUE and FALSE, or equivalently T and F
Integer: whole numbers such as -100, -1, 0, 1, 55, but with an L appended to indicate integer, so -100L, -1L, 0L, 1L, 55L
Double (aka Float): real numbers, such as -100, -1, 0, 1, 1.0, pi, .78, 34/59
Character (aka String): any sequence of symbols in single or equivalently double quotes, such as "Hello" or equivalently 'Hello', or such as '12 abC'. Notice: a space is a string character, and case matters.

To determine the type of an object, use the command typeof().

typeof(TRUE)

## [1] "logical"

typeof(-100L)

## [1] "integer"

typeof(-100)

## [1] "double"

typeof("Hello world!")

## [1] "character"

typeof("-100")

## [1] "character"

To ask if an object is a specific one of these types, use the commands: is.logical(), is.integer(), is.double(), is.character().

is.logical(-100)

## [1] FALSE

is.integer(-100)

## [1] FALSE

is.double(-100)

## [1] TRUE

is.character(-100)

## [1] FALSE
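Here is a small sketch to experiment with (the variable names whole and plain are made up for this illustration). It suggests that the colon operator produces integers while numbers typed without an L are doubles, and that the older function mode() groups both of these together as "numeric".

```r
# Hypothetical example values, not from the lecture
whole = 1:3          # the colon operator creates integers
plain = c(1, 2, 3)   # numbers typed without L are doubles

typeof(whole)        # "integer"
typeof(plain)        # "double"

# mode() reports both integer and double as "numeric"
mode(whole)          # "numeric"
mode(plain)          # "numeric"

# is.numeric() is TRUE for both integers and doubles
is.numeric(whole)    # TRUE
is.numeric(plain)    # TRUE
```

So when the difference between integer and double matters, typeof() is the more precise tool.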

Download the In-Class Exercise File

Lecture 1 In-Class Exercises

Comparing Scalars

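Scalars are compared with the operators ==, <, >, <=, >=, and !=. Each comparison returns a logical value, TRUE or FALSE.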

These also work for comparing vectors, as we shall soon see.
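A few comparisons to try in the console are sketched below. The last two lines go slightly beyond the text above: they illustrate that doubles are stored approximately, so testing doubles with == can surprise us, and the base R function all.equal() is one way to compare up to a small tolerance.

```r
3 == 3        # TRUE: equal
3 != 3        # FALSE: not equal
2 < 5         # TRUE
5 >= 7        # FALSE
"a" == "A"    # FALSE: case matters for strings

# Doubles are stored approximately, so exact equality can fail
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal up to a small tolerance
```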

Assignment and Reassignment

Variables do not need to be declared in advance in R. x=10 or x<-10 both save 10 to x. The symbol = is faster to type, so I prefer that.

Notice that when we save 10 to x in the console, nothing is returned.

Variables can be reassigned at will, even with different types. For instance we could subsequently type x="Cool!" and now x has this value.

R uses “pass by value”, which makes it easier to reason about code. For instance, in the following, changing x later does not change y.

x=10
y=x
x=20
y

## [1] 10
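The same idea can be sketched for reassignment with a new type and for copies of vectors. The names a and b are chosen only for this illustration.

```r
x = 10
typeof(x)        # "double"
x = "Cool!"      # reassign with a value of a different type
typeof(x)        # "character"

# Copying a vector copies its values
a = c(1, 2, 3)
b = a
a[1] = 99        # change the first entry of a only
a                # 99 2 3
b                # 1 2 3, unchanged
```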

Creating Vectors

All elements of a vector must have the same data type (mode). Here are a few ways to create vectors.

c() applied to elements “concatenates” them

1:5 creates a vector of integers

5:1 creates a vector of integers in the opposite order

c() applied to vectors concatenates them in a flat way; it does not create a 2-level structure.

We can store a vector in a variable x using x =. We can give the entries names with

names(x)= c(a vector of names of same length)
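As a minimal sketch of these creation methods, try the following. The variable names and the entry names "first", "second", "third" are hypothetical, chosen only for illustration.

```r
v = c(2, 4, 6)            # concatenate elements into a vector
up = 1:5                  # 1 2 3 4 5
down = 5:1                # 5 4 3 2 1
flat = c(v, c(8, 10))     # 2 4 6 8 10: one flat vector, no nesting
length(flat)              # 5

# Give the entries of v names; the names vector has the same length as v
names(v) = c("first", "second", "third")
names(v)                  # "first" "second" "third"
```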

Indexing into Vectors

x=5:8
names(x)=c("p","q","r","s")
x[3]

## r 
## 7

x["r"]

## r 
## 7

x[2:4]

## q r s 
## 6 7 8

x[c(F,T,T,T)]

## q r s 
## 6 7 8
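The three kinds of index (positions, names, and logicals) can also select several entries at once. Here is a short sketch on made-up data; the vector heights and its names are hypothetical.

```r
# Hypothetical data for illustration
heights = c(150, 160, 170, 180)
names(heights) = c("ann", "bob", "cat", "dan")

heights[c(1, 4)]            # by position: the entries for ann and dan
heights[c("bob", "cat")]    # by name: the entries for bob and cat
heights[c(T, F, F, T)]      # by logical vector: ann and dan again

# The logical index normally has the same length as the vector
length(heights) == length(c(T, F, F, T))   # TRUE
```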

Vectorization: R Does Operations and Most Math Functions Entrywise Automatically

c(1,2,3) + c(10,11,12) # add corresponding entries

## [1] 11 13 15

c(1,2,3) * c(10,11,12) # multiply corresponding entries

## [1] 10 22 36

c(1,2,3) / c(10,20,30) # divide corresponding entries

## [1] 0.1 0.1 0.1

5*c(1,2,3)  # scalar multiplication, in other words multiply each entry by scalar

## [1]  5 10 15

c(1,2,3)*5 # also scalar multiplication

## [1]  5 10 15

c(1,2,3) +5 # add scalar to each entry

## [1] 6 7 8

c(1,2,3,4) >2 # do the logical comparison for each entry

## [1] FALSE FALSE  TRUE  TRUE

c(1,2,1,1,5) == 1 # do the logical comparison for each entry, to find all locations of the repeated 1's

## [1]  TRUE FALSE  TRUE  TRUE FALSE

c(1,2,3,4,5) == c(8,2,3,4,8) # do the logical comparison of corresponding entries

## [1] FALSE  TRUE  TRUE  TRUE FALSE

Another aspect of vectorization is that many functions apply elementwise to vectors: squaring, square root, trig functions, exponentials, logarithms, etc.

c(1,2,3)^2

## [1] 1 4 9

sqrt(c(1,4,9))

## [1] 1 2 3

sin(c(0,pi/2,pi))

## [1] 0.000000e+00 1.000000e+00 1.224606e-16

exp(c(1,2,3))

## [1]  2.718282  7.389056 20.085537

log(c(exp(1),exp(2),exp(3))) # Aha! log means natural log, not base 10!

## [1] 1 2 3

log10(c(10^1,10^2,10^3))

## [1] 1 2 3
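Vectorization means a whole formula can be applied to every entry in one line, with no loop. As a small sketch, here is a unit conversion on made-up temperatures (the vector fahrenheit is hypothetical data).

```r
# Hypothetical temperatures in degrees Fahrenheit
fahrenheit = c(32, 50, 212)

# The conversion formula is applied to every entry at once
celsius = (fahrenheit - 32) * 5 / 9
celsius                    # 0 10 100

# Comparisons and math functions are vectorized too
celsius > 5                # FALSE TRUE TRUE
round(sqrt(celsius), 2)    # 0.00 3.16 10.00
```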

Download the Second Part of Lecture 1 In-Class Exercise File

Lecture 1 In-Class Exercises Second Part

Filtering a Vector

Recall that we have seen how to obtain a subvector from a vector using a logical vector: only the elements indexed by TRUEs are included in the subvector.

c('abc','mno','xyz')[c(TRUE,FALSE,TRUE)]

## [1] "abc" "xyz"

Recall also that we saw how to obtain a logical vector from a condition on a vector.

c(-10,10,0,40,-5) >0

## [1] FALSE  TRUE FALSE  TRUE FALSE

The combination of these two allows us to elegantly filter a vector according to a condition! Let’s find the subvector of the following vector consisting only of positive entries, and the subvector consisting only of negative entries, both using filtering.

x=c(-10,10,0,40,-5)
x[x>0]

## [1] 10 40

x[x<0]

## [1] -10  -5
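Conditions can also be combined before filtering. The sketch below assumes the entrywise operators & ("and") and | ("or"), which were not covered above, and uses made-up exam scores.

```r
# Hypothetical exam scores
scores = c(55, 72, 90, 48, 85)

passing = scores[scores >= 60]                  # 72 90 85
middle  = scores[scores >= 60 & scores < 90]    # 72 85: both conditions hold
extreme = scores[scores < 50 | scores >= 90]    # 90 48: at least one condition holds

length(passing)     # 3
```

Notice that the filtered entries keep the order they had in the original vector.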

Coercion

Recall that we said a vector in R has all of its entries of the same data type (mode). They must all be logical, or all integer, or all double, or all character. But what happens if we try to use c() to create a mixed vector?

x=c(TRUE,"Blue")
x

## [1] "TRUE" "Blue"

Wait a minute! Did the logical TRUE just become the string "TRUE"? Yes! When we try to create a mixed vector using c(), it coerces all the entries to switch to the most flexible entry type in the vector. The order from least flexible to most flexible is: logical, integer, double, character. In going from logical to integer, TRUE becomes 1 and FALSE becomes 0.

x=c(TRUE,5)
x

## [1] 1 5

To determine other behaviors experimentally, use the functions as.logical(), as.integer(), as.double(), as.character().

as.double(FALSE)

## [1] 0

as.character(FALSE)

## [1] "FALSE"
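Here are a few more experiments of this kind, as a sketch to run yourself. The values are arbitrary, and suppressWarnings() is used only to hide the warning R prints when a conversion fails.

```r
as.integer(3.9)        # 3: the decimal part is dropped, not rounded
as.integer("12")       # 12
as.logical(0)          # FALSE
as.logical(5)          # TRUE: any nonzero number becomes TRUE
as.character(5L)       # "5"

# A string that is not a number becomes NA (R also gives a warning)
suppressWarnings(as.double("abc"))   # NA

# c() picks the most flexible type present
typeof(c(TRUE, 2L))        # "integer"
typeof(c(2L, 2.5))         # "double"
typeof(c(2.5, "two"))      # "character"
```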

Coercion also happens when we apply a mathematical function to a non-numeric type. For instance, plus:

TRUE+TRUE

## [1] 2

sum(c(T,F,F,T,T)) # Finds for us the TOTAL NUMBER of TRUEs!

## [1] 3

mean(c(T,F,F,T,T)) # Finds for us the PROPORTION of TRUEs!

## [1] 0.6

Using the foregoing, coercion can be quite useful. Let’s determine the number of cars in the mtcars data set with mpg greater than 15, and the proportion of cars in the mtcars data set with mpg greater than 15.

# ?mtcars # to read about it 
# View(mtcars) # to view it
mtcars$mpg > 15

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

sum(mtcars$mpg > 15) # this is number of cars with mpg > 15

## [1] 26

mean(mtcars$mpg > 15) # this is proportion of cars with mpg > 15

## [1] 0.8125

dim(mtcars) # let's find the number of rows and columns in the data set

## [1] 32 11

mean(mtcars$mpg > 15) == 26/32 # check if it is as expected

## [1] TRUE

Another application of coercion to data analysis is the computation of the error rate of a machine learning technique. Suppose the true classes of 5 individuals in a binary classification are 1,0,0,1,1, and suppose a machine learning technique predicts the classes 1,1,1,1,1 for these 5 individuals.

Use coercion to compute the error rate. The error rate is the proportion of incorrect predictions.

actual = c(1,0,0,1,1)
prediction = c(1,1,1,1,1)
mean(prediction != actual)

## [1] 0.4

mean(prediction != actual) == 2/5 # check if it is what we expect

## [1] TRUE
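If we want to reuse this computation, one option is to wrap it in a small function. This is only a sketch: error_rate is a hypothetical helper name, not a built-in R function, and writing functions has not been covered yet. The check at the end uses all.equal(), which compares doubles up to a small tolerance and so is a safer habit than == for decimal results.

```r
# error_rate is a hypothetical helper name, not a built-in R function
error_rate = function(prediction, actual) {
  # the comparison gives a logical vector; mean() coerces TRUE to 1 and FALSE to 0
  mean(prediction != actual)
}

actual = c(1, 0, 0, 1, 1)
prediction = c(1, 1, 1, 1, 1)

err = error_rate(prediction, actual)
err                            # 0.4
1 - err                        # 0.6, the proportion of correct predictions

# all.equal() compares doubles up to a small tolerance
isTRUE(all.equal(err, 2/5))    # TRUE
```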

Other Things to Know about Vectors: Reassigning an Entry, Appending Elements, Length, More Creation Methods

Indexing can also be used to reassign an entry.

x=c(10,20)
x[2] = -55
x

## [1]  10 -55

Indexing can also be used to append entries. We can also append using c() in combination with reassignment.

x=c(10,20)
x[3]=30  # append a third entry 30 by indexing in to the not-yet-existing third index, it gets created in the process!
x

## [1] 10 20 30

x=c(x,40) # append a fourth entry 40 using c()
x

## [1] 10 20 30 40

The command length() tells us how many entries a vector has, which is useful when we don’t already know the length of a vector.

length(c(1,2,3,4))

## [1] 4
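Combining length() with indexing gives a way to append at the next free position. The sketch below (with arbitrary values) also shows what happens if we index further past the end: R fills the skipped positions with NA, the missing value.

```r
# Append at the next free position using length()
y = c(10, 20)
y[length(y) + 1] = 30
y                # 10 20 30

# Assigning beyond the next position fills the gap with NA
x = c(10, 20)
x[5] = 50        # skips indices 3 and 4
x                # 10 20 NA NA 50
length(x)        # 5
```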

To quickly create vectors of a given type with default entries, use logical(), numeric(), character().

logical(5)

## [1] FALSE FALSE FALSE FALSE FALSE

!logical(5) # the exclamation point means "not" and its application here is vectorized

## [1] TRUE TRUE TRUE TRUE TRUE

numeric(5)

## [1] 0 0 0 0 0

character(5)

## [1] "" "" "" "" ""

We can repeat a sequence as well.

rep(c(1,2,3), times=4)

##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

We can also create a sequence that increases by a fixed step.

seq(from = 1, to = 100, by = 2)

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99

seq(from = 1, by = 2, length.out = 10)

##  [1]  1  3  5  7  9 11 13 15 17 19
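A few more variations on rep() and seq() to try are sketched below. The argument each and the negative step are not discussed above, so treat these lines as things to explore with ?rep and ?seq.

```r
rep(c(1, 2, 3), each = 2)                  # 1 1 2 2 3 3: repeat each entry in turn
rep("a", times = 3)                        # "a" "a" "a"
seq(from = 0, to = 1, by = 0.25)           # 0.00 0.25 0.50 0.75 1.00
seq(from = 10, to = 1, by = -3)            # 10 7 4 1: a negative step counts down
length(seq(from = 1, to = 100, by = 2))    # 50
```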

Debugging Tips

Try briefly to understand the error message.
Check your spelling.
Check punctuation.
Check what kind of object you have, and if the function you’re using can apply to it.
Keep in mind that the error can be earlier in the code!
Code and run in small pieces to discover errors as you type.
Insert print statements to confirm each step does what you want.
Consult help documentation.
Google the problem.
Ask a friend or our discussion board, but give enough information for others to recreate the problem and diagnose it.
Regularly sleep….
Regularly do cardio exercise…
Take a break.
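Here is a small sketch of a few of these tips in action, on a made-up bug where a number was accidentally stored as a string. The use of tryCatch() is only so that this example can capture the error message and keep running; in the console you would simply read the message R prints.

```r
# Hypothetical bug: the number was stored as a string
x = "5"

# tryCatch() captures the error message instead of stopping
msg = tryCatch(x + 1, error = function(e) conditionMessage(e))
msg            # in English: "non-numeric argument to binary operator"

# Check what kind of object we have
typeof(x)      # "character"

# Fix: convert to a number first, then print to confirm the step worked
y = as.numeric(x) + 1
print(y)       # 6
```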

Thomas Fiore

May 6 and 8, 2019
