Homework

Motivational Case Study of Reproducible Research: A Reproducibility Appendix about Life Expectancy in Developing Countries

In this course, you will learn how to use R Markdown in RStudio in order to produce a complete record of your data analyses. Such a record would be a technical appendix to a report you are writing for your job or a scientific investigation. The audience for your technical appendix is yourself and anyone else interested in how you obtained the results!

Here is an example of a reproducibility appendix we discussed in class.

Introduction to RStudio

We will primarily interact with R through the integrated development environment (IDE) called RStudio.

RStudio has multiple resizable windows that interact with each other. Here are a few descriptions.

Try it out!

In the console type 2+2.

In the console type print("Hello world!").

In the console type mean(c(1,2,3)).

In the console type x=5.

In the console type ?mean.

In the console type plot(1,2).

Then look at the plot window, environment window, history window, and help window.

Introduction to R Markdown and Knitting to Html, Word, or PDF

This is how we will do non-DataCamp homework, and it is an important tool for reproducible research. With R Markdown we can provide a complete record of our code in HTML, PDF, or Word format, together with explanatory text. To do this, one writes a plain text file with the .Rmd extension and then “knits” it with the Knit button.

See Sections 27.1 - 27.4.3 of Wickham’s book R for Data Science.

Important points:

Look again at the example of a reproducibility appendix and read it (skip the long code for the graphs and the hypothesis test).

Homework 1

Recreate the PDF document handed out in class in R Markdown, knit it to HTML and Word, print it to PDF, and upload both to Canvas HW1.

Data Types of Scalars (=Length-One Vectors)

Various operations and commands we will want to use for data analysis in R only work for certain kinds of objects. A common source of bugs is attempting to apply a command to an object of the wrong type. So let’s understand the most elementary kinds of objects with data types: scalars.

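In R, a scalar is just a vector of length 1. We will deal with 4 data types of scalars (and vectors): logical, integer, double, and character. These data types are also sometimes loosely called modes, although the command mode() reports both integer and double as "numeric".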

To determine the type of an object, use the command typeof().

typeof(TRUE)
## [1] "logical"
typeof(-100L)
## [1] "integer"
typeof(-100)
## [1] "double"
typeof("Hello world!")
## [1] "character"
typeof("-100")
## [1] "character"

To ask if an object is a specific one of these types, use the commands: is.logical(), is.integer(), is.double(), is.character().

is.logical(-100)
## [1] FALSE
is.integer(-100) 
## [1] FALSE
is.double(-100) 
## [1] TRUE
is.character(-100)
## [1] FALSE
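
One pitfall worth trying out in the console: a whole number typed without the L suffix is stored as a double, so is.integer() returns FALSE for it. The following is a small sketch using only base R; the variable names a and b are illustrative and not from the lecture.

```r
a = 100L   # the L suffix asks for an integer
b = 100    # no suffix: stored as a double, even though the value is whole

typeof(a)       # "integer"
typeof(b)       # "double"
is.integer(b)   # FALSE, because is.integer() checks the storage type, not the value
is.numeric(a)   # TRUE: is.numeric() accepts both integer and double
is.numeric(b)   # TRUE
a == b          # TRUE: the values are equal even though the types differ
```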

Download the In-Class Exercise File

Lecture 1 In-Class Exercises

Comparing Scalars

Scalars are compared with ==, !=, <, >, <=, and >=. Each comparison returns a logical value, TRUE or FALSE.

These also work for comparing vectors, as we shall soon see.
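
As a quick sketch of the comparison operators on scalars (the values 3 and 4 and the variable names are chosen only for illustration), each comparison returns a single logical value:

```r
x = 3
y = 4

is_equal     = (x == y)   # FALSE: 3 is not equal to 4
is_different = (x != y)   # TRUE
at_most      = (x <= y)   # TRUE: 3 is less than or equal to 4
at_least     = (x >= y)   # FALSE
same_text    = ("a" == "a")   # TRUE: comparison also works for character scalars

typeof(is_equal)          # "logical"
```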

Assignment and Reassignment

Variables do not need to be declared in advance in R. x=10 or x<-10 both save 10 to x. The symbol = is faster to type, so I prefer that.

Notice that when we save 10 to x in the console, nothing is returned.

Variables can be reassigned at will, even with different types. For instance we could subsequently type x="Cool!" and now x has this value.

R uses “pass by value” semantics (more precisely, copy-on-modify), which makes it easier to reason about code. For instance, in the following, changing x later does not change y.

x=10
y=x
x=20
y
## [1] 10
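
The same behavior can be seen with vectors and with function arguments. The sketch below is only an illustration; set_first_to_zero is a hypothetical helper written for this example, not a built-in R function.

```r
v = c(1, 2, 3)
w = v          # w behaves as its own copy of the values
v[1] = 99      # modify v only
w              # still 1 2 3

# A function also works on its own copy of the argument.
set_first_to_zero = function(vec) {   # hypothetical helper for illustration
  vec[1] = 0
  vec                                 # return the modified copy
}
z = set_first_to_zero(w)
w              # unchanged: 1 2 3
z              # 0 2 3
```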

Creating Vectors

All elements of a vector must have the same data type (mode). Here are a few ways to create vectors.

c() applied to elements “concatenates” them

1:5 creates a vector of integers

5:1 creates a vector of integers in the opposite order

c() applied to vectors concatenates them in a flat way; it does not create a 2-level structure.

We can store a vector in a variable x using x =. We can give the entries names with

names(x)= c(a vector of names of same length)
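
The creation methods above can be tried together as follows. This is a minimal sketch; the variable names and the scores data are made up for illustration.

```r
a = c(2, 4, 6)            # concatenate elements into a vector
b = 1:5                   # integers 1 2 3 4 5
d = 5:1                   # integers 5 4 3 2 1
flat = c(a, c(8, 10))     # flat result 2 4 6 8 10, not a nested structure
length(flat)              # 5

scores = c(90, 85, 77)    # hypothetical data
names(scores) = c("ann", "bob", "cy")   # one name per entry
scores
```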

Indexing into Vectors

x=5:8
names(x)=c("p","q","r","s")
x[3]
## r 
## 7
x["r"]
## r 
## 7
x[2:4]
## q r s 
## 6 7 8
x[c(F,T,T,T)]
## q r s 
## 6 7 8

Vectorization: R Does Operations and Most Math Functions Entrywise Automatically

c(1,2,3) + c(10,11,12) # add corresponding entries
## [1] 11 13 15
c(1,2,3) * c(10,11,12) # multiply corresponding entries
## [1] 10 22 36
c(1,2,3) / c(10,20,30) # divide corresponding entries
## [1] 0.1 0.1 0.1
5*c(1,2,3)  # scalar multiplication, in other words multiply each entry by scalar
## [1]  5 10 15
c(1,2,3)*5 # also scalar multiplication 
## [1]  5 10 15
c(1,2,3) +5 # add scalar to each entry
## [1] 6 7 8
c(1,2,3,4) >2 # do the logical comparison for each entry
## [1] FALSE FALSE  TRUE  TRUE
c(1,2,1,1,5) == 1 # do the logical comparison for each entry 
## [1]  TRUE FALSE  TRUE  TRUE FALSE
# to find all locations of the repeated 1's
c(1,2,3,4,5) == c(8,2,3,4,8) # do the logical comparison of corresponding entries
## [1] FALSE  TRUE  TRUE  TRUE FALSE

Another aspect of vectorization is that many functions apply elementwise to vectors: squaring, square root, trig functions, exponentials, logarithms, etc.

c(1,2,3)^2
## [1] 1 4 9
sqrt(c(1,4,9))
## [1] 1 2 3
sin(c(0,pi/2,pi))
## [1] 0.000000e+00 1.000000e+00 1.224647e-16
exp(c(1,2,3))
## [1]  2.718282  7.389056 20.085537
log(c(exp(1),exp(2),exp(3))) # Aha! log means natural log, not base 10!
## [1] 1 2 3
log10(c(10^1,10^2,10^3))
## [1] 1 2 3
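
Vectorization means a whole formula can be applied to a data vector in one line, with no loop. As an illustrative sketch (the temperatures are made-up values), here is a conversion from Celsius to Fahrenheit:

```r
celsius = c(-40, 0, 37, 100)        # hypothetical temperatures
fahrenheit = celsius * 9/5 + 32     # the formula is applied to every entry at once
fahrenheit                          # -40.0 32.0 98.6 212.0
hot = fahrenheit > 90               # entrywise comparison
hot                                 # FALSE FALSE TRUE TRUE
```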

Download the Second Part of Lecture 1 In-Class Exercise File

Lecture 1 In-Class Exercises Second Part

Filtering a Vector

Recall that we have seen how to obtain a subvector from a vector using a logical vector: only the elements indexed by TRUEs are included in the subvector.

c('abc','mno','xyz')[c(TRUE,FALSE,TRUE)]
## [1] "abc" "xyz"

Recall also that we saw how to obtain a logical vector from a condition on a vector.

c(-10,10,0,40,-5) >0
## [1] FALSE  TRUE FALSE  TRUE FALSE

The combination of these two allows us to elegantly filter a vector according to a condition! Let’s find the subvector of the following vector consisting only of positive entries, and the subvector consisting only of negative entries, both using filtering.

x=c(-10,10,0,40,-5)
x[x>0]
## [1] 10 40
x[x<0]
## [1] -10  -5
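
Filtering works with any of the comparison operators. Here is a short sketch on the same vector; the names keep, nonzero, and big are illustrative.

```r
x = c(-10, 10, 0, 40, -5)

keep = (x != 0)      # TRUE TRUE FALSE TRUE TRUE
nonzero = x[keep]    # -10 10 40 -5: the logical vector can be stored first
big = x[x >= 10]     # 10 40: or the condition can be written inline
length(big)          # 2: how many entries passed the filter
```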

Coercion

Recall that we said a vector in R has all of its entries of the same data type (mode). They must all be logical, or all integer, or all double, or all character. But what happens if we try to use c() to create a mixed vector?

x=c(TRUE,"Blue")
x
## [1] "TRUE" "Blue"

Wait a minute! Did the logical TRUE just become the string "TRUE"? Yes! When we try to create a mixed vector using c(), it coerces all the entries to the most flexible type present in the vector. The order from least flexible to most flexible is: logical, integer, double, character. In going from logical to integer, TRUE becomes 1 and FALSE becomes 0.

x=c(TRUE,5)
x
## [1] 1 5
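
The coercion order can be checked experimentally with typeof(). The following sketch walks up the order one step at a time; the variable names are illustrative.

```r
t1 = typeof(c(TRUE, 2L))     # "integer":   logical is coerced to integer
t2 = typeof(c(2L, 3.5))      # "double":    integer is coerced to double
t3 = typeof(c(3.5, "a"))     # "character": double is coerced to character

mixed = c(FALSE, 2L, 3.5, "a")
mixed                        # "FALSE" "2" "3.5" "a": everything became character
```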

To determine other behaviors experimentally, use the functions as.logical(), as.integer(), as.double(), as.character().

as.double(FALSE)
## [1] 0
as.character(FALSE)
## [1] "FALSE"
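
A few more experiments of this kind are sketched below, assuming only base R. Note that a conversion that makes no sense returns NA together with a warning; suppressWarnings() is used here only to keep the output quiet.

```r
n1 = as.integer("12")      # 12: a string of digits converts cleanly
n2 = as.integer(3.9)       # 3: the fractional part is dropped, not rounded
b1 = as.logical("TRUE")    # TRUE
b2 = as.logical(0)         # FALSE: zero becomes FALSE, other numbers become TRUE
s1 = as.character(1.50)    # "1.5"
bad = suppressWarnings(as.double("abc"))   # NA: the string is not a number
is.na(bad)                 # TRUE
```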

Coercion also happens when we apply a mathematical function or operator to logical values. For instance, plus:

TRUE+TRUE
## [1] 2
sum(c(T,F,F,T,T)) # Finds for us the TOTAL NUMBER of TRUEs!
## [1] 3
mean(c(T,F,F,T,T)) # Finds for us the PROPORTION of TRUEs!
## [1] 0.6

Used in this way, coercion can be quite useful. Let’s determine the number of cars in the mtcars data set with mpg greater than 15, and the proportion of cars in the mtcars data set with mpg greater than 15.

# ?mtcars # to read about it 
# View(mtcars) # to view it
mtcars$mpg > 15
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
sum(mtcars$mpg > 15) # this is number of cars with mpg > 15
## [1] 26
mean(mtcars$mpg > 15) # this is proportion of cars with mpg > 15
## [1] 0.8125
dim(mtcars) # let's find the number of rows and columns in the data set
## [1] 32 11
mean(mtcars$mpg > 15) == 26/32 # check if it is as expected
## [1] TRUE

Another application of coercion to data analysis is the computation of the error rate of a machine learning technique. Suppose the true classes in a binary classification of 5 individuals are 1,0,0,1,1, and suppose a machine learning technique predicts the classes 1,1,1,1,1 for the same 5 individuals.

Use coercion to compute the error rate. The error rate is the proportion of incorrect predictions.

actual = c(1,0,0,1,1)
prediction = c(1,1,1,1,1)
mean(prediction != actual)
## [1] 0.4
mean(prediction != actual) == 2/5 # check if it is what we expect 
## [1] TRUE
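
The same idea gives the number of errors and the accuracy (the proportion of correct predictions). This is a small sketch on the same data; the names n_errors, accuracy, and error_rate are chosen for illustration.

```r
actual = c(1, 0, 0, 1, 1)
prediction = c(1, 1, 1, 1, 1)

n_errors = sum(prediction != actual)      # 2 incorrect predictions
error_rate = mean(prediction != actual)   # 0.4
accuracy = mean(prediction == actual)     # 0.6, the proportion of correct predictions
accuracy + error_rate                     # 1: every prediction is either right or wrong
```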

Other Things to Know about Vectors: Reassigning an Entry, Appending Elements, Length, More Creation Methods

Indexing can also be used to reassign an entry.

x=c(10,20)
x[2] = -55
x
## [1]  10 -55

Indexing can also be used to append entries. We can also append using c() in combination with reassignment.

x=c(10,20)
x[3]=30  # append a third entry 30 by indexing in to the not-yet-existing third index, it gets created in the process!
x
## [1] 10 20 30
x=c(x,40) # append a fourth entry 40 using c()
x
## [1] 10 20 30 40
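
One thing to watch for when appending by index: if we skip past the end of the vector, R fills the gap with NA (missing values). A short sketch:

```r
x = c(10, 20)
x[5] = 50        # indices 3 and 4 do not exist yet, so they are filled with NA
x                # 10 20 NA NA 50
length(x)        # 5
```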

The command length() tells us how many entries a vector has, which is useful when we don’t already know the length of a vector.

length(c(1,2,3,4))
## [1] 4

To quickly create vectors of a given type with default entries, use logical(), numeric(), character().

logical(5)
## [1] FALSE FALSE FALSE FALSE FALSE
!logical(5) # the exclamation point means "not" and its application here is vectorized
## [1] TRUE TRUE TRUE TRUE TRUE
numeric(5)
## [1] 0 0 0 0 0
character(5)
## [1] "" "" "" "" ""

We can repeat a sequence as well.

rep(c(1,2,3), times=4)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

We can also create a sequence with jumping by steps.

seq(from = 1, to = 100, by = 2)
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
seq(from = 1, by = 2, length.out = 10)
##  [1]  1  3  5  7  9 11 13 15 17 19
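
A few variations on rep() and seq() are sketched below, using only arguments of the base R functions; the variable names are illustrative.

```r
doubled = rep(c(1, 2, 3), each = 2)            # 1 1 2 2 3 3: repeat each entry before moving on
letters_a = rep("a", times = 3)                # "a" "a" "a": works for character too
odds = seq(from = 1, to = 100, by = 2)
length(odds)                                   # 50 odd numbers between 1 and 100
quarters = seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
```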

Debugging Tips

Log in to DataCamp

By now you should be nearly finished with the first DataCamp course, and starting the second one.