R Tutorial

Welcome to the R tutorial version of Data Science Rosetta Stone. Before beginning this tutorial, please check that you have R 3.3.1 installed; this is not required, but it is the release used to generate the following examples. Several R packages are used throughout this tutorial (for example, gdata and rjson in Section 1.1). You may not need all of them for your specific needs; they are described in more detail in Appendix Section 2.

To install R packages you need to run the following in the R console:

install.packages("name_of_package")

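Note: In R, comments are indicated in code with a “#” character, and arrays and matrices begin with index 1. Also, “<-” and “=” can be used interchangeably for assignment at the top level of a script; inside a function call, “=” names an argument instead of assigning a variable.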
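The points in the note above can be tried directly in the console. The following is a minimal sketch; the names x, y and m are hypothetical and are not used elsewhere in this tutorial.

```r
# Everything after "#" on a line is a comment and is ignored by R.
x <- c(10, 20, 30)   # assignment with "<-"
y = c(10, 20, 30)    # top-level assignment with "=" gives the same result

identical(x, y)      # TRUE: both vectors hold the same values
x[1]                 # 10: the first element is at index 1, not 0

m <- matrix(1:6, nrow = 2)   # matrices are filled column by column
m[1, 1]                      # 1: top-left element
m[2, 3]                      # 6: row 2, column 3
```

Inside a call such as matrix(1:6, nrow = 2), the “=” only names the nrow argument; it does not create a variable called nrow.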

Now let’s get started!


1 Reading in Data and Basic Statistical Functions

1.1 Read in the data.

a) Read the data in as a .csv file.

student <- read.csv('/Users/class.csv')

read.csv()
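If class.csv is not at hand, read.csv() can be tried on inline text. This is a sketch only: the three rows in csv_text are a hypothetical sample, not the contents of the real file, and demo is a hypothetical name. It also shows the stringsAsFactors argument, which matters because the default changed from TRUE to FALSE in R 4.0.0, so newer releases read text columns as character unless it is set.

```r
# Hypothetical inline data standing in for a small .csv file
csv_text <- "Name,Sex,Age,Height,Weight
Alfred,M,14,69.0,112.5
Alice,F,13,56.5,84.0
Barbara,F,13,65.3,98.0"

# text = reads from a character string instead of a file path.
# stringsAsFactors = TRUE reproduces the R 3.x default behaviour.
demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(demo$Name)    # "factor"
class(demo$Age)     # "integer"
class(demo$Height)  # "numeric"
```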

b) Read the data in as a .xls file.

# call the gdata package
library(gdata)

student_xls <- read.xls('/Users/class.xls', 1)

gdata | read.xls()

c) Read the data in as a .json file.

Reading a .json file into R takes more code, because the parsed result must be converted into a proper data frame. The code below is specific to one .json layout, so you may have to change it to fit your needs.

# call the rjson package
library(rjson)

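# Parse the file into a list (assuming it holds one JSON object per record)
temp <- fromJSON(file = '/Users/class.json')
# Stack the records row-wise; each cell of the result is still a list element
temp <- do.call('rbind', temp)
temp <- data.frame(temp, stringsAsFactors = TRUE)
# Flatten each list column into an ordinary vector
temp <- transform(temp, Name=unlist(Name), Sex=unlist(Sex), Age=unlist(Age),
                  Height=unlist(Height), Weight=unlist(Weight))
# Set the column types to match the .csv version of the data
temp$Name <- as.factor(temp$Name)
temp$Sex <- as.factor(temp$Sex)
temp$Age <- as.integer(temp$Age)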

student_json <- temp

rjson | fromJSON()

1.2 Find the dimensions of the data set.

The shape of an R data frame is available by calling the dim() function, with the data name as an argument.

dim(student)
## [1] 19  5
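The two numbers returned by dim() are the row count followed by the column count. As a small sketch on a hypothetical two-column data frame (demo is not part of the tutorial data), nrow() and ncol() return each value separately:

```r
# Hypothetical data frame with 3 rows and 2 columns
demo <- data.frame(Name = c("Alfred", "Alice", "Barbara"),
                   Age  = c(14L, 13L, 13L))

dim(demo)      # 3 2: rows first, then columns
nrow(demo)     # 3
ncol(demo)     # 2
dim(demo)[1]   # 3: the same row count, taken by index from dim()
```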

1.3 Find basic information about the data set.

Information about an R data frame is available by calling the str() function, with the data name as an argument.

str(student)
## 'data.frame':    19 obs. of  5 variables:
##  $ Name  : Factor w/ 19 levels "Alfred","Alice",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Sex   : Factor w/ 2 levels "F","M": 2 1 1 1 2 2 1 1 2 2 ...
##  $ Age   : int  14 13 13 14 14 12 12 15 13 12 ...
##  $ Height: num  69 56.5 65.3 62.8 63.5 57.3 59.8 62.5 62.5 59 ...
##  $ Weight: num  112 84 98 102 102 ...

1.4 Look at the first 5 (last 5) observations.

The first observations of a data frame are available by calling the head() function, with the data name as an argument. By default, head() returns 6 observations, but the n argument changes this; n=5 returns 5 observations, as shown below. The tail() function is analogous and returns the last observations.

head(student, n=5)
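Since tail() takes the same n argument, the last 5 observations can be requested in the same way. The sketch below uses a hypothetical 10-row data frame (demo) so that it runs without the class data.

```r
# Hypothetical data frame with 10 rows
demo <- data.frame(id = 1:10, value = (1:10)^2)

head(demo)            # no n given: the default of 6 rows is returned
tail(demo, n = 5)     # the last 5 rows (id 6 to 10)
head(demo, n = -2)    # a negative n drops rows: all but the last 2
```

For the tutorial data the equivalent call would be tail(student, n=5).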

1.5 Calculate means of numeric variables.

# sapply() applies the is.numeric() function to every column of the data set
# and returns a logical vector with one TRUE/FALSE value per column, which
# we then use to subset the data set to only the numeric variables

# Then we can use the colMeans() function to return the means of 
# column variables
colMeans(student[sapply(student, is.numeric)])
##       Age    Height    Weight
##  13.31579  62.33684 100.02632

colMeans() | sapply() | is.numeric
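It can help to look at the intermediate result of sapply() before passing it to colMeans(). The following sketch uses a hypothetical two-row data frame (demo) and stores the logical vector under the hypothetical name is_num.

```r
# Hypothetical data frame with one factor column and two numeric columns
demo <- data.frame(Name   = c("Alfred", "Alice"),
                   Age    = c(14L, 13L),
                   Height = c(69, 56.5),
                   stringsAsFactors = TRUE)

# One logical value per column: FALSE for Name, TRUE for Age and Height
is_num <- sapply(demo, is.numeric)
is_num

# Keep only the columns flagged TRUE, then average each one
colMeans(demo[is_num])   # Age 13.5, Height 62.75
```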

1.6 Compute summary statistics of the data set.

Summary statistics of a data frame are available by calling the summary() function, with the data name as an argument.

summary(student)
##       Name    Sex         Age            Height          Weight
##  Alfred : 1   F: 9   Min.   :11.00   Min.   :51.30   Min.   : 50.50
##  Alice  : 1   M:10   1st Qu.:12.00   1st Qu.:58.25   1st Qu.: 84.25
##  Barbara: 1          Median :13.00   Median :62.80   Median : 99.50
##  Carol  : 1          Mean   :13.32   Mean   :62.34   Mean   :100.03
##  Henry  : 1          3rd Qu.:14.50   3rd Qu.:65.90   3rd Qu.:112.25
##  James  : 1          Max.   :16.00   Max.   :72.00   Max.   :150.00
##  (Other):13

1.7 Descriptive statistics functions applied to columns of the data set.

# Notice the subsetting of student with the "$" character 
sd(student$Weight)
## [1] 22.77393
sum(student$Weight)
## [1] 1900.5
length(student$Weight)
## [1] 19
max(student$Weight)
## [1] 150
min(student$Weight)
## [1] 50.5
median(student$Weight)
## [1] 99.5
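The functions above return NA if the column contains a missing value. The sketch below uses a hypothetical vector w (not taken from the class data) to show the na.rm argument, which mean(), sd(), sum(), max(), min() and median() all accept.

```r
# Hypothetical weights with one missing value
w <- c(112.5, 84, 98, NA)

mean(w)                 # NA: the missing value propagates
mean(w, na.rm = TRUE)   # mean of the three observed values
sd(w, na.rm = TRUE)     # standard deviation of the observed values
sum(!is.na(w))          # 3: number of non-missing observations

# length() counts every element, including the NA
length(w)               # 4
```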

1.8 Produce a one-way table to describe the frequency of a variable.

a) Produce a one-way table of a discrete variable.