Welcome to the R tutorial version of Data Science Rosetta Stone. Before beginning this tutorial, please check that you have R 3.3.1 installed (this is not required, but it was the release used to generate the examples). Several R packages are used throughout this tutorial. You may not need all of them for your specific needs, but they are listed below, and in more detail in Appendix Section 2, for your information:
To install R packages you need to run the following in the R console:
install.packages("name_of_package")
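As a minimal sketch, the snippet below checks which packages are missing before installing anything. The helper name `missing_packages` is hypothetical (it is not part of R or of this tutorial); `gdata` and `rjson` are the two packages called in this section.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("gdata", "rjson"))
print(to_install)

# To install whatever is missing, pass the result to install.packages():
# install.packages(to_install)
```

The install call is left commented out so the check can be run safely without an internet connection.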
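Note: In R, comments are indicated in code with a “#” character, and vectors, arrays, and matrices begin with index 1. Also, “<-” and “=” can be used interchangeably for assignment at the top level, although inside a function call “=” names an argument rather than assigning a variable.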
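The short sketch below illustrates these conventions. All variable names (`x`, `m`, `a`, `b`, `y`) are chosen for illustration only.

```r
# Indexing starts at 1
x <- c(10, 20, 30)
x[1]                     # 10

m <- matrix(1:6, nrow = 2)
m[1, 1]                  # 1

# At the top level, "<-" and "=" both assign
a <- 5
b = 5
identical(a, b)          # TRUE

# Inside a function call, "=" names an argument and leaves the workspace alone
median(x = c(1, 2, 3))
x                        # still 10 20 30

# "<-" inside a call assigns in the workspace as a side effect
median(y <- c(1, 2, 3))
y                        # 1 2 3
```

Many R style guides prefer “<-” for assignment and reserve “=” for function arguments, which avoids the ambiguity shown in the last two calls.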
Now let’s get started!
# call the gdata package
library(gdata)
student_xls <- read.xls('/Users/class.xls', 1)
Reading a .json file into R so that it becomes a proper data frame takes more code. This code is also specific to one particular .json layout, so you may have to change it to fit your needs.
# call the rjson package
library(rjson)
temp <- fromJSON(file = '/Users/class.json')
temp <- do.call('rbind', temp)
temp <- data.frame(temp, stringsAsFactors = TRUE)
temp <- transform(temp, Name=unlist(Name), Sex=unlist(Sex), Age=unlist(Age),
Height=unlist(Height), Weight=unlist(Weight))
temp$Name <- as.factor(temp$Name)
temp$Sex <- as.factor(temp$Sex)
temp$Age <- as.integer(temp$Age)
student_json <- temp
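To see why the unlist() calls are needed, the sketch below runs the same steps on a small in-memory list instead of a file. The two records are hypothetical illustration data, built by hand on the assumption that the parsed .json file is a list with one named list per student; no JSON package is required to run it.

```r
# Hypothetical records, shaped like a parsed JSON array of objects
records <- list(
  list(Name = "Alfred", Sex = "M", Age = 14, Height = 69.0, Weight = 112.5),
  list(Name = "Alice",  Sex = "F", Age = 13, Height = 56.5, Weight = 84.0)
)

# Stack the records: the result is a matrix whose cells are list elements
demo <- do.call("rbind", records)

# Converting to a data frame keeps every column as a list
demo <- data.frame(demo, stringsAsFactors = TRUE)
is.list(demo$Age)        # TRUE

# unlist() flattens each list column into an ordinary vector
demo <- transform(demo, Name = unlist(Name), Sex = unlist(Sex),
                  Age = unlist(Age), Height = unlist(Height),
                  Weight = unlist(Weight))

# Set the column types explicitly, as in the code above
demo$Name <- as.factor(demo$Name)
demo$Sex <- as.factor(demo$Sex)
demo$Age <- as.integer(demo$Age)
str(demo)
```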
The shape of an R data frame is available by calling the dim() function, with the data name as an argument.
dim(student)
## [1] 19 5
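dim() returns the number of rows first and the number of columns second. A minimal sketch on a small hypothetical data frame (`df` and its columns are made up for illustration), along with the related nrow() and ncol() functions:

```r
# Hypothetical data frame with 3 rows and 2 columns
df <- data.frame(id = 1:3, score = c(90.5, 82.0, 77.25))

dim(df)     # rows, then columns
nrow(df)    # number of rows only
ncol(df)    # number of columns only
```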
Information about an R data frame is available by calling the str() function, with the data name as an argument.
str(student)
## 'data.frame': 19 obs. of 5 variables:
## $ Name : Factor w/ 19 levels "Alfred","Alice",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 1 1 1 2 2 1 1 2 2 ...
## $ Age : int 14 13 13 14 14 12 12 15 13 12 ...
## $ Height: num 69 56.5 65.3 62.8 63.5 57.3 59.8 62.5 62.5 59 ...
## $ Weight: num 112 84 98 102 102 ...
The first observations of a data frame are available by calling the head() function, with the data name as an argument. By default, head() returns 6 observations, but we can request 5 with the n argument, as shown below. The tail() function is analogous and returns the last observations.
head(student, n=5)
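A quick sketch of the default and of the n argument, using a hypothetical 10-row data frame (`df` is not part of the tutorial data):

```r
# Hypothetical data frame with 10 rows
df <- data.frame(n = 1:10, sq = (1:10)^2)

nrow(head(df))     # default number of rows returned by head()
head(df, n = 5)    # first 5 rows
tail(df, n = 3)    # last 3 rows
```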
# sapply() applies is.numeric() to each column of the data set and returns
# a logical vector, which we then use to subset the data set so that it
# contains only numeric variables
# Then we can use the colMeans() function to return the means of
# those columns
colMeans(student[sapply(student, is.numeric)])
## Age Height Weight
## 13.31579 62.33684 100.02632
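The two steps can be separated to make the intermediate result visible. This is a sketch on a hypothetical three-row data frame; `df`, `is_num`, and `col_means` are illustrative names.

```r
# Hypothetical data frame with one character and two numeric columns
df <- data.frame(name = c("a", "b", "c"),
                 age = c(12L, 14L, 13L),
                 height = c(60, 62.5, 64))

# Step 1: one TRUE/FALSE per column, named after the columns
is_num <- sapply(df, is.numeric)
is_num

# Step 2: keep only the numeric columns, then average each one
col_means <- colMeans(df[is_num])
col_means
```

Calling colMeans(df) directly would stop with an error, because the name column is not numeric.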
Summary statistics of a data frame are available by calling the summary() function, with the data name as an argument.
summary(student)
## Name Sex Age Height Weight
## Alfred : 1 F: 9 Min. :11.00 Min. :51.30 Min. : 50.50
## Alice : 1 M:10 1st Qu.:12.00 1st Qu.:58.25 1st Qu.: 84.25
## Barbara: 1 Median :13.00 Median :62.80 Median : 99.50
## Carol : 1 Mean :13.32 Mean :62.34 Mean :100.03
## Henry : 1 3rd Qu.:14.50 3rd Qu.:65.90 3rd Qu.:112.25
## James : 1 Max. :16.00 Max. :72.00 Max. :150.00
## (Other):13
# Notice the subsetting of student with the "$" character
sd(student$Weight)
## [1] 22.77393
sum(student$Weight)
## [1] 1900.5
length(student$Weight)
## [1] 19
max(student$Weight)
## [1] 150
min(student$Weight)
## [1] 50.5
median(student$Weight)
## [1] 99.5
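These functions are related: the mean is the sum divided by the length (here 1900.5 / 19, which is about 100.026 and matches the colMeans() result above), and sd() returns the sample standard deviation, which divides by n - 1 rather than n. A minimal sketch that checks both relationships on a hypothetical weight vector `w`:

```r
# Hypothetical weights, for illustration only
w <- c(50.5, 84, 99.5, 112.5, 150)

# Mean computed from sum() and length()
mean_by_hand <- sum(w) / length(w)

# Sample standard deviation with the n - 1 denominator
sd_by_hand <- sqrt(sum((w - mean(w))^2) / (length(w) - 1))

all.equal(mean_by_hand, mean(w))
all.equal(sd_by_hand, sd(w))
```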