Chapter 1 R Basics

1.1 Vectors

1.1.1 Creating Vectors

# basic vector
v <- c(2,3,8,5)

#repeated values
rep(c(1,2),5) 

#sequence
seq(4,7) #inclusive, and by default counts by 1
#shorthand
4:7

#custom length or increments
seq(0, 10, by = 2)
seq(0, 3, length.out = 7)

1.1.2 Accessing Vectors

myData <- c(2,3,8,5)

#element wise
myData[2] #selects second entry

myData[-3] #excludes third element

#can also pass in vectors to index it
myData[c(1,4)]
myData[2:4]

1.1.3 Vector Functions

Essential Functions

  • mean()
  • sd()
  • var()
  • max()
  • min()
  • median()
  • range()
  • quantile()
  • cumsum()
  • sum()
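
As a quick illustration, the sketch below applies a few of these functions to the vector v from Section 1.1.1. The commented values follow directly from its four entries.

```r
v <- c(2, 3, 8, 5)

mean(v)           # 4.5
median(v)         # 4 (average of the two middle values, 3 and 5)
range(v)          # 2 8 (minimum and maximum)
sum(v)            # 18
cumsum(v)         # 2 5 13 18 (running total)
var(v)            # 7 (sample variance, divides by n - 1)
sd(v)             # square root of var(v)
quantile(v, 0.5)  # the 50% quantile, equal to the median here
```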

Other Functions

  • sort() - sorts a vector (alphabetically or by increasing size when numerical)
  • rank() - provides the rank of each element
  • order() - gives the indices of the elements in order
  • unique() - returns just the unique values in the vector
  • length() - total number of elements in the vector
  • paste() - converts each element in the vector to a string (and can join strings together)
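
A minimal sketch of how these functions differ, using a small made-up vector x with one tied value (the vector is an assumption for illustration, not from the notes):

```r
x <- c(30, 10, 20, 10)

sort(x)                   # 10 10 20 30
order(x)                  # 2 4 3 1: x[2] is smallest, then x[4], x[3], x[1]
rank(x)                   # 4.0 1.5 3.0 1.5: tied values share the average rank
unique(x)                 # 30 10 20
length(x)                 # 4
paste(x)                  # "30" "10" "20" "10"
paste(x, collapse = "-")  # "30-10-20-10"
```

Note that x[order(x)] gives the same result as sort(x).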

Particularly Interesting Functions

  • sample() - randomly sample from the elements of a vector
sample(c(3,7,9,23,45), 3, replace = FALSE)
#sample from vector, choosing n=3 without replacement
  • table() - provide counts of the occurrence of each element
table(sample(1:6, 200, replace = TRUE))
  • is.na() - gives a TRUE/FALSE vector as the output, checking if each entry is NA
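
For instance, is.na() might be used as follows on a made-up vector y containing two missing values (y is an assumption for illustration):

```r
y <- c(4, NA, 7, NA, 1)

is.na(y)       # FALSE TRUE FALSE TRUE FALSE
sum(is.na(y))  # 2 missing entries (each TRUE counts as 1)
y[!is.na(y)]   # 4 7 1, the non-missing entries
```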

1.2 Data Frames

Structure and viewing data frames

To show the structure of a variable, data frame, list, etc., run str(x).

Run View(mydata) to see the data frame in spreadsheet view.

1.2.1 Creating Data Frames

Manually

hw <- data.frame(Height = c(147, 150, 152),
                Weight = c(52.2, 53.1, 54.4))

Load from *.csv file

If in working directory,

hw <- read.csv("hw.csv")

Can also use the RStudio “Import Dataset” button in the Environment tab (top right in the default layout)

Built in Data Sets

Can use data frames built into R, for example:

data("cars")

Run data() to see all data sets

1.2.2 Accessing Data Frames

  • Single column (alternative ways):
    • will collapse Data Frame column into vector
hw$Height
hw[,1]
    • keeps the data frame structure, retaining only 1 column
hw[,1, drop = FALSE]
  • Single element: hw$Height[3] or hw[3,1]
  • Single row: hw[3,] (similar to single column)

Full access

df[ r , c ]

  • r can be empty, an integer between 1 and the number of rows, a vector of integers, or a vector of TRUE/FALSE of length equal to the number of rows (must be exact length).
  • c can be empty, an integer between 1 and the number of columns, a vector of integers, a vector of TRUE/FALSE of length equal to the number of columns (must be exact length), or a vector of strings with column/variable names.

Examples (first 3 equivalent)

wq.red$pH
wq.red[, 9]
wq.red[, "pH"]

wq.red[, c(9,11)]
wq.red[, c("pH","alcohol")]
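
The same df[ r , c ] patterns can be tried on the small hw data frame from Section 1.2.1. This is a sketch only; the row and column choices are arbitrary.

```r
hw <- data.frame(Height = c(147, 150, 152),
                 Weight = c(52.2, 53.1, 54.4))

hw[2, ]                       # row 2, all columns
hw[c(1, 3), "Weight"]         # rows 1 and 3 of Weight: 52.2 54.4
hw[c(TRUE, FALSE, TRUE), ]    # logical row index: keeps rows 1 and 3
hw[, c(FALSE, TRUE)]          # logical column index: Weight, collapsed to a vector
hw[, "Height", drop = FALSE]  # one column, kept as a data frame
```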

1.2.2.1 Advanced Query - Accessing a Subset of a Data Frame

Examples

  • All rows where “Weight” is greater than 50 (every column is kept)
hw[hw$Weight>50,]
  • Both columns, 4 random rows (using sample())
hw[sample(1:nrow(hw), 4),]

  • Removing NA values
movies[!is.na(movies$budget),] #keeps only the rows where budget is not NA

Note: ! is the logical NOT operator, and is.na() is a function that acts on a vector, giving a TRUE/FALSE vector as the output.

  • extra example
wq.red[wq.red$pH>3 & wq.red$density<1, "alcohol"]
#wq.red$pH>3 is a vector of TRUE/FALSE
#wq.red$density<1 is a vector of TRUE/FALSE
#& takes the element-wise logical AND of the two vectors
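
A runnable sketch of the same idea, combining an NA check with a condition. The films data frame and its values are made up for illustration.

```r
# Hypothetical data, invented for this example
films <- data.frame(title  = c("A", "B", "C", "D"),
                    budget = c(10, NA, 25, 40),
                    rating = c(6.1, 7.3, 8.0, 5.2),
                    stringsAsFactors = FALSE)

films$budget > 20     # FALSE NA TRUE TRUE
!is.na(films$budget)  # TRUE FALSE TRUE TRUE

# Both conditions must hold; FALSE & NA evaluates to FALSE
keep <- !is.na(films$budget) & films$budget > 20
films[keep, "title"]  # "C" "D"
```

Indexing with films$budget > 20 alone would return an extra row of NA values for the film with the missing budget, which is why the is.na() check is included.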

1.2.3 Interrogating Data Frames

Data Frame Functions - information

  • names() - column names
  • dim() - number of rows and columns
  • nrow() - number of rows
  • ncol() - number of columns
  • head() - useful for large data, just shows top rows (can add extra parameter to specify how many rows show up)
  • str() - shows details about type of data

Data Frame Functions - interesting data

  • colMeans()
  • rowMeans()
  • colSums()
  • rowSums()
  • cov() - covariance matrix
  • cor() - correlation matrix
  • scale() - centres each column at 0 and scales it to have standard deviation 1 (both steps have optional arguments available)
  • summary() - gives all major statistics for each variable (column)
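
A short sketch of several of these functions on the built-in cars data set (two numeric columns, speed and dist):

```r
data("cars")    # built-in data set

dim(cars)       # 50 2
names(cars)     # "speed" "dist"
head(cars, 3)   # first 3 rows only
colMeans(cars)  # mean of each column
cor(cars)       # 2 x 2 correlation matrix with 1s on the diagonal

scaled <- scale(cars)  # centre and scale each column
colMeans(scaled)       # both approximately 0
apply(scaled, 2, sd)   # both 1
```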

Data Frame (column/row) Functions - sorting and ordering

(optional decreasing = TRUE argument)

  • sort(my.data$var) - sorts a variable but only outputs that column as a sorted vector
  • order(my.data$var) - outputs a vector of the indices that would put the variable in sorted order

Sorting a Data Frame by a Column

  • my.data[order(my.data$var),] - sorts whole data frame according to var column
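
As a sketch, here is the pattern applied to a version of the hw data frame whose rows have been deliberately shuffled (the row order is an assumption for illustration):

```r
hw <- data.frame(Height = c(150, 147, 152),
                 Weight = c(53.1, 52.2, 54.4))

order(hw$Height)                           # 2 1 3
hw[order(hw$Height), ]                     # rows from shortest to tallest
hw[order(hw$Height, decreasing = TRUE), ]  # rows from tallest to shortest
sort(hw$Height, decreasing = TRUE)         # 152 150 147 (a vector only)
```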

1.2.4 Manipulating Data Frames

Creating/Adding Variables to a Data Frame

Reference a variable that doesn’t exist and just assign it to something. For example,

hw$BMI <- hw$Weight/(hw$Height/100)^2

1.2.4.1 Merging Data Frames

rbind(,) and cbind(,)

Note: these can be very error prone and it is often better to use the tidyverse (unless, with rbind(), the variables are identical and in the same order or, with cbind(), the observations are in the same order)

  • rbind() pastes rows together (above/below)
  • cbind() pastes columns together (left/right)

If two data frames have the same column names, rbind(,) will stack the rows to make one data frame. As an example,

test1 <- data.frame(Col1 = c(1,5,9,2),
                    Col2 = c(6,9,8,3))
test2 <- data.frame(Col2 = c(5,9,0,1),
                    Col1 = c(4,9,3,0))

rbind(test1, test2)
##   Col1 Col2
## 1    1    6
## 2    5    9
## 3    9    8
## 4    2    3
## 5    4    5
## 6    9    9
## 7    3    0
## 8    0    1

(notice how the columns were matched by name not order)
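
For comparison, a minimal cbind() sketch. The two data frames here are made up and are assumed to list the same observations in the same order, since cbind() does no matching of its own.

```r
# Hypothetical data frames with rows in the same order
ids    <- data.frame(ID = c(1, 2, 3))
scores <- data.frame(Score = c(90, 75, 82))

combined <- cbind(ids, scores)  # pastes the columns side by side
dim(combined)                   # 3 2
names(combined)                 # "ID" "Score"
```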

1.2.5 Missing Data in Data Frames

1.2.5.1 Importing Data with Missing Values

When using read.csv("mydata.csv"), the additional argument na.strings=c(...) can be added so that any entry matching a string in the vector is replaced by <NA>.

As an example,

read.csv("carsdata.csv", na.strings = c("", "na"))

would import the carsdata.csv file as a data frame, with all entries that are empty "" or "na" replaced by the appropriate <NA> tag.
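
To try this without an existing file, the sketch below first writes a small made-up CSV to a temporary location. The file contents and the name carsdata are assumptions for illustration.

```r
# Write a small hypothetical file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("model,mpg,colour",
             "A,30,red",
             "B,na,",
             "C,25,blue"), path)

carsdata <- read.csv(path, na.strings = c("", "na"))
is.na(carsdata$mpg)     # FALSE TRUE FALSE
is.na(carsdata$colour)  # FALSE TRUE FALSE
class(carsdata$mpg)     # "integer": the "na" entry no longer forces a text column
```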

Ignoring NA values - na.rm = TRUE

na.rm = TRUE argument ignores all <NA> values when performing the function. As an example,

mean(mydata$var, na.rm = TRUE)

which is equivalent to

mean(na.omit(mydata$var))
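
A quick check of this equivalence on a made-up vector z (an assumption for illustration):

```r
z <- c(3, NA, 9)

mean(z)                # NA: one missing value makes the result missing
mean(z, na.rm = TRUE)  # 6
mean(na.omit(z))       # 6
sum(z, na.rm = TRUE)   # 12
```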

1.2.5.2 Remove Observations with <NA> Entry (na.omit())

Example,

na.omit(chickwts$weight)
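
na.omit() can also be applied to a whole data frame, where it drops every row containing at least one NA. A sketch on a small made-up data frame (df and its values are assumptions):

```r
# Hypothetical data frame with missing entries in both columns
df <- data.frame(a = c(1, NA, 3),
                 b = c("x", "y", NA))

na.omit(df)         # keeps only row 1, the only complete row
nrow(na.omit(df))   # 1
complete.cases(df)  # TRUE FALSE FALSE
```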

1.3 Lists

x <- list(1, "a", c(1,2,3), data.frame(a = 1:3, b = 4:6))
  • can have a list of any type of variable
  • can be good for hierarchical and tree structures, Nesting is permitted (i.e. lists can contain lists)
  • variables in the list can also be named
x <- list(bob = 1, jill = "a", jack = c(1,2,3),
          eve = data.frame(a = 1:3, b = 4:6))
x$eve$a
## [1] 1 2 3

1.3.1 Accessing lists

x[1] #accesses an element of a list (returns a list of one element)
x[[1]] #strips away one level of hierarchy, list structure is gone
x$bob #equivalent to x[["bob"]] when items in list are named
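
The difference between single and double brackets can be seen by checking the class of each result. A sketch using a shortened version of the named list above:

```r
x <- list(bob = 1, jill = "a", jack = c(1, 2, 3))

class(x["jack"])    # "list": still a list, of length 1
class(x[["jack"]])  # "numeric": the vector itself
x[["jack"]][2]      # 2
x$jack[2]           # 2, same as the line above
x[[3]]              # 1 2 3, access by position
```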

1.4 Data Types

  • Numeric
  • Logical (TRUE/FALSE)
  • Categorical (called factors in R), could be ordered (e.g. credit rating) or could not be (e.g. eye colour)
  • Date/Time
  • Text or String
  • Others (e.g. image, spatial, audio, video)

1.4.1 Categorical Data (Factors)

1.4.1.1 Making Data Frames Categorical

R cannot tell the difference between factors and strings when importing data frames, so we must “tidy” them up after importing.

By default (since R 4.0.0), all non-numeric variables are imported as strings. To clean up a data frame, convert the strings in a variable to a factor:

eyesDF <- data.frame(name=c("anne","john","charlie","sarah","max","ellie","eve"),
                     eyeColour=c("blue","green","brown","brown","blue","blue","brown"))

#changes a variable to be categorical (a factor)
eyesDF$eyeColour <- as.factor(eyesDF$eyeColour)

summary(eyesDF$eyeColour)
##  blue brown green 
##     3     3     1

Treat data as a factor or a text?

  • Treat as text not factor when every observation is unique (e.g. surname)
  • When some text is coming up very often it may be more appropriate as a factor

1.4.1.2 Making Vectors Categorical

We can create a factor vector by creating the entries as strings and then applying the factor() function.

For example,

eye.colour <- factor(c("blue","brown","green","blue","blue","brown","green","blue","blue","green","blue","green","blue","blue","brown","green","brown","brown","green"))
summary(eye.colour)
##  blue brown green 
##     8     5     6
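
Section 1.4 notes that categorical data can also be ordered. One way to represent this is factor() with the levels and ordered arguments. The sketch below uses made-up rating values.

```r
# Hypothetical ratings; levels are listed from lowest to highest
rating <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

levels(rating)   # "low" "medium" "high"
rating < "high"  # TRUE FALSE TRUE TRUE
max(rating)      # high
table(rating)    # low 2, medium 1, high 1
```

Without the levels argument, the levels would be sorted alphabetically ("high", "low", "medium"), which is usually not the intended order.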

Example Using Factors

The following data frame has a factor variable (feed)

data("chickwts")
head(chickwts)
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
summary(chickwts)
##      weight             feed   
##  Min.   :108.0   casein   :12  
##  1st Qu.:204.5   horsebean:10  
##  Median :258.0   linseed  :12  
##  Mean   :261.3   meatmeal :11  
##  3rd Qu.:323.5   soybean  :14  
##  Max.   :423.0   sunflower:12

We can keep only the rows with specific levels,

chickwts[chickwts$feed %in% c("sunflower","linseed"),]

Number of levels (distinct categories) and names of levels

nlevels(chickwts$feed)
## [1] 6
levels(chickwts$feed)
## [1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"

unclass() replaces each entry with the integer code of its level (its position in levels())

unclass(chickwts$feed)
##  [1] 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6
## [39] 6 6 6 6 6 6 6 6 6 6 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1
## attr(,"levels")
## [1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"

1.5 Other

Packages

  • Installing a package: install.packages("dplyr")
  • Loading a package: library("dplyr")

Loops

for(i in x) {
  #i iterates over the values in vector x
}
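
As a minimal concrete example, the loop below adds up the entries of a vector one at a time (the variable name total is an assumption for illustration):

```r
total <- 0
for (i in c(2, 3, 8, 5)) {
  total <- total + i  # i takes the values 2, 3, 8, 5 in turn
}
total  # 18, the same as sum(c(2, 3, 8, 5))
```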

Functions

myFunction <- function(arg1, arg2 = 1) { 
  ...
  return(...)
}
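
A filled-in sketch of this template. The function raiseTo and its arguments are hypothetical names chosen for illustration; power has a default value of 1, so it can be left out of the call.

```r
# Hypothetical function: raises x to the given power (default 1)
raiseTo <- function(x, power = 1) {
  return(x^power)
}

raiseTo(3)                 # 3, uses the default power = 1
raiseTo(3, 2)              # 9
raiseTo(power = 3, x = 2)  # 8, named arguments can be given in any order
```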

Functions acting on vectors - sapply()

sapply(X, FUN)

Applies function FUN to each element of vector X
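
A few example calls, as a sketch. The function passed in can be built-in or written inline, and the input can also be a list.

```r
sapply(c(1, 4, 9), sqrt)                  # 1 2 3
sapply(c(1, 2, 3), function(n) n^2 + 1)   # 2 5 10
sapply(list(a = 1:3, b = 4:6), mean)      # a = 2, b = 5 (names are kept)
```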

Logical Operators

Logical operators act element-wise on pairs of vectors (of the same size) of TRUE and FALSE values. AND is &, OR is |, and NOT (which acts on a single vector) is !.
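
A small truth-table sketch with two made-up logical vectors a and b:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b  # TRUE FALSE FALSE FALSE
a | b  # TRUE TRUE TRUE FALSE
!a     # FALSE FALSE TRUE TRUE
```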