Chapter 1 R Basics
1.1 Vectors
1.1.1 Creating Vectors
# basic vector
v <- c(2,3,8,5)
#repeated values
rep(c(1,2),5)
#sequence
seq(4,7) #inclusive, and by default counts by 1
#shorthand
4:7
#custom length or increments
seq(0, 10, by = 2)
seq(0, 3, length.out = 7)
1.1.2 Accessing Vectors
myData <- c(2,3,8,5)
#element wise
myData[2] #selects second entry
myData[-3] #excludes third element
#can also pass in vectors to index it
myData[c(1,4)]
myData[2:4]
1.1.3 Vector Functions
Essential Functions
mean()
sd()
var()
max()
min()
median()
range()
quantile()
cumsum()
sum()
Other Functions
sort() - sorts a vector (alphabetically or by increasing size when numerical)
rank() - provides the rank of each element
order() - gives the indices of the elements in order
unique() - returns just the unique values in the vector
length() - total number of elements in the vector
paste() - makes each element in the vector a string
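The difference between sort(), rank() and order() is easiest to see side by side. A minimal sketch, assuming a small made-up vector x (not part of the notes above):

```r
# Hypothetical example vector
x <- c(30, 10, 20)

sort(x)      # values in increasing order: 10 20 30
rank(x)      # rank of each element, in its original position: 3 1 2
order(x)     # indices that would put x in order: 2 3 1
x[order(x)]  # indexing by order(x) gives the same result as sort(x)
```
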
Particularly Interesting Functions
sample() - randomly sample from the elements of a vector
sample(c(3,7,9,23,45), 3, replace = FALSE)
#sample from vector, choosing n=3 without replacement
table() - provide counts of the occurrence of each element
table(sample(1:6, 200, replace = TRUE))
is.na() - gives a TRUE/FALSE vector as the output, checking whether each entry is NA
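As a small deterministic illustration of table() and is.na(), here is a sketch using made-up vectors (rolls and y are assumptions for illustration only):

```r
# Hypothetical vectors
rolls <- c(1, 3, 3, 6, 3)
table(rolls)    # counts per value: 1 appears once, 3 three times, 6 once

y <- c(4, NA, 7, NA)
is.na(y)        # FALSE TRUE FALSE TRUE
sum(is.na(y))   # number of missing entries: 2
y[!is.na(y)]    # keeps only the non-missing entries: 4 7
```
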
1.2 Data Frames
Structure and viewing data frames
To show the structure of a variable, data frame, list, etc., run str(x).
Run View(mydata) to see the data in a spreadsheet view.
1.2.1 Creating Data Frames
Manually
hw <- data.frame(Height = c(147, 150, 152),
                 Weight = c(52.2, 53.1, 54.4))
Load from *.csv file
If in working directory,
hw <- read.csv("hw.csv")
Can also use the RStudio “Import Dataset” button in the Environment tab (top right in the default layout)
Built in Data Sets
Can use data frames built into R, for example:
data("cars")
Run data() to see all data sets
1.2.2 Accessing Data Frames
- Single column (alternative ways):
- will collapse Data Frame column into vector
hw$Height
hw[,1]
- keeps data frame structure only keeping 1 column
hw[,1, drop = FALSE]
- Single element: hw$Height[3] or hw[3,1]
- Single row: hw[3,] (similar to single column)
Full access
df[ r , c ]
- r can be empty, an integer between 1 and the number of rows, a vector of integers, or a vector of TRUE/FALSE of length equal to the number of rows (must be exact length).
- c can be empty, an integer between 1 and the number of columns, a vector of integers, a vector of TRUE/FALSE of length equal to the number of columns (must be exact length), or a vector of strings with column/variable names.
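The different forms of r and c can be tried on a small data frame. A minimal sketch, re-creating the hw data frame from Section 1.2.1 so that it runs on its own:

```r
hw <- data.frame(Height = c(147, 150, 152),
                 Weight = c(52.2, 53.1, 54.4))

hw[2, ]                             # r is a single integer: second row, all columns
hw[c(1, 3), ]                       # r is a vector of integers: rows 1 and 3
hw[c(TRUE, FALSE, TRUE), ]          # r is a TRUE/FALSE vector, one entry per row
hw[, "Weight"]                      # c is a column name: collapses to a vector
hw[c(1, 3), c("Height", "Weight")]  # both r and c given
```
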
Examples (first 3 equivalent)
wq.red$pH
wq.red[, 9]
wq.red[, "pH"]
wq.red[, c(9,11)]
wq.red[, c("pH","alcohol")]
1.2.2.1 Advanced Query - Accessing a Subset of a Data Frame
Examples
- All rows where “Weight” has values >50
hw[hw$Weight>50,]
- Both columns, 4 random rows (using sample())
hw[sample(1:nrow(hw), 4),]
- Removing NA values
movies[!is.na(movies$budget),] #result has no NA values in budget
Note: ! is the logical NOT operator and is.na() is a function that acts on a vector giving a TRUE/FALSE vector as the output
- extra example
wq.red[wq.red$pH>3 & wq.red$density<1, "alcohol"]
#wq.red$pH>3 is a vector of TRUE/FALSE
#wq.red$density<1 is a vector of TRUE/FALSE
#& takes logical AND between two vectors
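Since the movies and wq.red data sets are not defined in these notes, the same pattern can be sketched on the small hw data frame from Section 1.2.1 (re-created here; the thresholds 148 and 54 are arbitrary choices for illustration):

```r
hw <- data.frame(Height = c(147, 150, 152),
                 Weight = c(52.2, 53.1, 54.4))

tall  <- hw$Height > 148     # FALSE TRUE TRUE
light <- hw$Weight < 54      # TRUE TRUE FALSE
tall & light                 # FALSE TRUE FALSE

hw[tall & light, ]           # rows meeting both conditions (row 2 only)
hw[tall & light, "Weight"]   # just the Weight values for those rows: 53.1
```
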
1.2.3 Interrogating Data Frames
Data Frame Functions - information
names() - column names
dim() - number of rows and columns
nrow() - number of rows
ncol() - number of columns
head() - useful for large data, just shows top rows (can add extra parameter to specify how many rows show up)
str() - shows details about type of data
Data Frame Functions - interesting data
colMeans()
rowMeans()
colSums()
rowSums()
cov() - covariance matrix
cor() - correlation matrix
scale() - scales data to be centered at 0 and scaled (both have optional arguments available)
summary() - gives all major statistics for each variable (column)
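A short sketch of a few of these functions, using the hw data frame from Section 1.2.1 (re-created here so the block runs on its own). With its default arguments, scale() should leave each column with mean 0 and standard deviation 1:

```r
hw <- data.frame(Height = c(147, 150, 152),
                 Weight = c(52.2, 53.1, 54.4))

dim(hw)        # 3 rows, 2 columns
names(hw)      # "Height" "Weight"
colMeans(hw)   # mean of each column
cor(hw)        # 2 x 2 correlation matrix

scaled <- scale(hw)        # centre and scale every column (returns a matrix)
colMeans(scaled)           # approximately 0 for each column
sd(scaled[, "Height"])     # 1
```
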
Data Frame (column/row) Functions - sorting and ordering
(optional decreasing = TRUE argument)
- sort(my.data$var) - sorts a variable but only outputs that column as a sorted vector
- order(my.data$var) - outputs the vector of indices that would put the variable in sorted order
Sorting a Data Frame by a Column
my.data[order(my.data$var),] - sorts whole data frame according to var column
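As a quick illustration, here is a sketch on a variant of the hw data frame whose rows are deliberately entered out of order (the row order is an assumption made for this example):

```r
# Rows deliberately out of order by Height
hw <- data.frame(Height = c(150, 147, 152),
                 Weight = c(53.1, 52.2, 54.4))

order(hw$Height)                            # 2 1 3
hw[order(hw$Height), ]                      # rows by increasing Height
hw[order(hw$Height, decreasing = TRUE), ]   # rows by decreasing Height
```
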
1.2.4 Manipulating Data Frames
Creating/Adding Variables to a Data Frame
Reference a variable that doesn’t exist and just assign it to something. For example,
hw$BMI <- hw$Weight/(hw$Height/100)^2
1.2.4.1 Merging Data Frames
rbind(,) and cbind(,)
Note: can be very error prone and often better to use tidyverse (unless with rbind, variables are identical and in same order or with cbind, observations are in same order)
rbind() pastes rows together (above/below)
cbind() pastes columns together (left/right)
For an example with rbind(,), if two data frames have the same column names, rbind(,) will stack the rows to make one data frame. As an example,
test1 <- data.frame(Col1 = c(1,5,9,2),
                    Col2 = c(6,9,8,3))
test2 <- data.frame(Col2 = c(5,9,0,1),
                    Col1 = c(4,9,3,0))
rbind(test1, test2)
## Col1 Col2
## 1 1 6
## 2 5 9
## 3 9 8
## 4 2 3
## 5 4 5
## 6 9 9
## 7 3 0
## 8 0 1
(notice how the columns were matched by name not order)
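For comparison, a minimal cbind(,) sketch. The extra data frame and its Col3 values are made up for illustration. Unlike rbind(,), nothing is matched by name here: rows are paired purely by position, which is why the observations must already be in the same order.

```r
test1 <- data.frame(Col1 = c(1,5,9,2),
                    Col2 = c(6,9,8,3))
# Hypothetical extra column with one value per row of test1
extra <- data.frame(Col3 = c(7,7,0,4))

combined <- cbind(test1, extra)  # columns pasted left/right, rows paired by position
dim(combined)                    # 4 rows, 3 columns
names(combined)                  # "Col1" "Col2" "Col3"
```
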
1.2.5 Missing Data in Data Frames
1.2.5.1 Importing Data with Missing Values
When using read.csv("mydata.csv"), can add the additional argument na.strings=c(...) to set any strings in the vector to be replaced by <NA>.
As an example,
read.csv("carsdata.csv", na.strings = c("", "na"))
would import the carsdata.csv file as a data frame with all strings that are empty "" or "na" replaced with the appropriate <NA> tag.
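To try na.strings without a file on disk, read.csv() can also read from a string through its text argument. A minimal sketch, where the CSV contents (csv_text) are made up for illustration:

```r
# Hypothetical CSV contents: one "na" entry and one empty entry in the colour column
csv_text <- "model,colour\nA,red\nB,na\nC,\n"

cars_df <- read.csv(text = csv_text, na.strings = c("", "na"))
cars_df                 # colour is NA for models B and C
is.na(cars_df$colour)   # FALSE TRUE TRUE
```
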
Ignoring NA values - na.rm = TRUE
The na.rm = TRUE argument ignores all <NA> values when performing the function. As an example,
mean(mydata$var, na.rm = TRUE)
which is equivalent to
mean(na.omit(mydata$var))
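A small sketch of why na.rm matters, using a made-up vector with one missing value:

```r
v <- c(2, NA, 4)        # hypothetical data with one missing value

mean(v)                 # NA, because the missing value propagates
mean(v, na.rm = TRUE)   # 3
mean(na.omit(v))        # 3
```
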
1.3 Lists
x <- list(1, "a", c(1,2,3), data.frame(a = 1:3, b = 4:6))
- can have a list of any type of variable
- can be good for hierarchical and tree structures, Nesting is permitted (i.e. lists can contain lists)
- variables in the list can also be named
x <- list(bob = 1, jill = "a", jack = c(1,2,3),
          eve = data.frame(a = 1:3, b = 4:6))
x$eve$a
## [1] 1 2 3
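Besides $, list elements can be reached with double square brackets, by name or by position. A minimal sketch on the same named list (re-created here so the block runs on its own):

```r
x <- list(bob = 1, jill = "a", jack = c(1,2,3),
          eve = data.frame(a = 1:3, b = 4:6))

x$jack        # by name: 1 2 3
x[["jack"]]   # same element, using double brackets
x[[3]]        # same element, by position
x["jack"]     # single brackets return a list of length 1, not the element itself
names(x)      # "bob" "jill" "jack" "eve"
```
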
1.4 Data Types
- Numeric
- Logical (TRUE/FALSE)
- Categorical (called factors in R), could be ordered (e.g. credit rating) or could not be (e.g. eye colour)
- Date/Time
- Text or String
- Others (e.g. image, spatial, audio, video)
1.4.1 Categorical Data (Factors)
1.4.1.1 Making Data Frames Categorical
R cannot tell the difference between factors and strings when importing data frames, so we must “tidy” them up after importing
Clean up a data frame (by default all non-numeric columns are imported as strings) by converting all the strings in one variable to a factor
eyesDF <- data.frame(name=c("anne","john","charlie","sarah","max","ellie","eve"),
                     eyeColour=c("blue","green","brown","brown","blue","blue","brown"))
#changes a variable to be categorical (a factor)
eyesDF$eyeColour <- as.factor(eyesDF$eyeColour)
summary(eyesDF$eyeColour)
## blue brown green
## 3 3 1
Treat data as a factor or a text?
- Treat as text not factor when every observation is unique (e.g. surname)
- When some text is coming up very often it may be more appropriate as a factor
1.4.1.2 Making Vectors Categorical
We can create a factor vector by creating the entries as strings and then applying the factor() function
For example,
eye.colour <- factor(c("blue","brown","green","blue","blue","brown","green","blue","blue","green","blue","green","blue","blue","brown","green","brown","brown","green"))
summary(eye.colour)
## blue brown green
## 8 5 6
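The data types list in Section 1.4 mentions ordered categories such as credit ratings. One way to sketch this is with the levels and ordered arguments of factor(). The rating values and their ordering below are made up for illustration:

```r
# Hypothetical credit ratings, ordered from lowest to highest
rating <- factor(c("B", "AAA", "A", "B"),
                 levels = c("B", "A", "AAA"),
                 ordered = TRUE)

levels(rating)   # "B" "A" "AAA"
rating < "A"     # comparisons follow the level order: TRUE FALSE FALSE TRUE
sort(rating)     # B B A AAA
```
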
Example Using Factors
The following data frame has a factor variable (feed)
data("chickwts")
head(chickwts)
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
summary(chickwts)
## weight feed
## Min. :108.0 casein :12
## 1st Qu.:204.5 horsebean:10
## Median :258.0 linseed :12
## Mean :261.3 meatmeal :11
## 3rd Qu.:323.5 soybean :14
## Max. :423.0 sunflower:12
We can keep only the rows with specific factor levels,
chickwts[chickwts$feed %in% c("sunflower","linseed"),]
Number of levels (different factors) and names of levels
nlevels(chickwts$feed)
## [1] 6
levels(chickwts$feed)
## [1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
unclass() gives each entry the integer code corresponding to its factor level
unclass(chickwts$feed)
## [1] 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6
## [39] 6 6 6 6 6 6 6 6 6 6 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1
## attr(,"levels")
## [1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
1.5 Other
Packages
- Installing a package:
install.packages("dplyr")
- Loading a package:
library("dplyr")
Loops
for(i in x) {
#i iterates over the values in vector x
}
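A minimal concrete loop, assuming a made-up vector x and an accumulator named total:

```r
x <- c(2, 3, 8, 5)
total <- 0
for (i in x) {
  total <- total + i   # i takes the values 2, 3, 8, 5 in turn
}
total                  # 18, the same as sum(x)
```
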
Functions
myFunction <- function(arg1, arg2 = 1) {
  ...
  return(...)
}
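A concrete sketch of a function with a default argument. The name bmi and the default height of 150 are made up for illustration; the formula is the BMI calculation from Section 1.2.4:

```r
# Hypothetical function: weight in kg, height in cm (defaults to 150)
bmi <- function(weight, height_cm = 150) {
  return(weight / (height_cm / 100)^2)
}

bmi(52.2, 147)   # both arguments given
bmi(54)          # height_cm falls back to its default: 54 / 1.5^2 = 24
```
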
Functions acting on vectors - sapply()
sapply(X, Fun)
Applies function Fun to vector X element-wise
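A short sketch of sapply() with a made-up vector and a small anonymous function:

```r
x <- c(1, 2, 3)
squares <- sapply(x, function(v) v^2)   # applies the function to each element
squares                                 # 1 4 9
```

For simple arithmetic like this, x^2 gives the same result directly; sapply() is more useful when the function is not already vectorised.
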
Logical Operators
Logical operators act element-wise on pairs of vectors (of same size) of TRUE and FALSE values. AND is &, OR is |.
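A quick sketch of the operators on two made-up logical vectors, with the NOT operator ! included for completeness:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b   # TRUE FALSE FALSE FALSE
a | b   # TRUE TRUE TRUE FALSE
!a      # FALSE FALSE TRUE TRUE
```
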