Chapter 11 DSSC - Data Wrangling, Presentation and Applications
To check for missing data, use:
print(paste("Missing data:", sum(is.na(df$var)), sep=" ", collapse=""))
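As a minimal sketch (the data frame df and its columns are made up for illustration), the same check extends to every column at once with colSums():

```r
# Hypothetical data frame with some missing values
df <- data.frame(var   = c(1, NA, 3, NA),
                 other = c("a", "b", NA, "d"))

# Missing values in a single column
n_missing <- sum(is.na(df$var))
print(paste("Missing data:", n_missing))

# Missing values in every column at once:
# is.na() on a data frame gives a logical matrix, colSums() counts the TRUEs
missing_by_col <- colSums(is.na(df))
print(missing_by_col)
```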
11.1 Data Wrangling with Tidyverse
Loading tidyverse,
library("tidyverse")
11.1.1 Tidy Form (tidyr)
What is tidy data?
- each variable is in a column
- each observation is in a row
- each type of observational unit forms a table
Moving to and from tidy data
Problems (how data may violate tidy form)
- Data is too wide - one variable spread over multiple columns (use pivot_longer())
- Data is too long - one observation spread along multiple rows (use pivot_wider())
pivot_longer()
Makes Wide Data Longer
The arguments are:
- Data Frame
- Columns to transform
- Name of the column where previous column names should go
- Name of the column where values from the column should go
Example
who_wide
## country y1999 y2000
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
pivot_longer(who_wide,
c(`y1999`, `y2000`),
names_to = "year",
values_to = "cases")
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan y1999 745
## 2 Afghanistan y2000 2666
## 3 Brazil y1999 37737
## 4 Brazil y2000 80488
## 5 China y1999 212258
## 6 China y2000 213766
pivot_wider()
Makes Long Data Wider
The arguments are:
- Data Frame
- Optionally, the columns that identify each observation (id_cols; defaults to all columns not used below)
- Name of the column where column names should come from
- Name of the column where values should come from
Example
who_long
## country year type count
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
pivot_wider(who_long,
names_from = "type",
values_from = "count")
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Additional Example - DSSC Lab 5.6
pres.res
## Candidate California Arkansas
## 1 Clinton 8753788/14181595 380494/1130676
## 2 Trump 4483810/14181595 684872/1130676
## 3 Other 943997/14181595 65310/1130676
pres.res2 <- pivot_longer(pres.res,
c("California", "Arkansas"),
names_to = "State",
values_to = "Proportion")
pres.res2
## # A tibble: 6 × 3
## Candidate State Proportion
## <chr> <chr> <chr>
## 1 Clinton California 8753788/14181595
## 2 Clinton Arkansas 380494/1130676
## 3 Trump California 4483810/14181595
## 4 Trump Arkansas 684872/1130676
## 5 Other California 943997/14181595
## 6 Other Arkansas 65310/1130676
pres.res3 <- separate(pres.res2, "Proportion", c("Votes", "Total"))
pres.res3
## # A tibble: 6 × 4
## Candidate State Votes Total
## <chr> <chr> <chr> <chr>
## 1 Clinton California 8753788 14181595
## 2 Clinton Arkansas 380494 1130676
## 3 Trump California 4483810 14181595
## 4 Trump Arkansas 684872 1130676
## 5 Other California 943997 14181595
## 6 Other Arkansas 65310 1130676
pres.res4 <- mutate(pres.res3, Votes = as.numeric(Votes), Total = as.numeric(Total))
str(pres.res4)
## tibble [6 × 4] (S3: tbl_df/tbl/data.frame)
## $ Candidate: chr [1:6] "Clinton" "Clinton" "Trump" "Trump" ...
## $ State : chr [1:6] "California" "Arkansas" "California" "Arkansas" ...
## $ Votes : num [1:6] 8753788 380494 4483810 684872 943997 ...
## $ Total : num [1:6] 14181595 1130676 14181595 1130676 14181595 ...
pres.res5 <- pres.res4 |>
  group_by(Candidate) |>
  summarise(Percent = sum(Votes)/sum(Total)*100) |>
  arrange(desc(Percent))
pres.res5
## # A tibble: 3 × 2
## Candidate Percent
## <chr> <dbl>
## 1 Clinton 59.7
## 2 Trump 33.8
## 3 Other 6.59
Other useful tidyr functions
- separate() - splits one column of strings into multiple new columns
- unite() - combines many columns into one (as a string)
- extract() - uses regular expressions to pull out specific information from a string column
Example
fball
## home away score
## 1 Man U Shef Wed 2-1
## 2 Tottenham Arsenal 0-0
## 3 Chelsea W Ham 1-0
separate(fball, "score", c("home_goals", "away_goals"))
## home away home_goals away_goals
## 1 Man U Shef Wed 2 1
## 2 Tottenham Arsenal 0 0
## 3 Chelsea W Ham 1 0
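unite() and extract() have no example above, so here is a small sketch on the same fball data (rebuilt by hand here; the new column names fixture, home_goals and away_goals and the " v " separator are arbitrary choices):

```r
library(tidyr)

# The fball table from the example above, rebuilt by hand
fball <- data.frame(home  = c("Man U", "Tottenham", "Chelsea"),
                    away  = c("Shef Wed", "Arsenal", "W Ham"),
                    score = c("2-1", "0-0", "1-0"))

# unite(): paste the home and away columns into one string column
fixtures <- unite(fball, "fixture", home, away, sep = " v ")
fixtures

# extract(): each bracketed group in the regex becomes one new column;
# convert = TRUE turns the extracted strings into integers
goals <- extract(fball, score, into = c("home_goals", "away_goals"),
                 regex = "([0-9]+)-([0-9]+)", convert = TRUE)
goals
```

Unlike separate(), extract() only keeps what the groups capture, which helps when the separator is not consistent.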
11.1.2 Data Manipulation (dplyr)
Main dplyr functions
(First argument is always the data frame)
filter()
- Focus on a subset of rows
Other Arguments
- condition to filter by
For example, filter(who, year == 1999)
(see above list of logical operators)
arrange()
- Reorder the rows
Other Arguments
- Variable names to sort by, sub-sorting by later variables
- Wrap variable name in desc() to sort descending (ascending by default)
For example, arrange(who, year, desc(country))
select()
- Focus on a subset of variables (columns)
Other Arguments
- Name of variables to retain
For example, select(who, year, cases)
mutate()
- Create new derived variables
Other Arguments
- Name of new variable and equation defining it
For example, mutate(who, rate = cases/population)
group_by()
- Splits a data frame up into groups according to one variable
Other Arguments
- Name of variable to group by
For example, group_by(who, country)
summarise()
- Create summary statistics (collapsing many rows) by groupings
Other Arguments
- Function to summarise by
For example, summarise(who, total = sum(cases))
Note: often want to summarise by group
For example,
who2 <- group_by(who, country)
summarise(who2, total = sum(cases), change = max(cases)-min(cases))
11.1.3 Pipelines
Chain functions (not limited to tidyverse functions) so that the result of the first function becomes the first argument of the second function, and so on.
Example,
filter(x, ...) |>
select(...) |>
mutate(...) |>
group_by(...) |>
arrange(...)
Pipeline Operator: CMD-SHIFT-M
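A concrete version of the pipeline pattern, as a sketch: the who data frame below is rebuilt by hand from the figures in the pivot_wider() example, and the per-10,000 rate is just an illustrative derived variable.

```r
library(dplyr)

# Tidy WHO data, typed in from the pivot_wider() output earlier
who <- data.frame(
  country    = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year       = c(1999, 2000, 1999, 2000, 1999, 2000),
  cases      = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583)
)

# Each step receives the previous result as its first argument
rates <- who |>
  filter(year == 2000) |>
  mutate(rate = cases / population * 10000) |>
  select(country, rate) |>
  arrange(desc(rate))

rates
```

The native pipe |> needs R 4.1 or later.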
11.1.4 Joining Data Frames in Tidyverse
Simplest case of joining data frames (more details in data frames section):
- rbind() - paste rows together (above/below)
- cbind() - paste cols together (left/right)
These methods can be very error prone (requires variables/observations in identical order etc)
Advanced Data Frame Joins
- left_join(x, y) - add new variables from y to x, keeping all x obs
- right_join(x, y) - add new variables from x to y, keeping all y obs
- inner_join(x, y) - keep only matching rows
- full_join(x, y) - keep all rows in both x and y
Example
band_members
## # A tibble: 3 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments2
## # A tibble: 3 × 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
left_join(band_members, band_instruments2, by = c("name" = "artist"))
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
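For comparison, a sketch of inner_join() and full_join() on the same two tables (rebuilt by hand here as plain data frames):

```r
library(dplyr)

band_members <- data.frame(name = c("Mick", "John", "Paul"),
                           band = c("Stones", "Beatles", "Beatles"))
band_instruments2 <- data.frame(artist = c("John", "Paul", "Keith"),
                                plays  = c("guitar", "bass", "guitar"))

# inner_join(): only names present in both tables (John and Paul)
inner <- inner_join(band_members, band_instruments2, by = c("name" = "artist"))
inner

# full_join(): every name from either table, with NA where there is no match
full <- full_join(band_members, band_instruments2, by = c("name" = "artist"))
full
```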
11.2 Dynamic Documents and Interactive Dashboards
11.2.1 RMD
Document Preamble
---
title: "Example"
author: "(optional) Jamie Reason"
date: "(optional)"
output:
  html_document: default
  pdf_document: default
---
For further formatting, refer to RMD Cheat Sheet
11.2.2 Shiny
Resource: Mastering Shiny Book
Outline
- UI
- Server
R code can be added to any part of a shiny document but only the code in the server will be updated when needed.
Starting a Shiny Dashboard (create a new shiny app in R studio):
fluidPage() is just the most common page function but there are alternatives
library(shiny)
#misc code
ui <- fluidPage(
  ...
)
server <- function(input, output, session){
  #server code
}
shinyApp(ui, server)
11.2.2.1 UI
11.2.2.1.1 Pages
Examples
ui <- fluidPage(
  "One",
  "Two",
  "Three"
)
shinyApp(ui, server = function(input, output, session) {})
ui <- navbarPage(
  "Title of page",
  tabPanel("My first tab", "Hello Alice"),
  tabPanel("My second tab", "Hello Bob")
)
shinyApp(ui, server = function(input, output, session) {})
Other pages: fixedPage(), fillPage(), …
11.2.2.1.2 Layouts and Panels
These go inside of the page:
- titlePanel("My App")
- sidebarLayout()
  - first argument: sidebarPanel()
  - second argument: mainPanel()
- fluidRow() - creates a new row containing column() calls
  - each column() takes a width from 1 to 12 as its first argument (the widths in a row should sum to 12)
  - the other arguments are the contents, e.g. outputs
Examples
ui <- fluidPage(
  titlePanel("My App"),
  sidebarLayout(
    sidebarPanel("I'm in sidebar"),
    mainPanel("I'm in main panel")
  )
)
shinyApp(ui, server = function(input, output, session) {})
ui <- fluidPage(
  fluidRow(
    column(4, "Lorem ipsum dolor ..."),
    column(8, "Lorem ipsum dolor ...")
  ),
  fluidRow(
    column(6, "Lorem ipsum dolor ..."),
    column(6, "Lorem ipsum dolor ...")
  )
)
shinyApp(ui, server = function(input, output, session) {})
11.2.2.1.3 UI Inputs
All inputs take the same first argument, inputId, the unique identifier of the input.
In the server, the value can be accessed using input$name, where name is the inputId.
The second argument is a label, i.e. how its name appears on the dashboard.
Text Inputs
textInput()
passwordInput()
textAreaInput()
Numeric Inputs
numericInput()
sliderInput()
Categoric Inputs
selectInput()
radioButtons()
checkboxGroupInput()
Examples
ui <- fluidPage(
  numericInput("num", "Number one", value = 0, min = 0, max = 100),
  sliderInput("num2", "Number two", value = 50, min = 0, max = 100),
  sliderInput("rng", "Range", value = c(10, 20), min = 0, max = 100)
)
shinyApp(ui, server = function(input, output, session) {})
animals <- c("dog", "cat", "mouse", "bird", "other", "I hate animals")
ui <- fluidPage(
  selectInput("state", "What's your favourite state?", state.name),
  radioButtons("animal", "What's your favourite animal?", animals),
  checkboxGroupInput("animal2", "What animals do you like?", animals)
)
shinyApp(ui, server = function(input, output, session) {})
11.2.2.2 Server and UI Outputs
All outputs take the same first argument, outputId, and in the server an output is set via output$name, where name is the outputId.
11.2.2.2.1 UI Outputs
Text Outputs
- textOutput(), paired with renderText()
- verbatimTextOutput(), paired with renderPrint()
Plot Outputs
- plotOutput() and renderPlot()
  - width argument
  - res = 96 argument is closest to what you see in RStudio
Examples
ui <- fluidPage(
  textInput("name", "What's your name?"),
  textOutput("greet")
)
server <- function(input, output, session) {
  output$greet <- renderText({
    if(nchar(input$name) > 0) {
      return(paste0("Hello ", input$name))
    } else {
      return("Hello friend, tell me your name!")
    }
  })
}
shinyApp(ui, server)
ui <- fluidPage(
  plotOutput("myplot", width = "400px")
)
server <- function(input, output, session) {
  output$myplot <- renderPlot({
    plot(iris$Sepal.Length, iris$Sepal.Width)
  }, res = 96)
}
shinyApp(ui, server)
11.2.2.2.2 Variables outside outputs (reactive)
Instead of making ordinary variables in the server (which you can’t do as they wouldn’t be reactive), use the reactive({}) call:
Inside the server,
name <- ...
becomes,
name <- reactive({
  ...
})
And when name is used it should be called as name()
Examples
server <- function(input, output, session) {
  name <- reactive({
    toupper(input$name)
  })
  output$greet <- renderText({
    if(nchar(input$name) > 0) {
      return(paste0("Hello ", name(), ", here is your plot ..."))
    } else {
      return("Hello friend, tell me your name!")
    }
  })
  output$myplot <- renderPlot({
    if(nchar(input$name) > 0) {
      ggplot(iris, aes_string(x = input$xvar, y = input$yvar)) +
        geom_point() +
        labs(title = paste0(name(), "'s plot!"))
    }
  }, res = 96)
}
11.2.2.3 Full Example
From exercise 5.78 (Lab 8)
library("shiny")
library("ukpolice")
library("tidyverse")
library("leaflet")
nbd <- ukc_neighbourhoods("durham")
nbd2 <- nbd$id
names(nbd2) <- nbd$name
# Define UI for application
ui <- fluidPage(
  titlePanel("UK Police Data"),
  sidebarLayout(
    sidebarPanel(
      selectInput("nbd", "Choose Durham Constabulary Neighborhood", nbd2),
      textInput("date", "Enter the desired year and month in the format YYYY-MM", value = "2021-09")
    ),
    # Show a plot of the generated distribution
    mainPanel(
      plotOutput("barchart"),
      leafletOutput("map")
    )
  )
)
# Define server logic
server <- function(input, output) {
  # Get boundaries for selected neighbourhood
  # Wrapped in a reactive because we need this to trigger a
  # change when the input neighborhood changes
  bdy <- reactive({
    bdy <- ukc_neighbourhood_boundary("durham", input$nbd)
    bdy |>
      mutate(latitude = as.numeric(latitude),
             longitude = as.numeric(longitude))
  })
  # Get crimes for selected neighbourhood
  # Also wrapped in a reactive because we need this to trigger a
  # change when the boundary above, or date, changes
  crimes <- reactive({
    bdy2 <- bdy() |>
      select(lat = latitude,
             lng = longitude)
    ukc_crime_poly(bdy2[round(seq(1, nrow(bdy2), length.out = 100)), ], input$date)
  })
  # First do plot
  output$barchart <- renderPlot({
    ggplot(crimes()) +
      geom_bar(aes(y = category, fill = outcome_status_category)) +
      labs(y = "Crime", fill = "Outcome Status")
  }, res = 96)
  # Then do map
  output$map <- renderLeaflet({
    leaflet() |>
      addTiles() |>
      addPolygons(lng = bdy()$longitude, lat = bdy()$latitude) |>
      addCircles(lng = as.numeric(crimes()$longitude), lat = as.numeric(crimes()$latitude), label = crimes()$category, color = "red")
  })
}
# Run the application
shinyApp(ui = ui, server = server)
11.3 Dates
(see DSSC Lab 9)
Use the lubridate package
library("lubridate")
## Warning: package 'lubridate' was built under R version 4.1.2
## Loading required package: timechange
## Warning: package 'timechange' was built under R version 4.1.2
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Creating Dates
Current date and time
today()
## [1] "2023-01-15"
now()
## [1] "2023-01-15 16:34:16 GMT"
str(today()) #these are dates not strings
## Date[1:1], format: "2023-01-15"
Constructing dates from strings and numbers
ymd("2021-12-02")
## [1] "2021-12-02"
mdy("December 2nd, 2021")
## [1] "2021-12-02"
ymd(20211202)
## [1] "2021-12-02"
ymd_hms("2021-12-02 12:33:59")
## [1] "2021-12-02 12:33:59 UTC"
Constructing dates and times from individual components
make_date(2021, 12, 2)
## [1] "2021-12-02"
make_date("2021", "12", "2")
## [1] "2021-12-02"
make_datetime(2021, 12, 2, 12)
## [1] "2021-12-02 12:00:00 UTC"
make_datetime(2021, 12, 2, 12, 33, 59)
## [1] "2021-12-02 12:33:59 UTC"
11.3.1 Time Zones
Date creation functions take a tz argument, e.g. tz = "America/New_York".
now(tz = "America/New_York")
## [1] "2023-01-15 11:34:16 EST"
To see all available zones call OlsonNames()
Changing Time Zone
#forces change of time zone without changing date/time
x <- ymd_hm("2019-12-02 15:10")
force_tz(x, "America/New_York")
## [1] "2019-12-02 15:10:00 EST"
#converts date/time to a new time zone
with_tz(x, "America/New_York")
## [1] "2019-12-02 10:10:00 EST"
11.3.2 Extracting From Dates
datetime <- today()
year(datetime)
## [1] 2023
yday(datetime)
## [1] 15
wday(datetime, week_start = 1) #by default, sunday is first day of week, use this to make it monday
## [1] 7
month(datetime, label = TRUE)
## [1] Jan
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
Rounding Dates/Times
floor_date(datetime, unit = "minute")
## [1] "2023-01-15 UTC"
ceiling_date(datetime, unit = "week")
## [1] "2023-01-22"
ceiling_date(datetime, unit = "quarter")
## [1] "2023-04-01"
floor_date(datetime, unit = "week", week_start = 1)
## [1] "2023-01-09"
11.3.3 Misc
Updating Dates/Times
datetime <- ymd_hms("2021-12-02 12:33:59")
datetime <- update(datetime, hour = 11, second = 33)
datetime
## [1] "2021-12-02 11:33:33 UTC"
#Alternatively,
datetime <- ymd_hms("2021-12-02 12:33:59")
hour(datetime) <- 11
second(datetime) <- 33
datetime
## [1] "2021-12-02 11:33:33 UTC"
Durations
Can do arithmetic with dates and times
einstein <- dmy("14th March 1879")
age <- today() - months(42) - einstein #age 42 months ago
age
## Time difference of 51257 days
Get a duration after arithmetic using as.duration()
as.duration(age)
## [1] "4428604800s (~140.33 years)"
11.4 Strings and Regular Expressions
11.4.1 Strange characters
When you want a string with strange characters (e.g. quotes or backslashes), enclose it in r"(...)" instead of just "...".
z <- r"(As Roosevelt said,
"Believe you can and you're halfway there."
)"
cat(z)
## As Roosevelt said,
## "Believe you can and you're halfway there."
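cat() is like a print command, but it writes the contents of the string as-is rather than showing the surrounding quotes and escape characters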
11.4.2 stringr
(part of tidyverse)
Most stringr functions begin with str_ so you can use autocomplete for many string operations.
Basics
String Length
str_length(c("Data Science and Statistical Computing", "by", "Dr Louis Aslett"))
## [1] 38 2 15
Combining Strings
str_c("Data Science and Statistical Computing", "by", "Dr Louis Aslett")
## [1] "Data Science and Statistical ComputingbyDr Louis Aslett"
str_c("Data Science and Statistical Computing", "by", "Dr Louis Aslett", sep = " ")
## [1] "Data Science and Statistical Computing by Dr Louis Aslett"
str_c(c("Data Science and Statistical Computing", "by", "Dr Louis Aslett"))
## [1] "Data Science and Statistical Computing"
## [2] "by"
## [3] "Dr Louis Aslett"
str_c(c("Data Science and Statistical Computing", "by", "Dr Louis Aslett"), collapse = " ")
## [1] "Data Science and Statistical Computing by Dr Louis Aslett"
Subsetting Strings
z <- c("Alice", "Bob", "Connie", "David")
str_sub(z, 1, 4)
## [1] "Alic" "Bob" "Conn" "Davi"
str_sub(z, 1, 2) <- "Zo"
z
## [1] "Zoice" "Zob" "Zonnie" "Zovid"
Trimming
str_trim(" String with trailing, middle, and leading white space\n\n")
## [1] "String with trailing, middle, and leading white space"
str_squish(" String with trailing, middle, and leading white space\n\n")
## [1] "String with trailing, middle, and leading white space"
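The two calls differ only in how they treat whitespace in the middle of the string. A small sketch, using an input string of my own with deliberately repeated internal spaces:

```r
library(stringr)

# Made-up input: leading spaces, runs of spaces in the middle, trailing newlines
s <- "  String with   trailing, middle,  and leading white space\n\n"

trimmed  <- str_trim(s)    # removes leading and trailing whitespace only
squished <- str_squish(s)  # also collapses internal runs to a single space

str_length(c(trimmed, squished))  # the squished version is shorter
```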
11.4.2.1 Regexes
See all details in docs or lecture slides
Regexes are used for finding patterns in strings
str_view()
Identify a pattern in a string:
Exact matching
str_view("string to find pattern in", "pattern")
## [1] │ string to find <pattern> in
Wildcard matching
x <- c("apple", "banana", "pear")
str_view(x, ".a.")
## [2] │ <ban>ana
## [3] │ p<ear>
How to match a literal . ? Escape it with \ (and escape the backslash itself inside the R string):
str_view(c(".bc", "a.c", "be."), "a\\.c")
Anchoring
To start:
str_view(x, "^a")
## [1] │ <a>pple
To end:
str_view(x, "a$")
## [2] │ banan<a>
You can also anchor to both ends.
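A short sketch of anchoring with str_detect(), which returns TRUE/FALSE per string instead of highlighting matches (the last two patterns are my own illustrations of anchoring both ends):

```r
library(stringr)

x <- c("apple", "banana", "pear")

starts_a   <- str_detect(x, "^a")       # starts with "a"
ends_a     <- str_detect(x, "a$")       # ends with "a"
exact_pear <- str_detect(x, "^pear$")   # the whole string is exactly "pear"
both_ends  <- str_detect(x, "^.a.*a$")  # second letter "a" and last letter "a"

data.frame(x, starts_a, ends_a, exact_pear, both_ends)
```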
Matching Set of Characters I
Match a single character from the set (every match is shown):
str_view(x, "[pan]")
## [1] │ <a><p><p>le
## [2] │ b<a><n><a><n><a>
## [3] │ <p>e<a>r
Find one or more instances consecutively:
str_view(x, "[pan]+")
## [1] │ <app>le
## [2] │ b<anana>
## [3] │ <p>e<a>r
Find exact number of instances occurring consecutively:
str_view(x, "[pan]{2}")
## [1] │ <ap>ple
## [2] │ b<an><an>a
Find a range of instances occurring consecutively:
str_view(x, "[pan]{1,3}")
## [1] │ <app>le
## [2] │ b<ana><na>
## [3] │ <p>e<a>r
Matching Set of Characters II
y <- c("There were 122 in total", "Overall about 390 found", "100 but no more")
str_view(y, "[0-9]+")
## [1] │ There were <122> in total
## [2] │ Overall about <390> found
## [3] │ <100> but no more
str_view(y, "[^A-Za-z ]+") #^ anchor inside so acts as a negation
## [1] │ There were <122> in total
## [2] │ Overall about <390> found
## [3] │ <100> but no more
str_view(y, "^[0-9]+") #^ anchor on outside
## [3] │ <100> but no more
str_view(y, "[a-z ]+")
## [1] │ T<here were >122< in total>
## [2] │ O<verall about >390< found>
## [3] │ 100< but no more>
11.5 Probability Distributions
Letter | Function | Use |
---|---|---|
“d” | dnorm() | evaluates pdf \(f(x)\) |
“p” | pnorm() | evaluates cdf \(F(x)\) |
“q” | qnorm() | evaluates inverse cdf \(F^{-1}(q)\), i.e. the \(x\) such that \(P(X \leq x) = q\) |
“r” | rnorm() | generates random numbers |
Parameters will vary, e.g.
- Normal distribution: dnorm, pnorm, qnorm, rnorm. Parameters: mean (\(\mu\)) and sd (\(\sigma\)).
- t distribution: dt, pt, qt, rt. Parameter: df
- \(\chi^2\) distribution: dchisq, pchisq, qchisq, rchisq. Parameter: df
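A quick sketch of the four prefixes in use (the evaluation points, parameters and seed are arbitrary choices for illustration):

```r
# Standard normal: density, cdf, quantile
d <- dnorm(0)      # pdf at 0, equal to 1/sqrt(2*pi)
p <- pnorm(1.96)   # P(X <= 1.96), roughly 0.975
q <- qnorm(p)      # inverse cdf undoes pnorm, giving back 1.96

# Random draws from N(mean = 10, sd = 2)
set.seed(1)        # arbitrary seed, only so the draws are reproducible
r <- rnorm(5, mean = 10, sd = 2)

# The same prefixes work for other distributions, with their own parameters
pt(2, df = 5)          # cdf of a t distribution with 5 degrees of freedom
qchisq(0.95, df = 3)   # 95% quantile of a chi-squared with 3 degrees of freedom
```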
11.5.1 DSSC Theory Applications
11.5.1.1 Monte Carlo Hypothesis Test
Example 2.1
# Specify test statistic and null value
x.bar <- 8.6
n <- 6
mu0 <- 9.2

# Simulate lots of data assuming the null is true
t <- rep(0, 50000)

for(j in 1:50000) {
  z <- rnorm(n, mu0, sqrt(0.4)) #random sample (of n=6) generated under H0
  t[j] <- abs(mean(z)-mu0) #difference in mean of random sample and mean under H0 assumption
}
# Calculate empirical p-value
sum(t > abs(x.bar-mu0)) / 50000 #proportion of random samples that were further from mu0 than the observation
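As a sanity check on the simulation, here is a sketch that does the same thing without an explicit loop and compares against the exact two-sided p-value. It assumes, as the code above does, that the data are normal with known variance 0.4 under the null; the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

x.bar <- 8.6
n     <- 6
mu0   <- 9.2
sigma <- sqrt(0.4)  # known standard deviation assumed under H0

# Simulate the test statistic 50000 times using replicate()
t.sim <- replicate(50000, abs(mean(rnorm(n, mu0, sigma)) - mu0))
p.mc  <- mean(t.sim > abs(x.bar - mu0))

# Exact p-value: under H0 the sample mean is N(mu0, sigma^2 / n)
p.exact <- 2 * pnorm(-abs(x.bar - mu0) / (sigma / sqrt(n)))

c(monte_carlo = p.mc, exact = p.exact)
```

The two values should agree up to Monte Carlo error, which shrinks as the number of simulations grows.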
11.5.1.2 Bootstrap
Set-up
- A sample of \(n\) independent observations, \(\mathbf{x} = (x_1, \ldots , x_n)\)
- There is a statistic \(S( \cdot )\) we wish to estimate
- We also want the standard error of this estimate
General Method:
- Draw \(B\) new samples of size \(n\) with replacement from \(\mathbf{x} = (x_1, \ldots , x_n)\)
- Call these samples \(\textbf{x}^{\star 1}, \ldots , \textbf{x}^{\star B}\)
- Calculate the estimate, \(\bar{S}^{\star}=\frac{1}{B} \sum_{b=1}^{B} S\left(\mathbf{x}^{\star b}\right)\)
- Calculate the variance, \(\widehat{\operatorname{Var}}(S(\mathbf{x}))=\frac{1}{B-1} \sum_{b=1}^{B}\left(S\left(\mathbf{x}^{\star b}\right)-\bar{S}^{\star}\right)^{2}\)
- The standard error is the square root of this variance
Example 3.1 (Also see 3.5)
# Mouse data
x <- c(94,197,16,38,99,141,23)

# Number of bootstraps
B <- 1000

# Statistic
S <- mean

# Perform bootstrap
S.star <- rep(0, B)
for(b in 1:B) {
  x.star <- sample(x, replace = TRUE)
  S.star[b] <- S(x.star)
}
# Bootstrap estimate
mean(S.star)
# Standard error of estimate
sd(S.star)
Empirical CDF - ecdf(x)
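A minimal sketch of ecdf() on the mouse data from Example 3.1 (the evaluation points are arbitrary). Note that ecdf() returns a function, which is then called like any other:

```r
x <- c(94, 197, 16, 38, 99, 141, 23)  # mouse data from Example 3.1

Fn <- ecdf(x)   # Fn is itself a function

Fn(99)          # proportion of observations <= 99, here 5 out of 7
Fn(c(0, 200))   # 0 below the smallest value, 1 at or above the largest

plot(Fn)        # step-function plot of the empirical cdf
```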