This page contains material for NSCR Carpentries, a two-day workshop hosted by the NSCR with support from Data and Software Carpentries on 5 and 12 October, 2022.
R is a programming language and open source software environment. Although it is widely described as a statistics programme, R can be used for a variety of other tasks, such as text analysis, web scraping, mapping and spatial analysis, web dashboards and document preparation. Even this worksheet was created in R – so is the NSC-R team’s website.
In reality, many of us will spend most of our time on importing, tidying, transforming and visualising our data. This is like the infamous ‘80:20’ rule. 80% of your time will be spent on wrangling with the data, 20% will be spent on actually running fancy statistical models. You might get the answers you want without even running any fancy statistics!
So, this is what we focus on: wrangling and visualisation. Getting these basics nailed early on will drastically improve the efficiency of many research tasks through automation, and open up a variety of new tools that are not available in more ‘traditional’ software like Excel. Perhaps even more importantly, it will also make your research transparent and reproducible (both for yourself, and others).
Although R is available as a stand-alone piece of software, it is most commonly used through a user-friendly interface called RStudio. We recommend that you first download R and then download RStudio. All the examples in this workshop will be done in RStudio. We’ll cover (or have covered) the basics of the RStudio interface during live demonstrations, but as a reminder, there are four main windows: the script editor, console, global environment and help/plots.
The RStudio environment.
A lot of things in RStudio are easier if you work inside what’s called a project. Projects let you keep all your R code for a particular piece of work self-contained, and mean you can switch between projects while keeping all the tabs in RStudio just as you left them. Projects also make it easier to collaborate with other people, because irrespective of what computer you are working on, you will all be working within the same folder structure. It’s a good idea to always work inside a project if you are using RStudio.
Luckily for you, assuming you have downloaded all the workshop material, you will already have a pre-prepared project. Your folder structure will look something like the screenshot below. You will notice a file called nscr_carpentries which has the file extension .Rproj. This is the R project for the workshop. Throughout today and next week, we recommend that you always work within this project. Opening this file will automatically open RStudio. You can see which project you are working on in the top right. This should confirm that you are working within the nscr_carpentries project.
Left: The R project file in your materials folder. Right: Confirming you are working within the project in RStudio.
For future reference, and your own work going forward, you can set up an R project yourself with just a few clicks. We recommend this brief tutorial. But, there’s no need to do this now, unless you really want to!
Code in R tends to be made up of two parts: functions and objects. Functions do things to whatever you input. This might be as simple as calculating the mean of a list of numbers. An object is anything that you create, and can take any form. A common example might be a data frame, but objects can also be things like a list of words, a single number, or a graphic. You can create an object by using the assign operator <-. Anything on the left hand side of <- will be the object that you are creating, and anything to the right is going to define what that object is, often through the use of a function. Andy Field summarises this as object <- function. You can read this as ‘object is created from function’. You can call objects whatever you want, but it’s good practice to give them meaningful names.
In the example below, we create an object without even using a function. Type (or copy) this into your script, highlight it (or click on the line of code) and hit Ctrl + Enter to execute it. For Mac, this shortcut is Command + Enter.
city <- "Amsterdam"
The executed code will appear in the console. You’ll notice that nothing else actually happens: all you’ve done is create an object called city which contains the word “Amsterdam”. This object will appear in your global environment in the top-right of RStudio. If you want to print the contents of an object to the console, simply execute the name of the object with Ctrl + Enter.
For example:
city
## [1] "Amsterdam"
Be careful when typing: R is case-sensitive! It wouldn’t recognize City or CITY even if the object city exists.
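One quick way to see this for yourself is a short sketch using the exists() function, which checks whether an object with a given name has been created:

```r
city <- "Amsterdam"

# R treats differently-capitalised names as completely separate objects
exists("city")  # TRUE: we created this object above
exists("City")  # FALSE: no object with a capital C exists
```

If you try to run City on its own, R will return an error telling you the object is not found.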
You can assign anything to an object. If we want to assign a series of numbers (or words), we use a similar concept to concatenate in Excel, simply to let R know that we are giving it more than one of something.
my_numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
Functions can now be applied to this object. To find the mean, we can simply run:
mean(my_numbers)
## [1] 5
Although if you wanted to save the answer, you’d need to assign it to an object too. For example:
mean_my_numbers <- mean(my_numbers)
You can remove objects from your global environment using the rm() function. Note that re-assigning something to an existing object with the same name will simply overwrite the original. So, if you don’t want to lose anything, use new names for any new objects you create. The following would get rid of the mean_my_numbers object.
rm(mean_my_numbers)
If you want to know more about a function, you can type ? before the function name and then run the code. For example, running ?mean will bring up the help page for the function mean in the help/plot window in the bottom right of your interface.
So far, we’ve been creating what are termed vectors. Vectors are the most basic structure of data in R, and are simply a collection of things of the same type. You can check what type a vector is using the function typeof(), or the (broader) class with class(). There are a few different types and classes of vector in R, and we recommend that you read this article about object types to understand them more.
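As a quick sketch of how these checking functions work, using the objects created above:

```r
my_numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
city <- "Amsterdam"

typeof(my_numbers)  # "double": numbers are stored as doubles by default
class(my_numbers)   # "numeric": the broader class of the vector

typeof(city)        # "character"
class(city)         # "character"
```

Notice that typeof() and class() can differ: a vector of numbers has type "double" but class "numeric".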
A data structure you will be familiar with is a data frame. In R, you can think of a data frame as a list of vectors (i.e. columns, variables) of equal length (i.e. same number of rows). For instance, we can create a new data frame in a similar manner to how we created vectors above, but pre-specify that we want to stick these together to create a data frame. This data frame has three variables (vectors, columns) of the same length (3 rows).
example_df <- data.frame(var1 = c("id1", "id2", "id3"),
var2 = c(1200, 5444, 8333),
var3 = c("Amsterdam", "Groningen", "Delft"))
example_df
## var1 var2 var3
## 1 id1 1200 Amsterdam
## 2 id2 5444 Groningen
## 3 id3 8333 Delft
You can ‘access’ the vectors within this dataframe using the $ symbol. For example, to print out the contents of var3 we can just run example_df$var3. We don’t dwell too much on this here, but for now, it’s important to be aware that the type of vector (i.e., column) you are dealing with will dictate what you can do with it. We take a look at this briefly in the exercise below. Whilst some of this might seem obvious, it’s worthwhile being aware of these different types and classes. You can complete the workshop without looking at the links above, but we recommend you return to them later on!
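To make this concrete, here is the $ symbol in action on the example_df data frame from above (re-created here so the chunk is self-contained):

```r
# Re-creating example_df from earlier for completeness
example_df <- data.frame(var1 = c("id1", "id2", "id3"),
                         var2 = c(1200, 5444, 8333),
                         var3 = c("Amsterdam", "Groningen", "Delft"))

# Access a single vector (column) with $
example_df$var3
## [1] "Amsterdam" "Groningen" "Delft"
```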
- Is my_numbers a vector? Don’t guess: check it using code.
- Using the $ symbol, calculate the mean of var2 in the example_df data frame above.
- Try doing the same with the var3 variable. What happens and why?

R comes with a huge amount of built-in functionality. You can load, manipulate, analyse and visualise data simply with the tools you have already upon download. In the above example, we used the function mean(). This is an example of a function which is available to use as standard in R. That said, the real power and versatility of R comes with additional packages which you can install (for free) from within the software. There are thousands of R packages out there, but there is a specific group of packages which have been designed around a common philosophy, and can be used in conjunction with one another. Collectively these are referred to as the tidyverse. You can read more about the tidyverse in the (legitimately) free book R for Data Science, which covers the fundamentals of data science in R.
Packages within the tidyverse.
To install the tidyverse in R, you can run the following code. Once installed, you tend not to have to re-install packages, so this is a rare occasion when you can simply type or copy and paste the code into the console and hit Enter, rather than typing it into the script editor. Because the tidyverse is quite extensive, containing a number of packages, the install may take a few minutes.
install.packages('tidyverse')
A summary of what has been installed will appear in the console. Warnings in red are not always bad, and we will cover them more in the live demonstration. You now have the tidyverse installed!
Installing packages is a one-time thing, but each time you open a new session in R, you need to load the packages you want to use. You can do this by ticking the box of whichever one you want to load in the packages tab of the help/plots window. However, it is good practice to load packages using code. You can do this using the library() function. Copy the following to load the tidyverse.
library(tidyverse)
Most of the time, you are going to be loading in your own data. This might be stored in a variety of formats, such as .dta or .sav depending on what software you are currently using, or you might have geographic data (for example .shp files). The chances are that R can load them, it just might require a different function.
A common format for storing data frames is .csv, so that’s the example we’re going to use here. We’re going to use the read_csv() function from the readr package within the tidyverse and assign the data to a new object. The contents of this function will be where the data is stored. If it’s online, this will be a URL, but if it’s on your local drive, then it will be a file path (relative to your working directory). I’ve put some data on GitHub for this workshop, so we can load it simply by running this, using the URL where the data is stored.
stops <- read_csv("https://github.com/langtonhugh/ESC2019_materials/raw/master/2018-12-greater-manchester-stop-and-search_edit.csv")
This will create an object called stops, but you can call it whatever you want. You’ll notice that a summary of this object will appear in your global environment. If you want to save the data for later use offline, first download the file to a local drive. You can do this manually, by following the link and saving as you might normally, but you can actually download the data within R using download.file(). The first ‘argument’, url, tells R where the data is, and the second argument, destfile, tells R where to save the data and what to call it. Assuming that you are working inside the RStudio project, the file location specified in destfile is relative to your project directory, which keeps everything self-contained.
download.file(url = "https://github.com/langtonhugh/ESC2019_materials/raw/master/2018-12-greater-manchester-stop-and-search_edit.csv", destfile = "data/stops_data.csv")
It’s always a good idea to store an unaltered copy of any data on your computer, just in case the original version is later removed from the website you downloaded it from.
- Take a look inside the data folder in order to check you have downloaded it properly. Does the logic of the folder structure, and relative file path from the R project, make sense?
- Load in this local copy of the data using read_csv(), but instead of the URL, you will need to refer to the file location on your computer.

Note again that this example demonstrates one benefit of using projects: we do not have to specify the whole working directory path (e.g., C:/users/joedart/myproject ...) because everything starts from the project folder by default.
Once you’ve loaded data into R (whichever way you’ve done it above), there are a few useful functions you can use to explore the data itself. We’ve given you a few examples in the exercise below. You’ll notice that for the first function, we’ve added a comment as a brief explanation. Anything you write after a # is treated as a comment and is ignored by R when running chunks of code. It’s useful to add comments as you go when writing code, explaining what you are doing and why. You will thank yourself for doing this when returning to code you wrote months earlier.
head(stops) # Print the top 6 rows (i.e. observations)
View(stops)
sum(is.na(stops))
nrow(stops)
summary(stops)
glimpse(stops)
- head() prints the top six rows of your data frame. What function can we use to print the bottom six rows? Remember: Google is your friend.
- How many rows are there in the stops data frame?
- What do you think is recorded in the variable clothing_rm?
- Print the column names of stops. We haven’t told you the function for this, but you might be able to guess. Failing that, Google!

As we saw earlier, and as you might have worked out in the exercise above, we can make use of the $ sign to generate descriptives about specific vectors (variables) within a data frame. To refer to a specific variable, you need to state the data frame object and then the variable name using this $ symbol. So, to count the number of missings for a specific variable age in stops you would run:
sum(is.na(stops$age)) # There are 136 missings in the age variable
Or to print a basic frequency table:
table(stops$age)
As we’ll see, one of the benefits of using the tidyverse is that you often don’t need to use $, which can look quite messy, but it’s definitely worth knowing. For instance, we could replicate the frequency count above using count(), which is part of the dplyr package in the tidyverse.
count(stops, age)
So far, we haven’t really made much use of the tidyverse, but some of the most useful functions are coming up. They are often termed ‘verbs’ because the function name corresponds more or less to what it’s doing to the object. Here, we focus on verbs that transform data: things like filtering, selecting columns and renaming variables. You will notice that we are covering transform before tidy. The precise ordering of your own workflow might differ, but one way or another, these transform functions will be super useful early on in your own work. The key package within the tidyverse for transforming data is dplyr.
We are going to cover some of the most common:

- filter() to choose rows in a data frame by one or more conditions
- select() to choose columns by name
- arrange() to order rows according to the values of one or more columns
- slice() to choose rows by row number
- rename() to rename variables
- mutate() to create new variables
- recode() to recode variables

Along the way, we will also cover something called a pipe (coded as %>%). The pipe will help us link all these helpful verb functions together, ensuring we write code which is easily written, read and understood! We will then take a look at grouping (using group_by()) and conditional statements (using if_else()).
We’ll be using the stop and search data again for this. Let’s say we are only interested in stop and search incidents in which the individual was arrested. We can filter the outcome variable by ‘Arrest’ and assign the new data frame to a new object. The first argument of filter() is the data frame, and the second is the condition. In this example we can do:
stop_arrests <- filter(stops, outcome == "Arrest")
- Check your new data frame using View(stop_arrests).
- Confirm the filter worked in another way, such as summarising outcome, or running a frequency count. Try out as many as you can.

In some cases, you might only be interested in a small number of variables. Here, we select only the gender, age and ethnicity of the individual, along with the outcome of the stop and search, and create a new data frame. Again, the first argument is the data frame object, and the remaining are optional ones specifying the names of the variables.
stops_subset <- select(stops, gender, age, ethnic, outcome)
If you are interested in all but one or two variables, it’s easier to specify what you don’t want rather than what you do want. Notice that by assigning this new select() function to an object stops_subset we are overwriting the original. It doesn’t matter here, but it’s worth remembering. Generally speaking, it’s best to try and not overwrite any existing objects unless having too many is slowing your computer down. By keeping them all, you retain a trail of everything you’ve done, and can go back to old objects later on.
- Say we are interested in every variable except legislation. In other words, we just want to remove this column from our dataset. How can we do this? There are multiple ways to do it using select(), but some are more concise than others. What’s the shortest bit of code you can come up with?

To order the rows of a data frame, we can use arrange(). So, for example, if we want rows to appear in chronological order by time from top to bottom, we can run:
stops_ordered <- arrange(stops, time)
- Reverse the ordering using arrange(). Take a look at the help documentation of the function for guidance (?arrange).
- What do you notice about the time column? If you want to learn more about manipulating time variables, take a look at lubridate. Time variables are a topic in themselves: let us know if you are interested in learning more, or doing an exercise on it.

You might want to select a specific set of observations using slice(). This is especially useful if you have your data in a meaningful order. Here, we just use it to demonstrate selecting the first 10 rows of the stops data frame. We are using the : symbol to denote a sequence from 1 to 10.
stops_first10 <- slice(stops, 1:10)
- Using slice() together with arrange(), create a new dataframe which contains the first ten stop and search incidents in chronological (time) order.
- The slice() function is just one way of selecting rows from a dataframe. What if we wanted to take a random sample of rows (observations)? We can’t use slice() for this, but there are a number of useful functions in the tidyverse (specifically, dplyr) which we can use. Find out how we could take a random sample of 100 stop and search incidents from the original stops dataframe.

Often, data has variables with excessively long names, or names which are not clear. The data here is not that bad, but some might need changing. For example, clothing_rm tells us whether clothing had to be removed as part of the stop and search, but this might not be clear to some people. The rename() function makes changing variable names fairly easy. Note that because we are changing a name, we just assign the change to the same object.
stops <- rename(stops, clothing_remove = clothing_rm)
- Rename the variable type to type_of_search.
- There is also a handy function for cleaning up variable names in the janitor package. The janitor package is not part of the tidyverse but it is designed to be compatible, so the functions in the package can be used just like we are learning here. You don’t need to use it here (unless you want to), but it is a super useful function that we think you should know about.

To simplify, ease interpretation, or for analysis, you might want to create a new variable. This new variable might be entirely new, or a construction from an existing variable. This is where we can use the mutate() function.
In the first example, let’s say we want to create an ID column for the stops data frame, whereby each row is assigned a number for the purposes of identification. Note that we make things a bit easier for ourselves here by using the nrow() function from earlier, instead of manually stating the number we need. We also use a colon : again to dictate that we want a sequence to run from 1 to the number of rows in stops. We are calling the new variable ‘ID’.
stops_with_id <- mutate(stops, ID = 1:nrow(stops))
- You will notice that when using mutate() the new variable is appended to the end of the dataframe. For an ID variable, we might intuitively want it at the beginning (i.e., the first column). Move the position of our new ID variable to the first column. Clue: you can use select() perfectly fine, but there are also other variations which can give you a more concise solution.

More commonly, you are going to recode existing variables to create a new one. Let’s say we just want a two category variable indicating whether the individual was below 18. We need to combine categories in the age variable to do this. You’ll notice that we are sandwiching the original categories with backward apostrophes (`). This is something that is needed when you are referring to variable categories or variable names which either have spaces or begin with a number. This kind of thing will take a while to get used to, but once you are used to these little rules, they will become second nature.
stops <- mutate(stops, age_recode = recode(
age,
`10-17` = "Below 18",
`18-24` = "Over 18",
`25-34` = "Over 18",
`over 34` = "Over 18",
`under 10` = "Below 18"
))
The new variable age_recode has now been created and appended to the stops data frame.
The pipe %>%
You have now learnt a bit about some key verbs within the tidyverse. You’ll have noticed that the first part of each function is the data frame object. For example, stops_ordered <- arrange(stops, day) is creating a new object called stops_ordered which is stops ordered by day. In more complex scripts, this can get quite repetitive and make your code unnecessarily long. There is a useful feature in the tidyverse, %>%, which allows you to ‘pipe’ the data frame through multiple lines of code, and through multiple verb functions. Using the pipe can make your code tidier and more concise.
In a basic example, we can rewrite the above arrange example using a pipe.
# This:
stops_ordered <- arrange(stops, day)
# Does exactly the same as this:
stops_ordered <- stops %>%
arrange(day)
As it stands, this isn’t really more efficient. However, as soon as we start doing more complex things, you will see the benefits. This is because you can chain multiple pipes together, like this pseudo-code:
df %>%
select(some_variables) %>%
arrange(one_thing) %>%
filter(another_thing == "this") %>%
slice(1:10)
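To see a chain like this actually run, here is a small made-up data frame (the city names are real, the numbers are invented purely for illustration) piped through several of the verbs we have covered:

```r
library(dplyr)

# A small, made-up data frame for illustration
cities_df <- data.frame(city  = c("Amsterdam", "Rotterdam", "Utrecht", "Delft"),
                        count = c(40, 25, 18, 6))

# Keep cities with more than 10 incidents, order them from
# highest to lowest, then take the top two rows
top_cities <- cities_df %>%
  filter(count > 10) %>%
  arrange(desc(count)) %>%
  slice(1:2)
```

Printing top_cities shows Amsterdam and Rotterdam: Delft was dropped by filter(), and slice() kept only the two largest counts after ordering.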
Given the above example, try the exercise below.
Using pipes (%>%), carry out the following:

- Start with the stops data frame.
- Select only gender, age, ethnic, outcome, and legislation.
- Rename ethnic to ethnicity.
- There is a function in the tidyverse (specifically, tidyr) for this. Look it up!
- Call the final data frame stops_piped.

Now we have covered pipes, let’s look at another super-useful function: group_by(). This is a key transform verb, but it is best used in conjunction with the pipe %>%.
To demonstrate group_by(), we want to make use of a new package. Usually you’ll download packages because they contain useful functions which allow you to do something, like analyse or visualise data (e.g., tidyverse), but packages often also contain data. You’ll have got a few with the tidyverse, but they are not crime specific. Lucky for you, Matt Ashby has released a package called crimedata which contains lots of open data about police recorded crime from a number of large cities in the United States. Install this package by editing the install.packages() code above for the crimedata package and then load it using library().
We can now easily access some US crime data using the function get_crime_data(). Let’s just choose the core datasets from the year 2017. We assign the data to the object crimes.
crimes <- get_crime_data(years = 2017, type = "core")
We can practice some of the skills from earlier to explore the data. You will notice that each row is an offence. For each offence, we have information like the type of offence, the city and the census block in which it occurred.
Before we go any further, let’s just subset the data to only include disorderly conduct offences. Do this now using the skills you learnt earlier, and assign the subset to a new dataframe object called disorder.
You will notice that there is a variable called city_name which tells us the city in which the crime occurred. Because each row is a crime, city names are repeated across rows, because each city had multiple offences during 2017. In this way, you can think of the data as ‘grouped’ because individual offences are nested within cities.
A typical research question when looking at this data might be: How many crimes occurred in each city? There are a few ways we can answer this question, but the most versatile is group_by(). This function allows us to perform operations (i.e., use functions, do stuff) on groups in our data. Once we group the data using group_by(), any subsequent functions will be applied to the group, rather than individual rows. So, to count the number of crimes (i.e., rows) nested within each city, we first group the data, then pipe the grouped data through to summarise(), in which we create a new variable called counts which is simply the number (n) of rows.
city_counts <- disorder %>%
group_by(city_name) %>%
summarise(counts = n())
city_counts # print the data frame to the Console
## # A tibble: 16 x 2
## city_name counts
## <fct> <int>
## 1 Austin 4208
## 2 Boston 781
## 3 Chicago 232
## 4 Detroit 295
## 5 Fort Worth 373
## 6 Kansas City 1594
## 7 Los Angeles 1227
## 8 Louisville 42
## 9 Memphis 121
## 10 Mesa 1121
## 11 Nashville 22
## 12 New York 1085
## 13 San Francisco 497
## 14 St Louis 1042
## 15 Tucson 4043
## 16 Virginia Beach 90
You will notice that summarise() creates a new data frame at whatever level we have specified in group_by(). This is subtly (but importantly) different from mutate() which adds a new variable to your existing data frame. Sometimes, using mutate() after group_by() does actually make sense, but what do you think would happen? Mini-exercise: try running the above code using mutate() and observe what happens to the output. Remember that you might want to create a new data frame object to compare the two.
One of the powerful things about group_by() is that you can input multiple groups. Say we were interested in offence counts in each city, but also by the census blocks nested within each city. Then we could try something like this:
city_block_counts <- disorder %>%
group_by(city_name, census_block) %>%
summarise(counts = n())
Take a look at the city_block_counts object using View() to examine what has happened.
So far, we have been using n() to count the number of rows in each unique group. But, there are plenty of other things you can do. For instance, you can also calculate the mean or standard deviations of groups in exactly the same way, but instead of n() you would state something like mean(variable_name).
To demonstrate this quickly, we load in calls for service data from Detroit, United States, during 2019. First, take a moment to understand the structure of the data using some of the functions we learnt earlier, such as glimpse() and View().
detroit_df <- read_csv("https://github.com/langtonhugh/nscr_graphics/raw/main/data/detroit_calls.csv")
You will notice that each row is an incident, and for each incident we have information, including the call classification, priority level, and the time spent at different stages of the response, amongst other things. Here, we group by the priority level of the call (1=highest priority, 5=lowest priority) and then calculate the mean. Note that we remove missings when making the calculations.
detroit_df %>%
group_by(priority) %>%
summarise(mean_response = mean(totalresponsetime, na.rm = TRUE))
## # A tibble: 6 x 2
##   priority mean_response
##      <dbl>         <dbl>
## 1        1         11.2
## 2        2         11.8
## 3        3         10.6
## 4        4         10.7
## 5        5         10.1
## 6       NA          2.59
Okay, the last function which we can use to wrangle our data is if_else(). This function is typically used to test conditions in one variable (i.e., ‘is this statement true?’), usually to then create a new value based on whether the statement is TRUE or FALSE. By way of example, the below code chunk would be a base R way of writing an if else statement. Essentially, we are testing the statement of whether x is greater or less than a specified value. We then print different outputs to the Console depending on whether it is true or not. Change the value of x to see what happens. You don’t have to understand this code in detail, but rather, just focus on understanding the logic.
x <- 2
if (x > 5) {
print("Greater than five")
} else if (x <= 5) {
print("Equal to or less than five")
}
## [1] "Equal to or less than five"
Lucky for us, the tidyverse makes it a lot easier to write such statements in conjunction with the pipe %>% and other verbs like mutate(). But the logic is the same. In the below example, we use the disorder data to create a new variable which simply tells us whether the offence took place in a residence or not. Note that we nest if_else() in mutate().
disorder <- disorder %>%
mutate(residential = if_else(condition = location_category == "residence",
true = "residential location",
false = "non-residential location"))
We could then, for example, easily check frequency counts. You will notice that our new variable is missing for incidents that already had a missing value in location_category.
count(disorder, residential)
## # A tibble: 3 x 2
## residential n
## <chr> <int>
## 1 non-residential location 4725
## 2 residential location 2412
## 3 <NA> 9636
- Choose more than one category in the location_category variable. Use if_else() to create a new variable which states whether the disorder incident occurred in one of these locations (or not). Note that this will involve more than one condition.
- Try doing the same using recode(). Which one do you prefer?

When searching online for help with the tidyverse, follow every Google search with the word ‘tidyverse’ or the specific package within the tidyverse (e.g., dplyr) if you know which one it is.

This material is based on both the materials and experience of teaching with colleagues from the Space Place and Crime working group and the Methods@Manchester summer school.