if (!require("rvest")) {
  install.packages("rvest")
}
if (!require("dplyr")) {
  install.packages("dplyr")
}


library(rvest)
library(dplyr) 

Web scraping

Requirements:

  1. Basic knowledge in R

  2. The following packages installed: rvest, dplyr

We will cover the following:

  1. What is web scraping and what can we use it for?

  2. HTML basics

  3. Scraping HTML data in R

1. What is web scraping and what can we use it for?

Web scraping refers to the extraction of data from a website. While this can be done manually, using R code can automate the process.

This is not always straightforward: websites are designed differently, and web scrapers need to be tailored to those specific designs. Some websites are even built to block scraping attempts, for example through some form of human verification. To cope with this, web scrapers vary in complexity, from taking a snapshot of a URL at a specific moment in time to mimicking human interaction with a website by clicking buttons and entering input (known as crawling; for more information, see https://www.zenrows.com/blog/web-scraping-r#is-r-good-for-web-scraping).

In this tutorial, I will present a way to scrape data from a website by taking a snapshot of the HTML code and extracting interesting information from there.

Say that we are interested in creating a table describing all the NSC-R meetings. We can achieve this by scraping the data from this page: https://nscrweb.netlify.app/blog.html.

2. HTML basics

Websites are created with HTML code that your browser renders. We can inspect that HTML code to access more information about the different elements on a page. For example, hover over a single meeting, right-click, and select “inspect”. What you see is the HTML code behind the visual display of the page.

This page, like many others, is made up of a head, which we will ignore for now, and a body. An HTML page contains nodes, including comments and elements, which are organized hierarchically, like a tree. If an element contains other elements, those are its children; if it sits inside another element, that element is its parent; and elements that share the same parent are its siblings.

If we inspect specific elements on the page, we notice that they have their own hierarchical structure made up of multiple tags. A tag begins with <tag> and ends with </tag>; everything between those two markers is the tag’s content.

The first string in each tag marks its type, which can be a link (a tag), a header (h1/h2/h3… tags), a division of text (div tag), and so on. Tags can also carry attributes, such as the class or id of a tag, which we need to know in order to pinpoint the location of the information we wish to extract. Any content inside a tag that is not another tag or a comment is its text.

Here is an example of an HTML tag encompassing a single meeting:

<a href="posts/2023-06-02-walk-in-hour/" class="post-preview">
  <script class="post-metadata" type="text/json">{"categories":[]}</script>
  <div class="metadata">
    <div class="publishedDate">June 2, 2023</div>
    <div class="dt-authors">
      <div class="dt-author">NSC-R Workshops Team</div>
    </div>
  </div>
  <div class="thumbnail">
    <img>
  </div>
  <div class="description">
    <h2>NSC-R Walk-In Hour</h2>
    <div class="dt-tags"></div>
    <p>Drop by and meet our team to discuss any R-related issues</p>
  </div>
</a>
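To make the tree structure concrete, here is a small, self-contained sketch (not part of the page itself) that parses a similar snippet with rvest and walks through it; minimal_html() simply wraps an HTML fragment in a complete document:

# Parse a small HTML fragment and navigate its tree
snippet <- minimal_html('
  <a href="posts/2023-06-02-walk-in-hour/" class="post-preview">
    <div class="description">
      <h2>NSC-R Walk-In Hour</h2>
      <p>Drop by and meet our team to discuss any R-related issues</p>
    </div>
  </a>')

post <- snippet |> html_element("a.post-preview")
post |> html_children()                    # the a tag's child elements
post |> html_element("h2") |> html_text()  # the text inside the h2 tag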

3. Scraping HTML data in R

Now that we understand the hierarchical structure of an HTML file, we can download the page and extract the relevant data.

link <- "https://nscrweb.netlify.app/blog.html"

page <- read_html(link)

page 
## {html_document}
## <html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="layout-listing">\r\n\r\n<!--radix_placeholder_front_matter-- ...

We know from inspecting the page that the meetings are displayed in a list, under a div tag with class “posts-list”. Let’s extract that list from the HTML file and save it into an object.

There are two ways to do this:

  1. A CSS selector, which in my experience is simple but somewhat rigid (the default).

  2. An XPath expression, which can be more complex but is more versatile.

# A CSS selector 
post_list_css <- page |>
  html_element(css = ".posts-list")
post_list_css
## {html_node}
## <div class="posts-list">
##  [1] <h1 class="posts-list-caption" data-caption="Workshops">Workshops</h1>
##  [2] <a href="posts/2023-06-02-walk-in-hour/" class="post-preview">\r\n<scrip ...
##  [3] <a href="posts/2023-05-23-webscraping/" class="post-preview">\r\n<script ...
##  [4] <a href="posts/2023-05-09-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
##  [5] <a href="posts/2023-05-04-walk-in-hour/" class="post-preview">\r\n<scrip ...
##  [6] <a href="posts/2023-04-25-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
##  [7] <a href="posts/2023-04-21-walk-in-hour/" class="post-preview">\r\n<scrip ...
##  [8] <a href="posts/2023-04-11-quarto/" class="post-preview">\r\n<script clas ...
##  [9] <a href="posts/2023-03-14-getting-help/" class="post-preview">\r\n<scrip ...
## [10] <a href="posts/2023-02-21-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
## [11] <a href="posts/2023-01-24-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
## [12] <a href="posts/2022-12-06-simulation/" class="post-preview">\r\n<script  ...
## [13] <a href="posts/2022-11-29-search-term-selection-systematic-reviews/" cla ...
## [14] <a href="posts/2022-11-15-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
## [15] <a href="posts/2022-11-01-conjunctive-analysis/" class="post-preview">\r ...
## [16] <a href="posts/2022-10-05-installing-rstudio-and-r/" class="post-preview ...
## [17] <a href="posts/2022-10-05-using-r-for-reproducible-social-science/" clas ...
## [18] <a href="posts/2022-07-08-summer-break/" class="post-preview">\r\n<scrip ...
## [19] <a href="posts/2022-06-09-writing-r-packages/" class="post-preview">\r\n ...
## [20] <a href="posts/2022-05-31-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
## ...
# An XPath expression
post_list_xpath <- page |>
  html_element(xpath = "//div[@class = 'posts-list']")

post_list_xpath 
## {html_node}
## <div class="posts-list">
##  [1] <h1 class="posts-list-caption" data-caption="Workshops">Workshops</h1>
##  [2] <a href="posts/2023-06-02-walk-in-hour/" class="post-preview">\r\n<scrip ...
##  [3] <a href="posts/2023-05-23-webscraping/" class="post-preview">\r\n<script ...
##  [4] <a href="posts/2023-05-09-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
##  [5] <a href="posts/2023-05-04-walk-in-hour/" class="post-preview">\r\n<scrip ...
##  [6] <a href="posts/2023-04-25-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
##  [7] <a href="posts/2023-04-21-walk-in-hour/" class="post-preview">\r\n<scrip ...
##  [8] <a href="posts/2023-04-11-quarto/" class="post-preview">\r\n<script clas ...
##  [9] <a href="posts/2023-03-14-getting-help/" class="post-preview">\r\n<scrip ...
## [10] <a href="posts/2023-02-21-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
## [11] <a href="posts/2023-01-24-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
## [12] <a href="posts/2022-12-06-simulation/" class="post-preview">\r\n<script  ...
## [13] <a href="posts/2022-11-29-search-term-selection-systematic-reviews/" cla ...
## [14] <a href="posts/2022-11-15-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
## [15] <a href="posts/2022-11-01-conjunctive-analysis/" class="post-preview">\r ...
## [16] <a href="posts/2022-10-05-installing-rstudio-and-r/" class="post-preview ...
## [17] <a href="posts/2022-10-05-using-r-for-reproducible-social-science/" clas ...
## [18] <a href="posts/2022-07-08-summer-break/" class="post-preview">\r\n<scrip ...
## [19] <a href="posts/2022-06-09-writing-r-packages/" class="post-preview">\r\n ...
## [20] <a href="posts/2022-05-31-nsc-r-tidy-tuesday/" class="post-preview">\r\n ...
## ...

In general, what we want to specify is the type of tag, the type of attribute, and the attribute value. We can add to or subtract from this specification to refer to more specific or more general elements.

For CSS, using “.” followed by the value of the class attribute is often enough (provided you are referring to the correct tag). For XPath, a bit more information is needed. A full description of how each works is beyond the scope of this meeting, but see these resources for XPath (https://devhints.io/xpath, http://xpather.com/) and this one (https://www.scrapingbee.com/blog/using-css-selectors-for-web-scraping/) for CSS.
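As a quick illustration, the two queries below (using the page object we already downloaded) select the same element, the h1 header of the post list, and both should return the “Workshops” caption; the CSS version combines a tag name and a class, while the XPath version spells out the same condition explicitly:

# CSS: tag name plus class
page |>
  html_element(css = "h1.posts-list-caption") |>
  html_text()

# XPath: the same element, written out explicitly
page |>
  html_element(xpath = "//h1[@class = 'posts-list-caption']") |>
  html_text()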

Now that we have the list of posts, we want to be able to refer to specific posts. In our case, these are the a tags with the class “post-preview”. The post list objects we created still refer to the whole list of posts inside the parent div tag. Let’s get a list of all the a tags, one per post:

# The first child of the list is its h1 header; selecting only the a tags drops it
post_list <- post_list_css |>
  html_elements("a") 

Note that html_elements() returns all matching elements, while html_element() returns only a single element: the first match.
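A short sketch on the post list we just extracted makes the difference concrete:

# html_elements() returns every matching a tag in the list
post_list_css |> html_elements("a") |> length()

# html_element() returns only the first match, i.e. the most recent post
post_list_css |> html_element("a")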

Now that we have a nice list of posts, we can dive into a single post and extract the data we want. Let’s look again at the HTML snippet for a single post:

<a href="posts/2023-06-02-walk-in-hour/" class="post-preview">
  <script class="post-metadata" type="text/json">{"categories":[]}</script>
  <div class="metadata">
    <div class="publishedDate">June 2, 2023</div>
    <div class="dt-authors">
      <div class="dt-author">NSC-R Workshops Team</div>
    </div>
  </div>
  <div class="thumbnail">
    <img>
  </div>
  <div class="description">
    <h2>NSC-R Walk-In Hour</h2>
    <div class="dt-tags"></div>
    <p>Drop by and meet our team to discuss any R-related issues</p>
  </div>
</a>

For each post, we can extract:

  1. Post title
  2. Author
  3. Description
  4. Date
  5. Link to meeting

Let’s use the first post to search for this data.

first_post <- post_list[[1]]
first_post
## {html_node}
## <a href="posts/2023-06-02-walk-in-hour/" class="post-preview">
## [1] <script class="post-metadata" type="text/json">{"categories":[]}</script>
## [2] <div class="metadata">\r\n<div class="publishedDate">June 2, 2023</div>\r ...
## [3] <div class="thumbnail">\r\n<img>\n</div>
## [4] <div class="description">\r\n<h2>NSC-R Walk-In Hour</h2>\r\n<div class="d ...
# The title text is in an h2 tag under the div with class "description"; extract it:
first_post |>
  html_element(".description") |>
  html_element("h2") |>
  html_text()
## [1] "NSC-R Walk-In Hour"
# The description of the meeting is in a p tag under description
first_post |>
  html_element(".description") |>
  html_element("p") |>
  html_text()
## [1] "Drop by and meet our team to discuss any R-related issues"

Luckily, this page’s HTML code is written in a very straightforward way. Sometimes, however, the code can be messy and elements can be repetitive. We can be more specific about how we refer to an HTML tag by using the “xpath” argument instead of the default css argument we have been using so far. This makes referring to a tag much more versatile:

# The date is under the div tag with class publishedDate, so we can refer to "Date" as a partial class value:
first_post |>
  html_element(xpath = ".//div[contains(@class, 'Date')]") |>
  html_text()
## [1] "June 2, 2023"
# The meeting leader is under the div tag with class dt-author:
first_post |>
  html_element(xpath = ".//div[@class = 'dt-author']") |>
  html_text()
## [1] "NSC-R Workshops Team"

We can also access information stored in a tag’s attributes rather than the text it contains:

# The link to the meeting information is in the a tag.
# We can access the value of attribute "href" with the following code:
first_post  |>
  html_attr(name = "href")
## [1] "posts/2023-06-02-walk-in-hour/"

Now that we know how to find all the data, let’s save it into a data frame.

nsc_r_meetings <- data.frame(title = character(0),
                             author = character(0),
                             date = character(0),
                             description = character(0),
                             href = character(0))

for (post in post_list) {
  title <- post |> html_element(".description") |>
    html_element("h2") |>
    html_text()
  
  description <- post |>
    html_element(".description") |>
    html_element("p") |>
    html_text()
  
  date <- post |>
    html_element(xpath = ".//div[contains(@class, 'Date')]") |>
    html_text()
  
  author <- post |>
    html_element(xpath = ".//div[@class = 'dt-author']") |>
    html_text()
  
  href <- post  |>
            html_attr(name = "href") 
  
  nsc_r_meetings <- nsc_r_meetings |>
    add_row(
      title = title,
      author = author,
      date = date,
      description = description,
      href = href
    )
}

nsc_r_meetings |>
  head() |>
  knitr::kable(caption = "NSC-R Meetings")
Table: NSC-R Meetings

| title | author | date | description | href |
|:------|:-------|:-----|:------------|:-----|
| NSC-R Walk-In Hour | NSC-R Workshops Team | June 2, 2023 | Drop by and meet our team to discuss any R-related issues | posts/2023-06-02-walk-in-hour/ |
| Web scraping | Danielle van Westbroek-Stibbe | May 22, 2023 | Use R to extract data from internet. | posts/2023-05-23-webscraping/ |
| NSC-R Tidy Tuesday | Wim Bernasco | May 9, 2023 | Improve your R skills in entertaining, inspiring and supportive sessions moderated by your own colleagues. | posts/2023-05-09-nsc-r-tidy-tuesday/ |
| NSC-R Walk-In Hour | NSC-R Workshops Team | May 4, 2023 | Drop by and meet our team to discuss any R-related issues | posts/2023-05-04-walk-in-hour/ |
| NSC-R Tidy Tuesday | Asier Moneva | April 25, 2023 | Improve your R skills in entertaining, inspiring and supportive sessions moderated by your own colleagues. | posts/2023-04-25-nsc-r-tidy-tuesday/ |
| NSC-R Walk-In Hour | NSC-R Workshops Team | April 21, 2023 | Drop by and meet our team to discuss any R-related issues | posts/2023-04-21-walk-in-hour/ |
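As an aside, html_element(), html_text(), and html_attr() are vectorised over a set of nodes, so the same data frame can also be built without an explicit loop. A minimal sketch, assuming the post_list object from above (the column names mirror the loop version):

nsc_r_meetings_vec <- data.frame(
  title       = post_list |> html_element(".description h2") |> html_text(),
  author      = post_list |>
    html_element(xpath = ".//div[@class = 'dt-author']") |>
    html_text(),
  date        = post_list |>
    html_element(xpath = ".//div[contains(@class, 'Date')]") |>
    html_text(),
  description = post_list |> html_element(".description p") |> html_text(),
  href        = post_list |> html_attr(name = "href")
)

Whether you prefer the loop or the vectorised form is largely a matter of taste; the loop is easier to debug post by post.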

After the data is extracted, you can begin with data cleaning and manipulation. Good luck!