Scraping with rvest

Wouter van Atteveldt

Web Scraping

  • Web Scraping: principles and challenges
  • Scraping with rvest
  • Lab: scraping wikipedia II

Web Scraping

  1. Request web pages
    • Follow navigation structure
    • Follow links / fill in forms / build URLs
    • Possibly requires login
  2. Interpret resulting HTML
    • Semi-structured data
    • Need to parse (interpret) page structure
    • Extract required information
  3. Store/process structured information
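
As a preview, a minimal sketch of these three steps with rvest (the URL and the ".teaser h2" selector are made-up placeholders):

library(rvest)
# 1. request a web page
page = read_html("http://example.com/overview.html")
# 2. interpret the HTML: extract the title of each teaser
titles = page %>% html_nodes(".teaser h2") %>% html_text
# 3. store the structured information
write.csv(data.frame(title=titles), "titles.csv", row.names=FALSE)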

Web Scraping: Tools

  • Barebones: HTTP requests and string parsing
  • HTML libraries (rvest, lxml, beautifulsoup)
  • APIs
  • API client libraries (e.g. RFacebook)

Web Scraping: Legal issues

  • I am not a lawyer!
    • Seek legal advice, esp. for commercial use
  • Is scraping allowed?
    • Authorized public access
    • obey robots.txt
    • don't hammer the server (see the sketch below)
    • Terms of use enforceable?
    • Login required?
  • Is storing allowed?
    • Copyright law
    • Facts, processed information probably OK
    • Don't publish the texts themselves, only extracted info
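
For the robots.txt and rate-limiting points above, a minimal sketch; it assumes the robotstxt package is installed, and target_urls is a hypothetical vector of pages to fetch:

library(robotstxt)
# check whether a page may be scraped according to the site's robots.txt
paths_allowed("https://en.wikipedia.org/wiki/Hong_Kong")
# be polite: pause between requests so you don't hammer the server
for (url in target_urls) {      # target_urls: a hypothetical vector of pages
  page = rvest::read_html(url)
  # ... process page ...
  Sys.sleep(1)                  # wait a second between requests
}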

Scraping with rvest

  • Web Scraping: principles and challenges
  • Scraping with rvest
    • CSS Selectors
    • Extracting information
    • Forms and links
    • Logging in
  • Lab: scraping wikipedia II

Why rvest?

  • Interpreting HTML as text is hard
    • optional elements, variations/errors possible
      • <a target="_blank" href="...">
      • <a href=site.html><p>link</a>
    • Finding nested information really hard
      • Especially if HTML contains errors, e.g. unbalanced tags
  • Search at level of node structure
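
A quick illustration: read_html repairs the malformed link from the example above, so the link can still be found by node structure (a minimal sketch):

library(rvest)
doc = read_html("<a href=site.html><p>link</a>")   # unquoted attribute, unclosed <p>
doc %>% html_nodes("a") %>% html_attr("href")      # should give "site.html"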

HTML as tree structure

<html>
  <head>
    <style>...</style>
  </head>
  <body>
    <h1>This is the head</h1>
    <p>This is <a href="..">a link</a></p>
  </body>
</html>

HTML as tree structure

  • html
    • head
      • style
    • body
      • h1
      • p
      • a [href=..]

rvest

  • Easy web scraping
    • Uses httr and xml2
  • Search structure rather than raw html
install.packages("rvest")
library(rvest)
test = read_html("test.html")     # parse the HTML file into a node tree
paras = html_nodes(test, "p")     # all <p> nodes in the document
links = html_nodes(paras, "a")    # all <a> nodes within those paragraphs
html_attr(links, "href")          # extract their href attributes
[1] "test2.html"

Aside: magrittr

[Image: a painting by René Magritte, after whom the magrittr pipe package is named]

Pipeline notation in rvest

paras = html_nodes(test, "p")
links = html_nodes(paras, "a")
html_attr(links, "href")
[1] "test2.html"
  • Pipeline notation:
    • x %>% f(y) is the same as f(x, y)
    • Simplifies chained function calls
test %>% html_nodes("p") %>% 
  html_nodes("a") %>% html_attr("href")
[1] "test2.html"

Using rvest

  • Search for elements
    • html_nodes, html_node
    • Use CSS Selectors
  • Fill in forms, follow links
    • html_form, submit_form
    • follow_link
  • Extract information
    • html_attr, html_text, html_table

CSS Selectors

  • Web Scraping: principles and challenges
  • Scraping with rvest
    • CSS Selectors
    • Extracting information
    • Forms and links
    • Logging in
  • Lab: scraping wikipedia II

CSS and HTML

  • HTML has limited set of tags
  • Cascading Style Sheets (CSS)
    • Specify style (font, color, border, placement)
    • Based on structure, tags, HTML classes
  • Allows for 'semantic markup'
    • HTML specifies structure, CSS layout

CSS Structure

<selector> {<attr>: <val>}
  • Selectors select groups of nodes:
    • Tag name: p {..}
    • Class: .main, p.main {..}
    • ID: #story, p#story {..}
    • Structure
      • direct child: p > a {..}
      • descendant (indirect child): p a {..}
  • Combinations
    • p.body a
    • #mainbody .title > a
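
A small sketch of such selectors in rvest, using a made-up HTML snippet:

library(rvest)
doc = read_html('<div id="mainbody">
                   <p class="title"><a href="title.html">title</a></p>
                   <p class="body">text with <a href="more.html">a link</a></p>
                 </div>')
doc %>% html_nodes("p.body a") %>% html_attr("href")              # should give "more.html"
doc %>% html_nodes("#mainbody .title > a") %>% html_attr("href")  # should give "title.html"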

Let's play a game!

Using CSS in rvest

  • html_node / html_nodes use CSS selectors
  • Make selection as complex as you want
  • Select in whole doc or within other nodes
  • E.g.: test_css.html

Using CSS in rvest

test2 = read_html("test_css.html")
test2 %>% html_nodes("a") %>% html_attr("href")
[1] "title.html" "test.html"  "legal.html"
test2 %>% html_nodes(".main a") %>% html_attr("href")
[1] "title.html" "test.html" 
test2 %>% html_nodes(".main") %>% html_nodes("a") %>% html_attr("href")
[1] "title.html" "test.html" 
test2 %>% html_nodes(".main p a") %>% html_attr("href")
[1] "test.html"

XPath as alternative

  • XPath is a general XML query language
  • Uses the XML structure (not CSS semantics)
  • Less convenient, but more powerful
  • Use file-system like paths:
    • //h2: h2 anywhere in file
    • //p/a: a directly under any p
    • ./p: p as direct child of the current node

XPath: axes

  • Can also look at siblings, ancestors, etc.
  • syntax: axis::nodetest[attributes]
  • Useful axes:
    • ancestor
    • parent
    • sibling
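
A small sketch of the parent axis on a made-up snippet (the following-sibling axis is used in the example below):

library(rvest)
doc = read_html('<p id="intro">This is <a href="a.html">a link</a></p>')
# start from the <a> node and move up to its parent <p> via the parent axis
doc %>% html_nodes("a") %>% html_nodes(xpath="parent::p") %>% html_attr("id")
# should give "intro"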

XPath: example

Get all text under 'Education' subheader

url = "https://en.wikipedia.org/wiki/Hong_Kong"
s = read_html(url)
headers = s %>% html_nodes("h3")                      # all h3 subheaders on the page
edu = headers[headers %>% html_text == "Education"]   # keep only the 'Education' header
text = edu %>% html_nodes(xpath="following-sibling::p") %>% html_text
text[1]
[1] "Hong Kong's education system used to roughly follow the system in England,[238] although international systems exist. The government maintains a policy of \"mother tongue instruction\" (Chinese: 母語教學) in which the medium of instruction is Cantonese,[239] with written Chinese and English, while some of the schools are using English as the teaching language. In secondary schools, 'biliterate and trilingual' proficiency is emphasised, and Mandarin-language education has been increasing.[240] The Programme for International Student Assessment ranked Hong Kong's education system as the second best in the world.[241]"

Useful resources

Extracting information from HTML

  • Web Scraping: principles and challenges
  • Scraping with rvest
    • CSS Selectors
    • Extracting Information
    • Forms and links
    • Logging in
  • Lab: scraping wikipedia II

Extracting information from HTML

  • html_name: name of tag(s)
  • html_attr(attr): specific attribute (e.g. href)
  • html_attrs: all attributes
  • html_text: the (plain) text

Extracting information from HTML

test2 %>% html_nodes("h1") %>% html_text
[1] "This is the head"
test2 %>% html_nodes(".main > *") %>% html_name
[1] "h1" "p" 
test2 %>% html_nodes("a") %>% html_attr("href")
[1] "title.html" "test.html"  "legal.html"
test2 %>% html_nodes(".footer a") %>% html_attrs
[[1]]
        href 
"legal.html" 

Extracting tabular info

t = read_html("http://i.amcat.nl/test/test_table.html")
tab = t %>% html_node("table") %>% html_table
class(tab)
[1] "data.frame"
head(tab)
  ID Name Gender
1  1 John      M
2  2 Mary      F

Show html structure

html_structure(t)
<html>
  <body>
    {text}
    <div.main>
      {text}
      <h1>
        {text}
      {text}
      <table [border]>
        <tr>
          <th>
            {text}
          {text}
          <th>
            {text}
          {text}
          <th>
            {text}
          {text}
        <tr>
          <td>
            {text}
          {text}
          <td>
            {text}
          {text}
          <td>
            {text}
          {text}
        <tr>
          <td>
            {text}
          {text}
          <td>
            {text}
          {text}
          <td>
            {text}
          {text}
    {text}

Write HTML to file (useful for debugging)

write_html(t, file="/tmp/test.html")
system("head /tmp/test.html")
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
    <div class="main">
      <h1>This is a simple table:</h1>
      <table border="1">
<tr>
<th>ID</th>
          <th>Name</th>
          <th>Gender</th>
        </tr>

Following links and Forms

  • Web Scraping: principles and challenges
  • Scraping with rvest
    • CSS Selectors
    • Extracting Information
    • Forms and links
    • Logging in
  • Lab: scraping wikipedia II

Submitting forms from R

search_url = "https://en.wikipedia.org/w/index.php?title=Special:Search"
session = html_session(search_url)                     # start a browsing session
form = session %>% html_node("#search") %>% html_form  # find the search form on the page
form = set_values(form, search="obama")                # fill in the query
resp = submit_form(session, form=form)                 # submit and get the results page

head(resp %>% html_nodes(".mw-search-result-heading") %>% html_text)
[1] "Barack Obama    "               "Family of Barack Obama    "    
[3] "Michelle Obama    "             "Barack Obama Sr.    "          
[5] "Presidency of Barack Obama    " "Crush on Obama    "            
head(resp %>% html_nodes(".mw-search-result-heading a") %>% html_attr("href"))
[1] "/wiki/Barack_Obama"               "/wiki/Family_of_Barack_Obama"    
[3] "/wiki/Michelle_Obama"             "/wiki/Barack_Obama_Sr."          
[5] "/wiki/Presidency_of_Barack_Obama" "/wiki/Crush_on_Obama"            

Following links

r2 = follow_link(resp, i="Michelle Obama")
r2 %>% html_nodes("h1,h2") %>% html_text
 [1] "Michelle Obama"                   "Contents"                        
 [3] "Family and education"             "Career"                          
 [5] "Barack Obama political campaigns" "First Lady of the United States" 
 [7] "References"                       "Further reading"                 
 [9] "External links"                   "Navigation menu"                 
r2 = follow_link(resp, css=".mw-search-result-heading a")
r2 %>% html_nodes("h1,h2") %>% html_text
 [1] "Barack Obama"           "Contents"              
 [3] "Early life and career"  "Presidential campaigns"
 [5] "Presidency (2009–2017)" "Post-presidency"       
 [7] "See also"               "Notes and references"  
 [9] "External links"         "Navigation menu"       
r2 = follow_link(resp, i="next 20")
r2 %>% html_nodes(".mw-search-result-heading") %>% html_text
 [1] "Barack Obama on social media    "                                
 [2] "Inauguration of Barack Obama    "                                
 [3] "Republican and conservative support for Barack Obama in 2008    "
 [4] "The Obama Nation    "                                            
 [5] "Barack Obama presidential campaign, 2012    "                    
 [6] "Early life and career of Barack Obama    "                       
 [7] "Barack Obama on mass surveillance    "                           
 [8] "Assassination threats against Barack Obama    "                  
 [9] "Barack Obama citizenship conspiracy theories    "                
[10] "Foreign policy of the Barack Obama administration    "           
[11] "Timeline of the presidency of Barack Obama    "                  
[12] "Barack Obama: Der schwarze Kennedy    "                          
[13] "Hope! – Das Obama Musical    "                                   
[14] "Barack Obama \"Joker\" poster    "                               
[15] "First inauguration of Barack Obama    "                          
[16] "Obama Anak Menteng    "                                          
[17] "Second inauguration of Barack Obama    "                         
[18] "List of things named after Barack Obama    "                     
[19] "Efforts to impeach Barack Obama    "                             
[20] "Obama Wins!    "                                                 

Iterating over search results

  • Option 1: 'click' next until no more results (see the sketch after this list)
  • Option 2: Manually build search result
    • Find out number of pages / results
    • Create for loop over pages
    • Process each page
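
A minimal sketch of option 1, reusing follow_link and the result selector from the previous slides; it assumes the 'next 20' link text is the same on every results page and stops once no such link is left:

all_links = c()
page = resp     # the search results page from submit_form in the earlier example
repeat {
  hits = page %>% html_nodes(".mw-search-result-heading a") %>% html_attr("href")
  all_links = c(all_links, hits)
  # follow_link raises an error when no matching link is left, i.e. on the last page
  page = tryCatch(follow_link(page, i="next 20"), error=function(e) NULL)
  if (is.null(page)) break
}
length(all_links)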

Iterating over search results: # of pages

q = '"City University of Hong Kong"'
session = html_session(search_url)
form = session %>% html_node("#search") %>% html_form %>% set_values(search=q)
r = submit_form(session, form)
info = r %>% html_nodes(".results-info strong") %>% html_text
i = as.integer(info[2])   # the second <strong> element holds the total number of results
i
[1] 436

Iterating over search results: create urls

q = RCurl::curlEscape(q)
maxpage = floor(i / 100)
offsets = (0:maxpage) * 100
template = "https://en.wikipedia.org/w/index.php?title=Special:Search&limit=100&offset=%i&search=%s"
urls = NULL
for(offset in offsets) {
  url = sprintf(template, offset, q)
  message("Offset:", offset, "; url:", url)
  results = read_html(url) %>% html_nodes(".mw-search-result-heading a") 
  links = results %>% html_attr("href")
  urls = c(urls, links)
}
length(urls)
urls[1:5]

Login required?

  • Web Scraping: principles and challenges
  • Scraping with rvest
    • CSS Selectors
    • Extracting Information
    • Forms and links
    • Logging in
  • Lab: scraping wikipedia II

Logging in to sites

  • Some sites require registration/login
  • Need to submit login form
  • Request other URLs within that session

Example: Scraping github

s = html_session("https://github.com/login")
form = html_form(s)[[1]]                 # take the first form on the page (the login form)
# 'password' should be set beforehand, e.g. read from a prompt or a file
form = set_values(form, login="vanatteveldt", password=password)
s = submit_form(s, form)

r = jump_to(s, "https://github.com/settings/emails")   # request a URL within the logged-in session
emails = r %>% html_nodes("ul#settings-emails li") %>% html_text

# remember? :)
stringi::stri_extract_first(emails, regex="[\\w\\.]+@[\\w\\.]+")
[1] "wouter@vanatteveldt.com" "w.h.van.atteveldt@vu.nl"

Logging in to sites: Problems

  • Sites can make it difficult to login (e.g. WSJ.com, LexisNexis)
    • Can always be circumvented, but it may be difficult
  • On registering, you agreed to their terms
    • This can include a ban on scraping
    • Always check legal issues first!
  • Often, API is better alternative (if offered)

Lab: scraping wikipedia II

  • Web Scraping: principles and challenges
  • Scraping with rvest
  • Lab: scraping wikipedia II

Lab: Scraping wikipedia II

  • Select an overview page:
    • that has a table with information
    • which links to detailed pages
    • e.g. list of cities in China
  • Extract information from the table into a dataframe
  • Follow links on all (or top 10) items
  • Add extra information to dataframe
    • e.g. population, description
  • Include (head of) table in report

Tables in Rmd files

  • Default tables are ugly
  • Use knitr::kable:
t = data.frame(id=1:3, names=c("John","Mary","Pete"))
knitr::kable(t)
 id  names
  1  John
  2  Mary
  3  Pete

Adding information to data frame

t = data.frame(id=11:13, names=c("John","Mary","Pete"))
for (i in seq_along(t$id)) {
  # get existing values
  id = t$id[i]
  # compute new value
  code = sprintf("code for %i", id)
  # store new value
  t$code[i] = code
}
knitr::kable(t)
 id  names  code
 11  John   code for 11
 12  Mary   code for 12
 13  Pete   code for 13