Web Scraping

Overview

What if you had an idea for an ecological study, but the data you needed wasn’t available to you? What if you wanted to validate one of your measures by comparing your estimates to external sources? What do you do?

Well, for one, you could go and get the data online. Web scraping (also called web harvesting or web data extraction) is a software technique for extracting information from websites. To extract data from an ordinary document, you would simply copy and paste the elements you want. For a website, this is a little trickier because of the way the information is formatted and stored, typically as HTML code. Scrapers therefore work by parsing the HTML source code of a page and retrieving specific elements from it.
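As a rough illustration of what "parsing the HTML source" means, the sketch below parses a small made-up page and pulls out only the elements of interest. It assumes the XML package is installed; the page content and the class name "price" are invented for this example.

```r
library(XML)  # provides htmlParse() and xpathSApply()

# A made-up page, supplied as a string so the example runs offline.
html <- '<html><body>
  <h1>Listings</h1>
  <div class="listing"><span class="price">$2,100</span></div>
  <div class="listing"><span class="price">$2,450</span></div>
</body></html>'

# Turn the raw HTML text into a document tree that can be queried.
doc <- htmlParse(html, asText = TRUE)

# XPath query: every <span> whose class attribute is "price",
# anywhere in the page. xmlValue returns the text inside each match.
prices <- xpathSApply(doc, "//span[@class='price']", xmlValue)
print(prices)
```

The heading and the surrounding markup are ignored; only the text of the matching elements comes back as a character vector.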

Description

Search engines use a specific type of scraper, called a web crawler or search bot, to crawl through web pages and identify which sites they link to and what terms they use. By that definition, web scrapers have been around since the first search engines appeared in the early nineties.
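The two things a crawler records, outgoing links and the terms a page uses, can be sketched in a few lines of R. This is only a toy under stated assumptions: it uses the XML package, the page is made up, and the "term count" is a crude split on non-letters rather than anything a real search engine does.

```r
library(XML)

# A made-up page with two outgoing links.
html <- '<html><body>
  <p>See <a href="http://example.org/a">page A</a> and
     <a href="http://example.com/b">page B</a>.</p>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# The href attribute of every anchor tag: the pages this page links to.
links <- as.character(xpathSApply(doc, "//a/@href"))

# A crude term count: visible text, lower-cased and split on non-letters.
text  <- xpathSApply(doc, "//body", xmlValue)
words <- strsplit(tolower(text), "[^a-z]+")[[1]]
words <- words[nzchar(words)]   # drop empty strings left by the split
terms <- table(words)

print(links)
print(terms)
```

A crawler would then repeat the same steps on each address in links, which is how it moves from page to page.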

Google and Facebook really brought scraping to another level. Google scraped the web to catalogue all of the information on the internet and make it accessible. Recently, Facebook has been using scrapers to help people find connections and fill out their social networks.

Legality

Well, that depends on what you think the meaning of “legality” is. While court precedents early this century set a permissive tone, even toward unscrupulous scraping of content, recent rulings have shifted towards a more conservative approach. Generally, if you have to agree to terms of use, if the data is available for purchase, or if the data is behind a login, you are in legally murky territory. Even if none of these conditions apply, you might still be in hot water.

Ethics

Here are some general ethical issues to consider prior to scraping:

1) Respect the hosting site’s wishes

Some websites publish instructions for bots and scrapers, outlining which elements can be scraped and which are off limits. These instructions live in a robots.txt file at the root of the site, which lists the paths that bots are asked not to crawl. Also, if you have to agree to any terms and conditions, be sure to read them thoroughly. Check whether an API exists or whether the data is otherwise available for download or sale.
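To make the robots.txt check concrete, here is a minimal base-R sketch. The file content is hypothetical, and disallowed_for_all() and is_allowed() are helper names invented for this example. It is a deliberate simplification: it only reads Disallow lines in the "User-agent: *" group and ignores Allow rules and wildcards, so treat it as a first check rather than a full parser.

```r
# Hypothetical robots.txt content; on a real site you would read it from
# the site root, e.g. readLines("http://www.example.com/robots.txt").
robots <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /search",
  "",
  "User-agent: BadBot",
  "Disallow: /"
)

# Collect the Disallow prefixes that apply to all user agents ("*").
disallowed_for_all <- function(lines) {
  agent <- NA_character_
  rules <- character(0)
  for (line in trimws(lines)) {
    if (grepl("^user-agent:", line, ignore.case = TRUE)) {
      # Remember which user agent the following rules belong to.
      agent <- trimws(sub("^[^:]*:", "", line))
    } else if (grepl("^disallow:", line, ignore.case = TRUE) &&
               identical(agent, "*")) {
      rule <- trimws(sub("^[^:]*:", "", line))
      if (nzchar(rule)) rules <- c(rules, rule)
    }
  }
  rules
}

# TRUE when the path does not start with any disallowed prefix.
is_allowed <- function(path, rules) {
  !any(startsWith(path, rules))
}

rules <- disallowed_for_all(robots)
print(rules)
print(is_allowed("/listings", rules))
print(is_allowed("/private/data", rules))
```

Running a check like this before building the list of URLs keeps disallowed paths out of the scrape from the start.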

2) Respect the hosting site’s bandwidth

Hosting websites costs money, and scraping takes up bandwidth. If you are familiar with denial-of-service attacks, aggressive scraping or sending bots to a website has a similar effect. Write responsible programs that limit bandwidth use. Wait a few seconds between requests, and try to scrape during off-peak hours. Finally, scrape only what you need.
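One way to build the "wait a few seconds between requests" advice into a program is sketched below in base R. The function name fetch_politely() and its arguments are assumptions made for this example; fetch can be any function that takes a URL and returns the page source (RCurl::getURL, for instance). The demonstration uses a stand-in fetcher so that it generates no network traffic.

```r
# fetch_politely() calls `fetch` on each URL and pauses between requests.
# `delay` is the pause in seconds; a few seconds is a reasonable default.
fetch_politely <- function(urls, fetch, delay = 3) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # Keep going if a single page fails; record NA for that page.
    pages[[i]] <- tryCatch(fetch(urls[i]), error = function(e) NA_character_)
    # No need to wait after the last request.
    if (i < length(urls)) Sys.sleep(delay)
  }
  names(pages) <- urls
  pages
}

# Offline demonstration with a stand-in fetcher (no network traffic).
fake_fetch <- function(u) paste("<html>", u, "</html>")
demo <- fetch_politely(c("page1", "page2"), fake_fetch, delay = 0.1)
str(demo)
```

Fetching pages one at a time is slower than requesting them all at once, but it spreads the load on the host and makes it easy to stop early once you have what you need.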

3) Respect the law

Some call it theft; some call it legitimate business practice. The fact that you can access the data doesn’t mean you can use it for your research. Some data is more sensitive than others, and time-sensitive data is an especially popular target. For instance, a successful bookmaker may want to have their lines listed to the betting public, but they obviously wouldn’t want their competitors to know. Read the terms of agreement if applicable, or just be more subversive.

Example Application

The following is a brief example of scraping one-bedroom apartment listings in Manhattan using R. The code can easily be adapted to other apartment sizes, locations, and amenities by setting a different search filter on Naked Apartments and pasting the updated URL below.

1) Get the webpage URL

url <- "http://www.nakedapartments.com/renter/listings/search?nids=23,211,6,21,203,191,194,18,24,76,204,205,10,14,195,1,5,25,93,206,22,17,207,13,155,16,72,2,9,20,19,73,7,208,209,192,8,74,210,11,4,3,26,212,12&aids=3&order=asc&sort=rent&page="

# Set the maximum number of search result pages (currently 800).

s <- as.character(seq_len(800))
urls <- paste0(url, s)

2) Scrape the lines of code

# Load the libraries

library(RCurl)    # getURL()
library(XML)      # htmlParse(), xpathSApply()
library(stringr)  # str_trim()

SOURCE <- getURL(urls, .encoding = "UTF-8") # Specify the encoding when dealing with non-Latin characters

3) Parse the HTML code to isolate the data

PARSED <- htmlParse(SOURCE)

# Price and neighborhood ([PATH] stands for the XPath expression of the listing element)

listings <- xpathSApply(PARSED, "[PATH]", xmlValue)

# Trim white space, then split each listing into price and neighborhood

listings <- str_trim(listings)
listings <- strsplit(listings, ", ")
tabs <- matrix(unlist(listings), ncol = 2, byrow = TRUE)
colnames(tabs) <- c("price", "neighborhood")

# Latitude and longitude (the leading // searches the whole page, not just the top level)

lat <- xpathSApply(PARSED, "//div[@id]/@data-latitude")
long <- xpathSApply(PARSED, "//div[@id]/@data-longitude")
tabs1 <- cbind(tabs, lat, long)
row.names(tabs1) <- seq(nrow(tabs1))
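Because the site's real markup is not shown here, it may help to see the parsing step run end to end on a toy page. Everything in the HTML below is hypothetical, including the "summary" class and the XPath expressions; the real site's structure may differ. The sketch uses base R's trimws() in place of str_trim() so that it needs only the XML package.

```r
library(XML)

# Hypothetical markup; the real site's structure and XPath may differ.
html <- '<html><body>
  <div id="a1" data-latitude="40.81" data-longitude="-73.95">
    <p class="summary">  $2,100, Harlem </p></div>
  <div id="a2" data-latitude="40.73" data-longitude="-73.99">
    <p class="summary">  $3,250, East Village </p></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Text of each summary line, trimmed and split into price and neighborhood.
# Splitting on comma-plus-space leaves the comma inside "$2,100" intact.
listings <- trimws(xpathSApply(doc, "//p[@class='summary']", xmlValue))
tabs <- matrix(unlist(strsplit(listings, ", ")), ncol = 2, byrow = TRUE)
colnames(tabs) <- c("price", "neighborhood")

# Coordinates are stored as attributes on the enclosing <div>.
lat  <- as.character(xpathSApply(doc, "//div[@id]/@data-latitude"))
long <- as.character(xpathSApply(doc, "//div[@id]/@data-longitude"))
tabs1 <- cbind(tabs, lat, long)
print(tabs1)
```

The result is a character matrix with one row per listing and four columns (price, neighborhood, lat, long), which is the shape the cleaning step expects.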

4) Clean and put elements into a dataframe

mydf <- data.frame(tabs1)
lats <- as.numeric(tabs1[,3])
longs <- as.numeric(tabs1[,4])

lats[lats==0] <- NA
longs[longs==0] <- NA

mydf[,3] <- lats
mydf[,4] <- longs

price <- mydf[,1]
price1 <- gsub("$", "", as.character(price), fixed=TRUE)
price2 <- gsub(",", "", as.character(price1), fixed=TRUE)
price3 <- as.numeric(price2)
mydf[,1] <- price3
head(mydf)

NEW <- mydf[complete.cases(mydf),]
table(complete.cases(NEW))

dat <- tapply(NEW$price, NEW$neighborhood, mean)
p <- as.matrix(dat)
p

p[order(p[,1]),]
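The cleaning and summary steps can be tried on a few made-up rows before running them on real output. The data below are invented for illustration; the sketch also folds the two gsub() calls into a single regular expression, which is a stylistic alternative rather than a change in behavior.

```r
# Made-up rows in the shape produced by the parsing step.
raw <- data.frame(
  price        = c("$2,100", "$2,500", "$3,250", "$3,000"),
  neighborhood = c("Harlem", "Harlem", "East Village", "East Village"),
  lat          = c("40.81", "0", "40.73", "40.72"),
  long         = c("-73.95", "0", "-73.99", "-73.98"),
  stringsAsFactors = FALSE
)

# Strip "$" and "," in one pass, then convert to numbers.
raw$price <- as.numeric(gsub("[$,]", "", raw$price))
raw$lat   <- as.numeric(raw$lat)
raw$long  <- as.numeric(raw$long)

# A coordinate of 0 is treated as missing, as in the code above.
raw$lat[raw$lat == 0]   <- NA
raw$long[raw$long == 0] <- NA

# Keep only complete rows, then average price by neighborhood,
# sorted from cheapest to most expensive.
clean <- raw[complete.cases(raw), ]
avg   <- sort(tapply(clean$price, clean$neighborhood, mean))
print(avg)
```

In this toy data the second Harlem listing is dropped because its coordinates are missing, so the Harlem average rests on a single listing; on real data it is worth checking how many rows each neighborhood keeps before comparing means.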

Readings

Textbooks & Chapters

HANRETTY, C. 2013. Scraping the web for arts and humanities.

Articles

NAN, X. Web scraping with R. In: ROAD2STAT, ed. 6th China R 2013 Beijing.

LEE, B. K. 2010. Epidemiologic research and Web 2.0–the user-driven Web. Epidemiology, 21, 760-3.

SIGNORINI, A., SEGRE, A. M. & POLGREEN, P. M. 2011. The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS One, 6, e19467.

CUNNINGHAM, J. A. 2012. Using Twitter to measure behavior patterns. Epidemiology, 23, 764-5.

CHEW, C. & EYSENBACH, G. 2010. Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak. PLoS One, 5, e14118.

[On ethics: Screen scraping: how to profit from your rival’s data]
http://www.bbc.co.uk/news/technology-23988890

[On ethics: Depends on what the meaning of the word “illegal” means]
http://www.distilnetworks.com/is-web-scraping-illegal-depends-on-what-the-meaning-of-the-word-is-is

[On ethics – Felony charges for screen scraper]
http://www.forbes.com/sites/andygreenberg/2012/11/21/security-researchers-cry-foul-over-conviction-of-att-ipad-hacker/

http://blog.hartleybrody.com/web-scraping/

[Programming with humanists: Reflections on raising an army of hacker-scholars]
http://openbookpublishers.com/htmlreader/DHP/chap09.html#ch09

Websites

[Charles DiMaggio on Web Scraping]
http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/styled-4/styled-6/code-13/

[Web scraping basics – Part I of III]
http://www.r-bloggers.com/web-scraping-in-r/

[Scraping Google Scholar]
http://www.r-bloggers.com/web-scraper-for-google-scholar-updated

[How to buy a used car with R]
http://www.r-bloggers.com/web-scraper-for-google-scholar-updated

[Commercial website for scrapers]
https://scraperwiki.com/

[Open-source Python framework for building scrapers]
http://scrapy.org/

Courses

BARBERA, P. 2013. NYU Politics Data Lab Workshop: Scraping Twitter and Web Data using R. Department of Politics, New York University.

STARKWEATHER, J. 2013. Five easy steps for scraping data from web pages. Benchmarks RSS Matters.
