The more data you collect, the better your models, but what if the data you want resides on a website? This is the problem of social media analysis, where the data comes from users posting content online and can be very unstructured. While some websites support data collection from their pages and even expose packages and APIs (such as Twitter), most web pages lack this capability and infrastructure. If you are a data scientist who wants to capture data from such web pages, you wouldn’t want to open all these pages manually and scrape them one by one. To push away the boundaries limiting data scientists from accessing such data, there are packages available in R. They are based on a technique known as ‘web scraping’, a method to convert data, whether structured or unstructured, from HTML into a form on which analysis can be performed. Let us look into the web scraping technique using R.

Harvest Data with “rvest”

Before diving into web scraping with R, one should know that this is, in my opinion, an advanced topic to begin working on. It is absolutely necessary to have a working knowledge of R. Hadley Wickham authored the rvest package for web scraping in R, which I will be demonstrating in this article. The package also requires the ‘selectr’ and ‘xml2’ packages to be installed. Let’s install the package and load it first.

#Installing the web scraping package rvest
install.packages("rvest")
library(rvest)

The way rvest works is straightforward and simple. Much like the way you and I manually scrape web pages, rvest requires identifying the webpage link as the first step. The page is then read and the appropriate tags need to be identified. We know that HTML organizes its content using various tags and selectors. These selectors need to be identified and marked so that their content is stored by the rvest package. We can then convert all the scraped data into a data frame and perform our analysis. Let’s take an example of capturing the content from a blog page – the PGDBA WordPress blog for analytics. We will look at one of the pages from their experiences section. The link to the page is: http://pgdbablog.wordpress.com/2015/12/10/pre-semester-at-iim-calcutta/
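
To see how these pieces fit together before we walk through them one by one, here is a minimal sketch of the workflow. The ‘.entry-title’ selector here is only an assumption for illustration – the actual tags for this page are identified later with the selector gadget.

#A minimal sketch of the rvest workflow
library(rvest)
url <- 'http://pgdbablog.wordpress.com/2015/12/10/pre-semester-at-iim-calcutta/'
webpage <- read_html(url)                            #step 1: read the page
title_html <- html_nodes(webpage, '.entry-title')    #step 2: pick a tag ('.entry-title' is assumed for illustration)
html_text(title_html)                                #step 3: convert the scraped nodes to text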

As the first step mentioned earlier, I store the web address in a variable url and pass it to the read_html() function. The url is read into memory, similar to the way we read csv files using the read.csv() function.

#Specifying the url for the desired website to be scraped
url <- 'http://pgdbablog.wordpress.com/2015/12/10/pre-semester-at-iim-calcutta/'

#Reading the HTML code from the website
webpage <- read_html(url)

Not All Content on a Web Page is Gold – Identifying What to Scrape

Web scraping starts after the url has been read. However, a web page can contain a lot of content and we may not need all of it. This is why web scraping is performed for targeted content. For this, we use the selector gadget. The selector gadget has an extension in Chrome and is used to pinpoint the names of the tags which we want to capture. If you don’t have the selector gadget and have not used it before, you can read about it using the command below in R. You can also install the gadget by going to the website http://selectorgadget.com/

#Know about the selector gadget
vignette("selectorgadget")

After installing the selector gadget, open the webpage and click on the content which you want to capture. Based on the content selected, the selector gadget generates the tag which was used to store it in HTML. The content can then be scraped by passing that tag (also known as a CSS selector) to the html_nodes() function and converting the result to text with the html_text() function. The sample code in R looks like this:

#Using the CSS selector (this example uses the ‘www.imdb.com’ website, so webpage here would hold an IMDb page read with read_html())
rating_html <- html_nodes(webpage, '.imdb-rating')   #'.imdb-rating' is the tag given by the selector gadget

#Converting the rating data to text
rating <- html_text(rating_html)

#Check the rating captured
rating

Simple! Isn’t it? Let’s take a step further and capture the content of our target webpage!

Scraping Your First Webpage

I chose a blog page because it is all text and serves as a good starting example. Let’s begin by capturing the date on which the article was posted. Using the selector gadget, clicking on the date revealed that the tag required to get this data was .entry-date

#Using CSS selectors to scrape the post date
post_date_html <- html_nodes(webpage,'.entry-date')

#Converting the post date to text
post_date <- html_text(post_date_html)

#Verify the date captured
post_date

"December 10, 2015"

It’s an old post! The next step is to capture the headings. However, there are two headings here. One is the title of the article and the other is the summary. Interestingly, both of them can be identified using the same tag. The beauty of the rvest package shows here: it can capture both of the headings in one go. Let’s perform this step.

#Using CSS selectors to scrape the title and title summary sections
title_summary_html <- html_nodes(webpage,'em')
 
#Converting the title data to text
title_summary <- html_text(title_summary_html)
 
#Check the title of the article
title_summary[2]
#Read the title summary of the article
title_summary[1]

The main title is stored as the second value in the title_summary vector. The first value contains the summary of the article. With this, the only section remaining is the main content. This is probably organized using the paragraph tag. We will use the ‘p’ tag to capture all of it.

#Using CSS selectors to scrape the blog content
content_data_html <- html_nodes(webpage,'p')
 
#Converting the blog content data to text
content_data <- html_text(content_data_html)
 
#Let's see how much content we have captured
length(content_data) #the output is 38

We see that the content_data has a length of 38. However, the website shows that there are only 11 paragraphs in the main content. Additional paragraphs which are captured are actually the comments, likes and other content after the main blog post. For our purposes, we will only read the first 11 values of the content data and not use the remaining text in our data frame.

#Reading the first 11 paragraphs of the article
content_data[1:11]

Since we have also captured the comments section, let us see how many comments were made. The selector gadget tells us that the .fn tag can be used to get the names of the people who commented on the article.

#Using CSS selectors to scrape the names of people who commented
comments_html <- html_nodes(webpage,'.fn')
 
#Converting the commenters to text
comments <- html_text(comments_html)
 
#Let's have a look at all the names
comments
#What are the total number of comments made?
length(comments) #8 comments
#How many different people made comments?
length(unique(comments)) #6 people

This is consistent with the article, where Gautam Kumar, the author of the article, and pgdbaunofficial, the page owner, made multiple comments. We will now convert our data into a data frame.

#convert all the data into a data frame
first_blog <- data.frame(Date = post_date,
                         Title = title_summary[2],
                         Description = title_summary[1],
                         content = paste(content_data[1:11], collapse = ''),
                         commenters = length(comments))
 
#Checking the structure of the data frame
str(first_blog) #all the features are factors and need to be converted into appropriate types
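
The str() output shows that all the columns are factors. As a hedged follow-up, here is a minimal sketch of converting them into more useful types; the date format string assumes an English locale, and the column names are the ones created above.

#Converting the factor columns into appropriate types (sketch; assumes an English locale for the month name)
first_blog$Date <- as.Date(as.character(first_blog$Date), format = "%B %d, %Y")
first_blog$Title <- as.character(first_blog$Title)
first_blog$Description <- as.character(first_blog$Description)
first_blog$content <- as.character(first_blog$content)

#Check the structure again
str(first_blog)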

This is a simple data frame with only five columns – Date, Title, Description, content and number of commenters. As long as we remain on the same website, the same code can be reused for all the articles. However, for a different website, we may need a different piece of code. Let’s try another post from the same blog first. The link is https://pgdbablog.wordpress.com/2015/12/18/pgdba-chronicles-first-semester/
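
Before scraping this second post, here is a hedged sketch of what such reuse could look like as a small helper function. The function name scrape_pgdba_post is made up for illustration, and the selectors ('.entry-date', 'em', 'p', '.fn') and the eleven-paragraph cutoff were identified for the first post, so they may need adjusting for other posts.

#A sketch of wrapping the scraping steps into a reusable function (scrape_pgdba_post is a hypothetical name)
scrape_pgdba_post <- function(url, n_paragraphs = 11) {
  webpage <- read_html(url)
  data.frame(Date = html_text(html_nodes(webpage, '.entry-date'))[1],
             Title = html_text(html_nodes(webpage, 'em'))[2],
             content = paste(html_text(html_nodes(webpage, 'p'))[1:n_paragraphs], collapse = ''),
             commenters = length(html_text(html_nodes(webpage, '.fn'))))
}

#Example usage on the first post
scrape_pgdba_post('http://pgdbablog.wordpress.com/2015/12/10/pre-semester-at-iim-calcutta/')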

#Specifying the url for the desired website to be scraped
url <- 'http://pgdbablog.wordpress.com/2015/12/18/pgdba-chronicles-first-semester/'
 
#Reading the HTML code from the website
webpage <- read_html(url)
 
#Using CSS selectors to scrape the post date
post_date_html <- html_nodes(webpage,'.entry-date')
 
#Converting the post date to text
post_date <- html_text(post_date_html)
 
#Let's have a look at the post date
post_date
 
#Using CSS selectors to scrape the title section
title_summary_html <- html_nodes(webpage,'em')
 
#Converting the title data to text
title_summary <- html_text(title_summary_html)
 
#Let's have a look at all the captured values
title_summary[1:6]

This one has six values – the first one is the summary, the next four are captions to the images and the last is the title heading. We can also capture the content and the comments similarly. Let’s try to capture the images, which are new in this post. The selector gadget shows the images have tags from ‘.wp-image-51’ to ‘.wp-image-54’. Let’s download the last image. I am going to use an alternative approach where I set up an html_session using the url.

#Setting an html_session
webpage <- html_session(url)
 
#Getting the image using the tag
Image_link <- webpage %>% html_nodes(".wp-image-54")
 
#Fetch the url to the image
img.url <- Image_link[1] %>% html_attr("src")
 
#Save the image as a jpeg file in the working directory
download.file(img.url, "test.jpg", mode = "wb")
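
If we wanted all four images rather than just the last one, a small loop over the numbered tags would do it. This is only a sketch that assumes the ‘.wp-image-51’ to ‘.wp-image-54’ tags reported by the selector gadget each match a single image.

#A sketch: downloading all four images by looping over the numbered tags
for (i in 51:54) {
  img_node <- webpage %>% html_nodes(paste0(".wp-image-", i))
  img.url <- img_node[1] %>% html_attr("src")
  download.file(img.url, paste0("image_", i, ".jpg"), mode = "wb")
}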

The Final Deed – Scraping Multiple Types of Content

As the last webpage, we will move out of the blog and use a more content-rich page. This time, we will capture the data from the Moneyball page on the IMDb website. The link we need is: http://www.imdb.com/title/tt1210166/

IMDb stores its content in well-organized tags such as #titleDetails, #titleDidYouKnow, #titleCast, etc. This makes it easy to scrape the page by specifying whichever content we need. The cast is also displayed in the form of a table, so we can use the table tag to capture it. Let’s capture the cast using the #titleCast tag versus using the table tag.

#Reading the IMDb page first
url <- 'http://www.imdb.com/title/tt1210166/'
webpage <- read_html(url)

#Scraping the cast using the titleCast tag
cast_html <- html_nodes(webpage, "#titleCast .itemprop span")
#Convert to text
cast <- html_text(cast_html)
#Let's see our cast for the Moneyball movie
cast
 
#Scraping the cast using the table tag
cast_table_html <- html_nodes(webpage, "table")
#Converting the cast to a table
cast_table <- html_table(cast_table_html)
#Checking the first table on the website which represents the cast
cast_table[[1]]

We see that there are no major differences, the only one being that cast_table is formatted as a table because we used the html_table() function instead of html_text().

The Beginning of Web Scraping

We can do a lot with web scraping if we know the right way to do it. The rvest package makes it very easy to scrape pages and capture content in the form of data frames or files. Besides scraping blogs and rating websites, we can also automate mundane tasks such as scraping jobs from job websites or content from LinkedIn. Most web scraping tasks are focused on getting data from web pages and then using it for analysis. As an alternative, we can also scrape pages using XPath expressions instead of CSS selectors. In this case, the ‘table’ tag becomes the //table XPath expression and data can be scraped in a similar fashion (see the short sketch after this paragraph). The captured data, once converted to data frames, can then be used for analysis and to learn more about what is happening on social media today. In the end, the process remains the same – find the web page, identify the tags to be captured, convert them to text and store them in a data frame. I’m sure this article made web scraping easier than when you first started reading it.
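
As a short sketch of that XPath alternative (reusing the IMDb page read earlier), html_nodes() also accepts an xpath argument in place of a CSS selector.

#Scraping the cast table with an XPath expression instead of a CSS selector
cast_table_xpath <- html_nodes(webpage, xpath = "//table")
cast_table <- html_table(cast_table_xpath)
cast_table[[1]]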


