
How to build a web scraper in R using readLines and grep?

I am quite new to R. I want to compile a 1-million-word corpus of newspaper articles, so I am trying to write a web scraper to retrieve newspaper articles from, e.g., the Guardian website: http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs .

The scraper is meant to start on one page, retrieve the article's body text, remove all HTML tags and save it to a text file. Then it should follow the links on that page to the next article, retrieve that one, and so on, until the file contains about 1 million words.

Unfortunately, I did not get very far with my scraper.

I used readLines() to get the website's source and would now like to get hold of the relevant lines in the code.

The relevant section in the Guardian uses this id to mark the body text of the article:

<div id="article-body-blocks">         
  <p>
    <a href="http://www.guardian.co.uk/politics/boris"
       title="More from guardian.co.uk on Boris Johnson">Boris Johnson</a>,
       the...a different approach."
  </p>
</div>

I tried to get hold of this section using various grep expressions with lookbehind, trying to get the line after this id, but I think they do not work across multiple lines. At least I cannot get it to work.

Could anybody help out? It would be great if somebody could provide me with some code I can continue working on!

Thanks.

You will face the problem of cleaning the scraped page if you really insist on using grep and readLines, but it can be done, of course. E.g.:

Load the page:

html <- readLines('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')

And with the help of str_extract from the stringr package and a simple regular expression you are done:

library(stringr)
# (?s) makes '.' match newlines too; without it the pattern cannot span lines
body <- str_extract(paste(html, collapse='\n'), '(?s)<div id="article-body-blocks">.*</div>')
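A quick illustration of why the (?s) flag is needed here, since by default '.' does not cross line breaks:

library(stringr)
str_extract("a\nb", "a.b")       # NA: '.' stops at the newline
str_extract("a\nb", "(?s)a.b")   # "a\nb": (?s) turns on dot-matches-newline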

Well, body looks ugly; you will have to clean out the <p> tags and scripts as well. This can be done with gsub and friends (nice regular expressions). For example:

gsub('<script(.*?)script>|<span(.*?)>|<div(.*?)>|</div>|</p>|<p(.*?)>|<a(.*?)>|\n|\t', '', body)
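If you go this route, you can append each cleaned article to a growing corpus file. A minimal sketch, assuming the gsub() result above was stored in a variable (the names cleaned and corpus.txt are just placeholders I made up):

# 'cleaned' holds the gsub() output; append mode lets articles accumulate
cat(cleaned, file = "corpus.txt", append = TRUE, sep = "\n")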

As @Andrie suggested, you should rather use packages built for this purpose. A small demo:

library(XML)
library(RCurl)
# Fetch the raw HTML as a single string
webpage <- getURL('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# Parse the HTML into a queryable tree
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding = 'UTF-8')
# Pull the text of every <p> inside the article body div
body <- xpathSApply(pagetree, "//div[@id='article-body-blocks']/p", xmlValue)

Now body contains clean text:

> str(body)
 chr [1:33] "The deputy prime minister, Nick Clegg, has said the government's regional growth fund will provide a \"snowball effect that cre"| __truncated__ ...
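Since the goal is a 1-million-word corpus, a rough word count of the extracted paragraphs might look like this (splitting on whitespace is only an approximation):

# Total whitespace-separated tokens across all paragraphs
sum(sapply(strsplit(body, "\\s+"), length))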

Update: the above as a one-liner (thanks to @Martin Morgan for the suggestion):

xpathSApply(htmlTreeParse('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs', useInternalNodes = TRUE, encoding='UTF-8'), "//div[@id='article-body-blocks']/p", xmlValue)
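To tie this back to the original goal (follow links and stop at about 1 million words), here is a minimal, unpolished sketch of the crawling loop built on the same XML/RCurl approach. The function name scrape_guardian, the file name corpus.txt, the politics-section link filter, and the 1e6 target are all my assumptions, not anything from the site; a real crawler should also respect robots.txt and rate-limit its requests.

library(XML)
library(RCurl)

scrape_guardian <- function(start_url, outfile = "corpus.txt", target_words = 1e6) {
  queue <- start_url          # URLs still to visit
  seen  <- character(0)       # URLs already visited
  words <- 0
  while (words < target_words && length(queue) > 0) {
    url   <- queue[1]
    queue <- queue[-1]
    if (url %in% seen) next
    seen  <- c(seen, url)
    page  <- tryCatch(getURL(url), error = function(e) NULL)
    if (is.null(page)) next
    tree  <- htmlTreeParse(page, useInternalNodes = TRUE, encoding = 'UTF-8')
    text  <- xpathSApply(tree, "//div[@id='article-body-blocks']/p", xmlValue)
    if (length(text) > 0) {
      cat(text, file = outfile, append = TRUE, sep = "\n")
      words <- words + sum(sapply(strsplit(text, "\\s+"), length))
    }
    # Collect further article links; restricting to the politics section
    # is an assumption to keep the crawl on topic
    links <- unlist(xpathSApply(tree, "//a/@href"))
    links <- links[grepl("^http://www.guardian.co.uk/politics/", links)]
    queue <- c(queue, setdiff(links, seen))
    free(tree)                # release the parsed document's memory
  }
  words
}

# scrape_guardian('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')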
