
How to build a web scraper in R using readLines and grep?

I am quite new to R. I want to compile a 1-million-word corpus of newspaper articles, so I am trying to write a web scraper to retrieve newspaper articles from, e.g., the Guardian website: http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs

The scraper is meant to start on one page, retrieve the article's body text, remove all tags, and save it to a text file. It should then follow the links on that page to the next article, and so on, until the file contains about 1 million words.

Unfortunately, I did not get very far with my scraper.

I used readLines() to get the website's source and would now like to get hold of the relevant lines in the code.

The relevant section in the Guardian uses this id to mark the body text of the article:

<div id="article-body-blocks">         
  <p>
    <a href="http://www.guardian.co.uk/politics/boris"
       title="More from guardian.co.uk on Boris Johnson">Boris Johnson</a>,
       the...a different approach."
  </p>
</div>

I tried to get hold of this section using various expressions with grep and lookbehind, trying to capture the lines after this id, but I think it does not work across multiple lines. At least I cannot get it to work.
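To make the problem concrete, here is a toy snippet (hypothetical content, not the live page) showing that grep() only finds the line containing the id, while collapsing the lines into one string lets a regex span the whole block:

```r
# Toy HTML standing in for the output of readLines() (hypothetical content)
html <- c('<div id="article-body-blocks">',
          '  <p>Boris Johnson, the...a different approach."</p>',
          '</div>',
          '<div id="footer">unrelated</div>')

# grep() matches line by line, so it can only return the index of the id line:
grep('article-body-blocks', html)

# Collapsing into one string lets a single regex span the block;
# (?s) makes '.' match newlines, and .*? keeps the match lazy:
page <- paste(html, collapse = '\n')
body <- regmatches(page,
                   regexpr('(?s)<div id="article-body-blocks">.*?</div>',
                           page, perl = TRUE))
```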

Could anybody help out? It would be great if somebody could provide me with some code I can continue working on!

Thanks.

You will face the problem of cleaning the scraped page if you really insist on using grep and readLines, but it can be done of course. E.g.:

Load the page:

html <- readLines('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')

And with the help of str_extract from the stringr package and a simple regular expression you are done:

library(stringr)
body <- str_extract(paste(html, collapse='\n'), '(?s)<div id="article-body-blocks">.*</div>')

Well, body looks ugly; you will have to clean it of <p> tags and scripts as well. This can be done with gsub and friends (nice regular expressions). For example:

gsub('<script(.*?)script>|<span(.*?)>|<div(.*?)>|</div>|</p>|<p(.*?)>|<a(.*?)>|\n|\t', '', body)
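For example, on a toy value of body (hypothetical content standing in for the real extract); note that the pattern above leaves closing </a> tags behind, so this sketch adds </a> to the alternation and collapses the leftover whitespace:

```r
# Toy extract standing in for `body` (hypothetical content)
body <- paste0('<div id="article-body-blocks">\n  <p>\n    ',
               '<a href="http://www.guardian.co.uk/politics/boris" ',
               'title="More from guardian.co.uk on Boris Johnson">Boris Johnson</a>,\n',
               '    the...a different approach."\n  </p>\n</div>')

# Same pattern as above, extended with </a>:
clean <- gsub(paste0('<script(.*?)script>|<span(.*?)>|<div(.*?)>|</div>|',
                     '</p>|<p(.*?)>|<a(.*?)>|</a>|\n|\t'),
              '', body)
clean <- trimws(gsub(' +', ' ', clean))  # collapse leftover runs of spaces
```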

As @Andrie suggested, you should rather use some packages built for this purpose. A small demo:

library(XML)
library(RCurl)
webpage <- getURL('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding='UTF-8')
body <- xpathSApply(pagetree, "//div[@id='article-body-blocks']/p", xmlValue)

Where body results in clean text:

> str(body)
 chr [1:33] "The deputy prime minister, Nick Clegg, has said the government's regional growth fund will provide a \"snowball effect that cre"| __truncated__ ...
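Since the goal is a 1-million-word corpus, the character vector that xpathSApply returns can be collapsed and its words counted before appending to the corpus file. A minimal sketch (toy paragraphs standing in for the real output; the file name "corpus.txt" is made up):

```r
# Toy paragraphs standing in for the vector xpathSApply() returns
paras <- c("The deputy prime minister, Nick Clegg, has said the fund",
           "will provide a snowball effect that creates jobs")
text <- paste(paras, collapse = "\n")

# Split on runs of whitespace to get a rough word count
word_count <- length(strsplit(text, "\\s+")[[1]])

# Append to the corpus and keep a running total until it passes 1e6:
# cat(text, "\n", file = "corpus.txt", append = TRUE)
```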

Update: the above as a one-liner (thanks to @Martin Morgan for the suggestion):

xpathSApply(htmlTreeParse('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs', useInternalNodes = TRUE, encoding='UTF-8'), "//div[@id='article-body-blocks']/p", xmlValue)
