How to optimise scraping with getURL() in R

I am trying to scrape all bills from two pages on the website of the French lower chamber of parliament. The pages cover 2002-2012 and represent less than 1,000 bills each.

For this, I scrape with getURL through this loop:

library(RCurl)    # getURL()
library(stringr)  # str_extract_all()

b <- "http://www.assemblee-nationale.fr" # base
l <- c("12","13") # legislature id

lapply(l, FUN = function(x) {
  print(data <- paste(b, x, "documents/index-dossier.asp", sep = "/"))

  # scrape the index page and extract the dossier links
  data <- getURL(data); data <- readLines(tc <- textConnection(data)); close(tc)
  data <- unlist(str_extract_all(data, "dossiers/[[:alnum:]_-]+.asp"))
  data <- paste(b, x, data, sep = "/")
  # download every dossier page and dump the raw HTML to disk
  data <- getURL(data)
  write.table(data, file = n <- paste("raw_an", x, ".txt", sep = "")); str(n)
})

Is there any way to optimise the getURL() function here? I cannot seem to use concurrent downloading by passing the async=TRUE option, which gives me the same error every time:

Error in function (type, msg, asError = TRUE)  : 
Failed to connect to 0.0.0.12: No route to host
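For reference, here is roughly what the concurrent call looks like (a sketch; according to the RCurl documentation, getURL() with a vector of URLs and async = TRUE dispatches to getURIAsynchronous(); the dossier paths below are made up):

urls  <- paste(b, "12", c("dossiers/exemple_1.asp", "dossiers/exemple_2.asp"), sep = "/") # hypothetical paths
pages <- getURL(urls, async = TRUE)   # equivalent to getURIAsynchronous(urls)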

Any ideas? Thanks!

Try mclapply {multicore} instead of lapply.

"mclapply is a parallelized version of lapply, it returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X." “mclapply 是 lapply 的并行版本,它返回一个与 X 长度相同的列表,其中每个元素都是将 FUN 应用于 X 的相应元素的结果。” ( http://www.rforge.net/doc/packages/multicore/mclapply.html ) ( http://www.rforge.net/doc/packages/multicore/mclapply.html )

If that doesn't work, you may get better performance using the XML package. Functions like xmlTreeParse use asynchronous calling.

"Note that xmlTreeParse does allow a hybrid style of processing that allows us to apply handlers to nodes in the tree as they are being converted to R objects. This is a style of event-driven or asynchronous calling." “请注意,xmlTreeParse 确实允许混合处理样式,允许我们在树中的节点被转换为 R 对象时将处理程序应用于树中的节点。这是一种事件驱动或异步调用的样式。” ( http://www.inside-r.org/packages/cran/XML/docs/xmlEventParse ) ( http://www.inside-r.org/packages/cran/XML/docs/xmlEventParse )

Why use R? For big scraping jobs you are better off using something already developed for the task. I've had good results with Down Them All, a browser add-on. Just tell it where to start, how deep to go, what patterns to follow, and where to dump the HTML.

Then use R to read the data from the HTML files.

Advantages are massive - these add-ons are developed especially for the task so they will do multiple downloads (controllable by you), they will send the right headers so your next question won't be 'how do I set the user agent string with RCurl?', and they can cope with retrying when some of the downloads fail, which they inevitably do.

Of course the disadvantage is that you can't easily start this process automatically, in which case maybe you'd be better off with 'curl' on the command line, or some other command-line mirroring utility.
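If you do want to kick it off from R anyway, a bare-bones sketch is to shell out to curl for each index page (the output file names here are arbitrary):

for (x in l) {
  url <- paste(b, x, "documents/index-dossier.asp", sep = "/")
  system2("curl", c("--retry", "3", "-o", paste("index_", x, ".html", sep = ""), url))
}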

Honestly, you've got better things to do with your time than write website code in R...
