
Using readHTMLTable from XML package to scrape site, uncertain error message

I'm using the XML package to scrape a list of websites. Specifically, I'm taking ratings from a list of candidates at the following site: votesmart.

The candidates' pages are arranged in numerical order, from 1 upwards. My first attempt, to scrape the first 50 candidates, looks like this:

library(XML)
library(plyr)

url <- paste("http://www.votesmart.org/candidate/evaluations/", 1:50 , sep = "")
res <- llply(url, function(i) readHTMLTable(i))

But there are a couple of problems. For instance, the 25th page in this sequence generates a 404 "url not found" error. I've addressed this by first getting a data frame of the count of XML errors for each page in the sequence, and then excluding the pages which have a single error. Specifically:

errors <- ldply(url, function(i) length(getXMLErrors(i)))
url2 <- url[which(errors$V1 > 1)]
res2 <- llply(url2, function(i) readHTMLTable(i))

In this way, I've excluded the 404-generating URLs from the list.

However, there's still a problem: numerous pages in the list cause these llply commands to fail. The following is an example:

readHTMLTable("http://www.votesmart.org/candidate/evaluations/6")

which results in the error:

Error in seq.default(length = max(numEls)) : 
  length must be non-negative number
In addition: Warning message:
In max(numEls) : no non-missing arguments to max; returning -Inf

However, these pages generate the same error count from the getXMLErrors command as the working pages, so I'm unable to distinguish between them on this front.

My question is: what does this error mean, and is there any way to get readHTMLTable to return an empty list for these pages, rather than an error? Failing that, is there a way I can get my llply statement to check these pages and skip those which result in an error?

Why not just some simple error handling?

res <- llply(url, function(i) try(readHTMLTable(i)))
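
Following up on that, here is a minimal sketch (not part of the original answer, just base R plus plyr) of two ways to deal with the failed pages afterwards: drop the entries that errored by checking their class with inherits(), or use tryCatch() so that a failing page yields an empty list, as asked in the question.

# Option 1: wrap the call in try() and drop the entries that failed
res <- llply(url, function(i) try(readHTMLTable(i), silent = TRUE))
ok <- !sapply(res, inherits, what = "try-error")
res_clean <- res[ok]

# Option 2: return an empty list for any page that throws an error
res2 <- llply(url, function(i)
  tryCatch(readHTMLTable(i), error = function(e) list()))

Either way the loop runs to completion over all 50 URLs, and the pages that would have stopped llply (the 404s and the seq.default error) simply come back as try-error objects or empty lists that can be filtered out.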
