How do I scrape multiple pages with XML and ReadHTMLTable?

I'm using the XML package to scrape results from the Chicago marathon into a CSV. The problem is that the site can only display 1,000 runners on a single page, so I have to scrape multiple pages. The script I've written so far works for the first page:

rm(list=ls())

library(XML)

page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?page", 
page_numbers, 
sep = "="
)

tables <- list()
for (i in page_numbers) {
  tables[[i]] <- readHTMLTable(urls[i])
}
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

times <- tables[[which.max(n.rows)]]

How can I use this code to scrape all 21 pages to get the complete results? Should I use a for() loop, an lapply function, or something else? I'm a bit lost here.

Thanks!

Add the page number to each URL.

page_numbers <- 1:1429
urls <- paste(
  "http://results.public.chicagomarathon.com/2011/index.php?pid=list&page", 
  page_numbers, 
  sep = "="
)

Now loop over each page, scraping each one. It doesn't matter too much whether you use a for loop or an *apply function. See, e.g., Circle 4 of the R Inferno (PDF) for a discussion of the difference between 'for' loops and 'lapply'.
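
For example, here is a minimal sketch using lapply (this assumes the results table is the first table readHTMLTable finds on each page, and that every page yields the same columns):

tables <- lapply(urls, function(u) {
  # Take the first table on each page as the results table
  readHTMLTable(u, stringsAsFactors = FALSE)[[1]]
})

# Stack the per-page data frames into a single results table
results <- do.call(rbind, tables)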

Here is an approach that works. The reason your approach failed was that the URL you built did not describe the entire web page. A little playing around gives the correct URL format for each page, after which things fall into place.

# Fixed pieces of the URL; the page number gets pasted in between
url1 = 'http://results.public.chicagomarathon.com/2011/index.php?page='
url3 = '&content=list&event=MAR&num_results=25'

# GET TABLE FROM PAGE NUMBER
getPage <- function(page){
  require(XML)
  url = paste(url1, page, url3, sep = "")
  tab = readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
  return(tab)
}

require(plyr)
# for some reason ldply fails, hence the llply + rbind workaround
pages    = llply(1:10, getPage, .progress = 'text') 
marathon = do.call('rbind', pages)
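
To get the complete results rather than a ten-page sample, extend the same call over the full page range from the question (1,429 pages of 25 results each); expect it to take a while:

pages    = llply(1:1429, getPage, .progress = 'text')
marathon = do.call('rbind', pages)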
