How do I scrape multiple pages with XML and ReadHTMLTable?

I'm using the XML package to scrape results from the Chicago marathon into a CSV. The problem is that the site can only display 1,000 runners on a single page, so I have to scrape multiple pages. The script I've written so far works for the first page:

rm(list=ls())

library(XML)

page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?page", 
page_numbers, 
sep = "="
)

tables <- list()
for (i in page_numbers) {
  tables[[i]] <- readHTMLTable(urls[i])
}
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

times <- tables[[which.max(n.rows)]]

How can I use this code to scrape all 21 pages to get the complete results? Should I use a for() loop, an lapply function, or something else? I'm a bit lost here.

Thanks!

Add the page number to each URL.

page_numbers <- 1:1429
urls <- paste(
  "http://results.public.chicagomarathon.com/2011/index.php?pid=list&page", 
  page_numbers, 
  sep = "="
)

Now loop over each page, scraping each one. It doesn't matter too much whether you use a for loop or an *apply function. See, e.g., Circle 4 of the R Inferno (PDF) for a discussion of the difference between 'for' loops and 'lapply'.
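
For example, here is a minimal sketch using lapply (this assumes the results table is the first table readHTMLTable finds on each page, and that every page yields the same columns):

tables <- lapply(urls, function(u) {
  # Take the first table on each page as the results table
  readHTMLTable(u, stringsAsFactors = FALSE)[[1]]
})

# Stack the per-page data frames into a single results table
results <- do.call(rbind, tables)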

Here is an approach that works. The reason your approach failed was that the URL you built did not describe the entire web page. A little playing around gives the correct URL format for each page, after which things fall into place.

# Fixed pieces of the URL; the page number gets pasted in between
url1 = 'http://results.public.chicagomarathon.com/2011/index.php?page='
url3 = '&content=list&event=MAR&num_results=25'

# GET TABLE FROM PAGE NUMBER
getPage <- function(page){
  require(XML)
  url = paste(url1, page, url3, sep = "")
  tab = readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
  return(tab)
}

require(plyr)
# for some reason ldply fails, hence the llply + rbind workaround
pages    = llply(1:10, getPage, .progress = 'text') 
marathon = do.call('rbind', pages)
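
To get the complete results rather than a ten-page sample, extend the same call over the full page range from the question (1,429 pages of 25 results each); expect it to take a while:

pages    = llply(1:1429, getPage, .progress = 'text')
marathon = do.call('rbind', pages)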
