How to scrape multiple pages using XML and readHTMLTable?

Question

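I am using the XML package to write the Chicago Marathon results to a CSV. The problem is that the site can only display 1,000 runners on a single page, so I have to scrape multiple pages. The script I have written so far works for the first page: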

rm(list=ls())

library(XML)

page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?page", 
page_numbers, 
sep = "="
)

tables <-(for i in page_numbers){
readHTMLTable(urls)
}
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

times <- tables[[which.max(n.rows)]]

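How can I use this code to scrape all 21 pages and get the complete results? Should I use a for() loop, an lapply function, or something else? I am a bit lost here.

Thanks!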

Answer 1

Add the page number to each URL.

page_numbers <- 1:1429
urls <- paste(
  "http://results.public.chicagomarathon.com/2011/index.php?pid=list&page", 
  page_numbers, 
  sep = "="
)

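Now loop over the pages, scraping each one. It does not much matter whether you use a for loop or an *apply function. See, for example, Circle 4 of the R Inferno (pdf) for a discussion of the difference between 'for' loops and 'lapply'.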
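The loop described above could be sketched as follows. This is only a sketch under stated assumptions: `scrape_pages`, `scrape_pages_loop`, and the `reader` argument are hypothetical names introduced for illustration, and the default reader assumes that the first table on each page holds the results and that the site still serves these URLs.

```r
# Hypothetical helper (not from the original answer): read one table per URL
# and stack the results into a single data frame.
# `reader` is an assumed hook so the page-reading step can be swapped out,
# for example with a stub when the site is unreachable.
scrape_pages <- function(urls,
                         reader = function(u) {
                           # Assumes the first table on the page is the results table.
                           XML::readHTMLTable(u, stringsAsFactors = FALSE)[[1]]
                         }) {
  pages <- lapply(urls, reader)  # one data frame per page
  do.call(rbind, pages)          # combine the pages row-wise
}

# The equivalent for loop, preallocating the list before filling it.
scrape_pages_loop <- function(urls, reader) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    pages[[i]] <- reader(urls[i])
  }
  do.call(rbind, pages)
}
```

With the `urls` vector built above, `results <- scrape_pages(urls[1:3])` would fetch only the first three pages, which is a cheap way to check the table layout before requesting all of them.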

Answer 2

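Here is an approach that works. Your approach failed because the URL did not describe the entire web page. A little experimentation gives the correct URL format for each page, after which things fall into place.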

url1 = 'http://results.public.chicagomarathon.com/2011/index.php?page='
url3 = '&content=list&event=MAR&num_results=25'

# GET TABLE FROM PAGE NUMBER
getPage <- function(page){
  require(XML)
  url = paste(url1, page, url3, sep = "")
  tab = readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
  return(tab)
}

require(plyr)
# for some reason ldply fails, hence the llply + rbind workaround
pages    = llply(1:10, getPage, .progress = 'text') 
marathon = do.call('rbind', pages)
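Since the stated goal is a CSV file, the combined table can be written out once the pages are bound together. A minimal sketch, assuming `pages` is a list of data frames with identical columns; the stand-in data, the column names, and the file name `marathon_2011.csv` are invented for illustration and do not come from the site.

```r
# Stand-in for the list returned by llply(1:10, getPage): two tiny "pages"
# with invented columns, so the example runs without network access.
pages <- list(
  data.frame(Place = c("1", "2"), Name = c("Runner A", "Runner B"),
             stringsAsFactors = FALSE),
  data.frame(Place = c("3", "4"), Name = c("Runner C", "Runner D"),
             stringsAsFactors = FALSE)
)

# Stack the per-page tables into one data frame.
marathon <- do.call(rbind, pages)

# Hypothetical output path; row.names = FALSE keeps the row index out of the file.
out_file <- file.path(tempdir(), "marathon_2011.csv")
write.csv(marathon, out_file, row.names = FALSE)
```

Because `readHTMLTable` is called with `stringsAsFactors = FALSE`, the columns arrive as character vectors, so any numeric or time columns would still need converting before analysis.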

Answer 1
3 2011-10-15 00:56:31

Answer 2
2 2011-10-15 04:22:32
