
How do I scrape multiple pages with XML and readHTMLTable?

I'm using the XML package to scrape results from the Chicago marathon into a CSV. The problem is that the site can only display 1,000 runners on a single page, so I have to scrape multiple pages. The script I've written so far works for the first page:

rm(list=ls())

library(XML)

page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?page", 
page_numbers, 
sep = "="
)

# Read the tables from each page in turn
tables <- list()
for (i in page_numbers) {
  tables[[i]] <- readHTMLTable(urls[i])
}
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

times <- tables[[which.max(n.rows)]]

How can I use this code to scrape all 21 pages and get the complete results? Should I use a for() loop, an lapply function, or something else? I'm a bit lost here.

Thanks!

Add the page number to each URL.

page_numbers <- 1:1429
urls <- paste(
  "http://results.public.chicagomarathon.com/2011/index.php?pid=list&page", 
  page_numbers, 
  sep = "="
)

Now loop over the pages, scraping each one. It doesn't matter much whether you use a for loop or an *apply function. See, e.g., Circle 4 of the R Inferno (pdf) for a discussion of the difference between for loops and lapply.
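As a rough sketch of the two styles, the snippet below fills a list with a for loop and then builds the same list with lapply. Note that read_page is a hypothetical stand-in for the real scraping call (something like readHTMLTable(url)), used here only so the example runs without network access; the page range 1:3 is also just for illustration.

```r
# Hypothetical stand-in for the scraping step. In real use this would be
# replaced by a call such as readHTMLTable(url).
read_page <- function(url) {
  data.frame(url = url, n_chars = nchar(url), stringsAsFactors = FALSE)
}

page_numbers <- 1:3
urls <- paste(
  "http://results.public.chicagomarathon.com/2011/index.php?pid=list&page",
  page_numbers,
  sep = "="
)

# Style 1: for loop, filling a preallocated list
tables_for <- vector("list", length(urls))
for (i in seq_along(urls)) {
  tables_for[[i]] <- read_page(urls[i])
}

# Style 2: lapply, which builds the list for you
tables_lapply <- lapply(urls, read_page)

# Compare the two results
same_result <- identical(tables_for, tables_lapply)
```

Either way you end up with one list element per page; lapply simply saves you the bookkeeping of preallocating and indexing the list.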

Here is an approach that works. Your approach failed because the URL did not describe the whole page request: some query parameters were missing. A little playing around gives you the correct URL format for each page, after which things fall into place.

url1 = 'http://results.public.chicagomarathon.com/2011/index.php?page='
url3 = '&content=list&event=MAR&num_results=25'

# GET TABLE FROM PAGE NUMBER
getPage <- function(page){
  require(XML)
  url = paste(url1, page, url3, sep = "")
  tab = readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
  return(tab)
}

require(plyr)
# for some reason ldply fails, hence the llply + rbind workaround
pages    = llply(1:10, getPage, .progress = 'text') 
marathon = do.call('rbind', pages)
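When scraping many pages, a single failed download would otherwise stop the whole run. One possible way to guard against that, sketched here in base R, is to wrap the page reader in tryCatch, drop the pages that failed, and then bind the rest and write them to a CSV. The names safely and fake_reader are hypothetical and not part of the answer above; fake_reader only simulates getPage (with page 2 failing) so the example runs offline.

```r
# Wrap a page-reading function so that an error returns NULL instead of
# stopping. `reader` is any function(page) returning a data frame,
# for example getPage.
safely <- function(reader) {
  function(page) {
    tryCatch(reader(page), error = function(e) {
      message("page ", page, " failed: ", conditionMessage(e))
      NULL
    })
  }
}

# Hypothetical reader used only for demonstration: page 2 "fails".
fake_reader <- function(page) {
  if (page == 2) stop("simulated download error")
  data.frame(page = page,
             runner = paste0("runner_", page, "_", 1:2),
             stringsAsFactors = FALSE)
}

pages   <- lapply(1:3, safely(fake_reader))
pages   <- Filter(Negate(is.null), pages)  # drop the pages that failed
results <- do.call(rbind, pages)           # one data frame for all pages

# Write the combined results to a CSV (a temporary file in this sketch)
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)
```

With the real scraper you would pass getPage in place of fake_reader, for example lapply(1:10, safely(getPage)).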


 