I'm using the XML package to scrape results from the Chicago marathon into a CSV. The problem is that the site can only display 1,000 runners on a single page, so I have to scrape multiple pages. The script I've written so far works for the first page:
rm(list=ls())
library(XML)
page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?page",
page_numbers,
sep = "="
)
tables <- lapply(urls, readHTMLTable)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
times <- tables[[which.max(n.rows)]]
How can I use this code to scrape all 21 pages and get the complete results? Should I use a for() loop, an lapply() function, or something else? I'm a bit lost here.
Thanks!
Add the page number to each URL.
page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?pid=list&page",
page_numbers,
sep = "="
)
Now loop over each page, scraping each one. It doesn't matter too much whether you use a for loop or an *apply function. See, e.g., Circle 4 of The R Inferno (PDF) for a discussion of the difference between for loops and lapply.
Here is an approach that works. The reason your approach failed is that your URL did not describe the entire webpage. A little bit of playing around gives you the correct format of the URL for each page, after which things fall into place.
url1 = 'http://results.public.chicagomarathon.com/2011/index.php?page='
url3 = '&content=list&event=MAR&num_results=25'
# GET TABLE FROM PAGE NUMBER
getPage <- function(page){
require(XML)
url = paste(url1, page, url3, sep = "")
tab = readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
return(tab)
}
require(plyr)
# for some reason ldply fails, hence the llply + rbind workaround
pages = llply(1:10, getPage, .progress = 'text')
marathon = do.call('rbind', pages)
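Since the original goal was a CSV file, the combined data frame can then be written out. This continues the snippet above (the file name here is just illustrative):

```r
# `marathon` is the combined data frame built from `pages` above
write.csv(marathon, "chicago_marathon_2011.csv", row.names = FALSE)
```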