Using R to crawl multiple pages

So here it goes. Please keep in mind I am completely green when it comes to writing code, and I have no experience outside of R.

Context - Every page I want to crawl has a URL that follows this format:

http://www.hockey-reference.com/friv/dailyleaders.cgi?month=10&day=8&year=2014

The parts of this URL that change are the month, day, and year query parameters.

The URLs should start with 10-8-2014 and end on 6-18-2015. Of course, not every day has an NHL game, so some pages will be blank.
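
For reference, a single URL of this form can be built from a date along these lines (the make_url helper below is only an illustrative sketch, not part of my working code):

make_url <- function(d) {
  d <- as.Date(d)
  # format each date part as a plain integer so "08" becomes 8, matching the URL format
  sprintf("http://www.hockey-reference.com/friv/dailyleaders.cgi?month=%d&day=%d&year=%d",
          as.integer(format(d, "%m")), as.integer(format(d, "%d")), as.integer(format(d, "%Y")))
}
make_url("2014-10-08")
# [1] "http://www.hockey-reference.com/friv/dailyleaders.cgi?month=10&day=8&year=2014"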

All other pages have an HTML table for players and a table for goalies.

I have figured out how to crawl and export to CSV for just a SINGLE page, but I don't know where to go from here to do this in one sweep for every game last season (falling within the dates mentioned above).

The code is below:

library(XML)

# parse the daily leaders page for October 8, 2014
NHL <- htmlParse("http://www.hockey-reference.com/friv/dailyleaders.cgi?month=10&day=8&year=2014")
class(NHL)

# read every HTML table on the page into a list of data frames
NHL.tables <- readHTMLTable(NHL, stringsAsFactors = FALSE)
length(NHL.tables)

# first table: skaters
head(NHL.tables[[1]])
tail(NHL.tables[[1]])

# second table: goalies
head(NHL.tables[[2]])
tail(NHL.tables[[2]])

write.csv(NHL.tables, file = "NHLData.csv")
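
(If write.csv() on the whole list complains because the two tables have different shapes, one option is to write each table to its own file; the file names below are only examples.)

# write the skaters and goalies tables to separate CSV files
write.csv(NHL.tables[[1]], file = "NHLData_skaters.csv", row.names = FALSE)
write.csv(NHL.tables[[2]], file = "NHLData_goalies.csv", row.names = FALSE)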

Thanks in advance!

I'm not sure how you want to write the CSVs, but here's how you can get all the tables between those dates. I tested this on the first few URLs and it worked well. Note that you don't need to parse the HTML before reading the table, as readHTMLTable() can read and parse directly from the URL.

library(XML)
library(RCurl)

# create the days
x <- seq(as.Date("2014-10-08"), as.Date("2015-06-18"), by = "day")
# create a url template for sprintf()
utmp <- "http://www.hockey-reference.com/friv/dailyleaders.cgi?month=%d&day=%d&year=%d"
# convert to numeric matrix after splitting for year, month, day
m <- do.call(rbind, lapply(strsplit(as.character(x), "-"), type.convert, as.is = TRUE))
# create the list to hold the results
tables <- vector("list", nrow(m))
# get the tables
for(i in seq_len(nrow(m))) {
  # create the url for the day and if it exists, read it - if not, NULL
  tables[[i]] <- if(url.exists(u <- sprintf(utmp, m[i, 2], m[i, 3], m[i, 1]))) 
    readHTMLTable(u, stringsAsFactors = FALSE) 
  else NULL
}

The output of str() is quite long, so here's a small peek at the dimensions of the first element:

lapply(tables[[1]], dim)
# $skaters
# [1] 72 23
#
# $goalies
# [1]  7 15

The for() loop above constructs a URL for every day in our sequence and checks that it exists. If it exists, we read the table(s) for that day; if not, that list element is NULL. Please have a look at this, and if it works for you, we'll work on writing it to file.
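
In the meantime, a minimal sketch of that writing step might look like this (the per-day file-naming pattern is just an assumption, adjust as needed):

# write each day's skaters and goalies tables to their own CSV files,
# skipping days with no tables; x is the date sequence created above
for (i in seq_along(tables)) {
  if (is.null(tables[[i]]) || length(tables[[i]]) == 0) next
  day <- as.character(x[i])  # e.g. "2014-10-08"
  write.csv(tables[[i]]$skaters, file = sprintf("skaters_%s.csv", day), row.names = FALSE)
  write.csv(tables[[i]]$goalies, file = sprintf("goalies_%s.csv", day), row.names = FALSE)
}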
