
scrape HTML table with multiple pages using R

I am trying to build a data frame by scraping from the web, but the table I want spans multiple pages: the link is the same, only the page number differs.

For the first page, this is how I would scrape it:

library(XML)
CB.13 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season=2013&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p=1&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"
CB.13 <- readHTMLTable(CB.13, header=FALSE)

## Identify which of the parsed tables is the stats table:
## the heuristic checks that the first cell and the last cell both equal 1.
cornerback.function <- function(CB.13){
  first <- "1"
  last <- "1"
  tab <- NA
  for (i in 1:length(CB.13)){
    lastrow <- nrow(CB.13[[i]])
    lastcol <- ncol(CB.13[[i]])
    if(as.numeric(CB.13[[i]][1,1]) == first &
       as.numeric(CB.13[[i]][lastrow, lastcol]) == last){
      tab <- i
    }
  }
  tab  ## return the index; the original left it trapped in the function's scope
}

tab <- cornerback.function(CB.13)
cornerbacks.2013 <- CB.13[[tab]]
cb.names <- c("Rk", "name", "Team", "Pos", "Comb", "Total", "Ast", "Sck",
              "SFTY", "PDef", "Int", "TDs", "Yds", "Lng", "FF", "Rec", "TD")
names(cornerbacks.2013) <- cb.names

I need to do this for multiple years, each with multiple pages, so is there a quicker way to get all of the pages of data instead of scraping each page of the table individually and merging them? The next link would be http://www.nfl.com/stats/categorystats?tabSeq=1&season=2013&seasonType=REG&Submit=Go&experience=&archive=false&conference=null&d-447263-p=2&statisticPositionCategory=DEFENSIVE_BACK&qualified=true

There are 8 pages for this year. Maybe a for loop to loop through the pages?

You can dynamically create the url using paste0, since the urls differ only slightly: for a given year, you change just the page number. You get a url structure like:

url <- paste0(url1,year,url2,page,url3) ## you change page or year or both

You can create a function that loops over the different pages and returns a table. Then you can bind the tables together using the classic do.call(rbind, ...):

library(XML)
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"

## Scrape one page of the stats table for a given year.
getTable <- function(page=1, year=2013){
  url <- paste0(url1, year, url2, page, url3)
  tab <- readHTMLTable(url, header=FALSE)
  tab$result  ## the stats table is the list element named "result"
}

## this will merge all 8 pages in a single big table
do.call(rbind, lapply(seq_len(8), getTable, year=2013))
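
Since the question also asks about multiple years, the same pattern extends with an outer lapply over seasons. A minimal sketch, assuming every season also has 8 pages (in practice you may want to detect the real page count per season):

## Sketch: loop over several seasons, reusing getTable from above.
## Assumes each season also has 8 pages -- adjust per season if needed.
years <- 2010:2013
all.years <- do.call(rbind, lapply(years, function(y){
  season <- do.call(rbind, lapply(seq_len(8), getTable, year=y))
  season$year <- y  ## tag each row with its season
  season
}))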

The general method

The general method is to scrape the next-page url using an XPath expression and loop until there is no new next page. This can be more difficult to do, but it is the cleanest solution.

## Return the absolute url of the "next" page link, or "" if there is none.
getNext <- function(url=url_base){
  doc <- htmlParse(url)
  XPATH_NEXT <- "//*[@class='linkNavigation floatRight']/*[contains(., 'next')]"
  next_page <- unique(xpathSApply(doc, XPATH_NEXT, xmlGetAttr, 'href'))
  if(length(next_page) > 0)
    paste0("http://www.nfl.com", next_page)
  else ''
}

## url_base is your first url
res <- NULL  ## rbind(NULL, df) returns df, so NULL is a clean accumulator
while(TRUE){
  tab <- readHTMLTable(url_base, header=FALSE)
  res <- rbind(res, tab$result)
  url_base <- getNext(url_base)
  if (nchar(url_base) == 0)
    break
}
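
Whichever approach you use, the merged table still has generic V1..V17 column names because of header=FALSE, so you can apply the names from the question afterwards. Assuming the merged result is in res and has the same 17 columns:

## Apply the question's column names to the merged table (assumes 17 columns).
names(res) <- c("Rk", "name", "Team", "Pos", "Comb", "Total", "Ast", "Sck",
                "SFTY", "PDef", "Int", "TDs", "Yds", "Lng", "FF", "Rec", "TD")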
