
R: looping through a list of links

I have some code that scrapes data off this link (http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280) and runs some calculations.

What I want to do is cycle through every team, collecting the data and running the manipulations for each one. I have a dataframe containing a link for every team, like the one above.

Pseudo code: for (link in teamlist) { scrape, manipulate, put into a table }

However, I can't figure out how to loop through the links.

I've tried doing URL = teamlist$link[i], but I get an error when using readHTMLTable(). I have no trouble manually pasting each team's individual URL into the script; it fails only when I try to pull the URL from a table.
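The shape of the loop I'm after is roughly the sketch below. Note that teamlist and its link column are assumptions from my description above, and one likely cause of the error is that the column is a factor rather than a character vector (the old stringsAsFactors = TRUE default), which readLines() won't accept:

```r
# Sketch only: teamlist$link is assumed to hold one URL per team.
# Convert from factor to character first, in case the dataframe was
# built with stringsAsFactors = TRUE.
teamlist$link <- as.character(teamlist$link)

all_tables <- vector("list", length(teamlist$link))
for (i in seq_along(teamlist$link)) {
  URL <- teamlist$link[i]
  tx <- readLines(URL)
  # ... the same gsub() clean-up as in the current code below ...
  all_tables[[i]] <- readHTMLTable(tx, asText = TRUE, header = TRUE,
                                   which = 2, stringsAsFactors = FALSE)
}
```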

Current code:

library(XML)
library(gsubfn)

URL <- 'http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280'
tx <- readLines(URL)
# The page's misplaced </tbody> and <tfoot> tags confuse readHTMLTable(),
# so strip them and close the body after the footer rows instead
tx2 <- gsub("</tbody>", "", tx)
tx2 <- gsub("<tfoot>", "", tx2)
tx2 <- gsub("</tfoot>", "</tbody>", tx2)
Player_Stats <- readHTMLTable(tx2, asText = TRUE, header = TRUE, which = 2, stringsAsFactors = FALSE)

Thanks.

I agree with @ialm that you should check out the rvest package, which makes it very fun and straightforward to loop through links. I will create some example code here using similar subject matter for you to check out.

Here I am generating the list of links that I will iterate through:

rm(list = ls())
library(rvest)
mainweb <- "http://www.basketball-reference.com/"

# Grab the href attribute of each link in the active-franchises table
urls <- read_html("http://www.basketball-reference.com/teams") %>%
  html_nodes("#active a") %>%
  html_attr("href")

Now that the list of links is complete, I iterate through each link and pull a table from each page:

teamdata <- list()
j <- 1
for (i in urls) {
  bball <- read_html(paste0(mainweb, i))
  # Each team page's table id is the three-letter code from the URL,
  # e.g. "/teams/ATL/" -> "#ATL"
  teamdata[[j]] <- bball %>%
    html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$", "\\1", i, perl = TRUE))) %>%
    html_table() %>%
    .[[1]]
  j <- j + 1
}

Please see the code below, which basically builds off your code and loops through two different team pages, as identified by the vector team_codes. The tables are returned in a list, where each list element corresponds to one team's table. However, the tables look like they will need more cleaning.

library(XML)
library(gsubfn)

Player_Stats <- list()
j <- 1
team_codes <- c(575, 580)
for (code in team_codes) {

  # Build each team's URL from its org_id
  URL <- paste0('http://stats.ncaa.org/team/stats?org_id=', code, '&sport_year_ctl_id=12280')
  tx <- readLines(URL)
  tx2 <- gsub("</tbody>", "", tx)
  tx2 <- gsub("<tfoot>", "", tx2)
  tx2 <- gsub("</tfoot>", "</tbody>", tx2)
  Player_Stats[[j]] <- readHTMLTable(tx2, asText = TRUE, header = TRUE, which = 2, stringsAsFactors = FALSE)
  j <- j + 1

}
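If a single combined table is more convenient than a list, the per-team tables can be stacked afterwards. A sketch, assuming the loop above has run and every team's table came back with identical column names:

```r
# Tag each table with its team code, then stack the list into one data frame.
# Assumes all tables share the same columns; ragged tables will make
# rbind() fail and need aligning first.
names(Player_Stats) <- team_codes
combined <- do.call(rbind, lapply(names(Player_Stats), function(code) {
  cbind(org_id = code, Player_Stats[[code]], stringsAsFactors = FALSE)
}))
```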
