[英]web scraping data table with r rvest
我正在嘗試從以下網站抓取一張桌子:
http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats
該表格的標題為“其他統計信息”,問題是此網頁上有多個表格,我不知道自己是否在識別正確的表格。 我嘗試了以下代碼,但是它創建的只是一個空白數據框:
library(rvest)
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
tmisc <- adv %>%
read_html() %>%
html_nodes(xpath = '//*[@id="div_misc_stats"]') %>%
html_table()
tmisc <- data.frame(tmisc)
我覺得我缺少一些瑣碎的東西,但是我在所有Google搜索中都找不到。 任何幫助深表感謝。
由於所需的表一直隱藏在注釋中,直到被JavaScript揭示為止,您要么需要使用RSelenium來運行JavaScript(這很痛苦),要么需要解析注釋(這仍然很麻煩,但要稍微少一點點) 。
library(rvest)
library(readr) # for type_convert
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
h <- adv %>% read_html() # be kind; don't rescrape unless necessary
df <- h %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#misc_stats') %>% # select desired node
html_table() %>% # parse node to table
{ setNames(.[-1, ], paste0(names(.), .[1, ])) } %>% # extract names from first row
type_convert() # fix column types
df[1:6, 1:14]
## Rk Team Age PW PL MOV SOS SRS ORtg DRtg Pace FTr 3PAr TS%
## 2 1 Golden State Warriors* 27.4 65 17 10.76 -0.38 10.38 114.5 103.8 99.3 0.250 0.362 0.593
## 3 2 San Antonio Spurs* 30.3 67 15 10.63 -0.36 10.28 110.3 99.0 93.8 0.246 0.223 0.564
## 4 3 Oklahoma City Thunder* 25.8 59 23 7.28 -0.19 7.09 113.1 105.6 96.7 0.292 0.275 0.565
## 5 4 Cleveland Cavaliers* 28.1 57 25 6.00 -0.55 5.45 110.9 104.5 93.3 0.259 0.352 0.558
## 6 5 Los Angeles Clippers* 29.7 53 29 4.28 -0.15 4.13 108.3 103.8 95.8 0.318 0.324 0.556
## 7 6 Toronto Raptors* 26.3 53 29 4.50 -0.42 4.08 110.0 105.2 92.9 0.328 0.287 0.552
這是另一個麻煩的解決方案。 閱讀頁面,保存,重新閱讀,刪除評論標記,然后處理頁面:
gameUrl <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
gameHtml <- gameUrl %>% read_html()
#gameHtml %>% html_nodes("tbody")
#Only save and work with the body
body<-html_node(gameHtml,"body")
write_xml(body, "nba.xml")
#Find and remove comments
lines<-readLines("nba.xml")
lines<-lines[-grep("<!--", lines)]
lines<-lines[-grep("-->", lines)]
writeLines(lines, "nba2.xml")
#Read the file back in and process normally
body<-read_html("nba2.xml")
#Table 10 was found by looking at all of tables and picking the one of interest
tableofinterest<-(html_nodes(body, "tbody")[10])
rows<-html_nodes(tableofinterest, "tr")
tableOfResults<-t(sapply(rows, function(x) {html_text(html_nodes(x, "td"))}))
#find titles from the frist record's attributes
titles<-html_attrs(html_nodes(rows[1], "td"))
dfnames<-unlist(titles)[seq(2, 2*length(titles), by=2)]
#Final results are stored in data frame "df"
df<-as.data.frame(tableOfResults)
names(df)<-dfnames
該代碼有效,但應該簡化! 這基於我在此處發布的類似解決方案: 如何使用rvest()獲取表
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.