Web scraping a data table

Question

I am trying to scrape a table from the following website:

http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats

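The table is titled "Miscellaneous Stats". The problem is that there are multiple tables on this page, and I don't know whether I am identifying the correct one. I tried the following code, but all it produces is an empty data frame: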

library(rvest)
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
tmisc <- adv %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="div_misc_stats"]') %>%
  html_table()
tmisc <- data.frame(tmisc)

I feel like I am missing something trivial, but I have not been able to find it in any of my Google searches. Any help is greatly appreciated.

Answer 1

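Since the table you want is hidden inside an HTML comment until JavaScript reveals it, the static HTML that `read_html()` sees contains no such table element, which is why your selector matches nothing. You either need to use RSelenium to run the JavaScript (which is a pain), or parse the comments yourself (which is still a pain, but slightly less so).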

library(rvest)
library(readr)    # for type_convert

adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"

h <- adv %>% read_html()    # be kind; don't rescrape unless necessary

df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comments
    html_text() %>%    # extract comment text
    paste(collapse = '') %>%    # collapse to single string
    read_html() %>%    # reread as HTML
    html_node('table#misc_stats') %>%    # select desired node
    html_table() %>%    # parse node to table
    { setNames(.[-1, ], paste0(names(.), .[1, ])) } %>%    # extract names from first row
    type_convert()    # fix column types

df[1:6, 1:14]
##   Rk                   Team  Age PW PL   MOV   SOS   SRS  ORtg  DRtg Pace   FTr  3PAr   TS%
## 2  1 Golden State Warriors* 27.4 65 17 10.76 -0.38 10.38 114.5 103.8 99.3 0.250 0.362 0.593
## 3  2     San Antonio Spurs* 30.3 67 15 10.63 -0.36 10.28 110.3  99.0 93.8 0.246 0.223 0.564
## 4  3 Oklahoma City Thunder* 25.8 59 23  7.28 -0.19  7.09 113.1 105.6 96.7 0.292 0.275 0.565
## 5  4   Cleveland Cavaliers* 28.1 57 25  6.00 -0.55  5.45 110.9 104.5 93.3 0.259 0.352 0.558
## 6  5  Los Angeles Clippers* 29.7 53 29  4.28 -0.15  4.13 108.3 103.8 95.8 0.318 0.324 0.556
## 7  6       Toronto Raptors* 26.3 53 29  4.50 -0.42  4.08 110.0 105.2 92.9 0.328 0.287 0.552
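
To see why the comment step matters, here is a minimal offline sketch of the same technique. The inline HTML, the team names, and the `demo` variable are made up for illustration and only mimic the structure described above; the real page is much larger.

```r
library(rvest)

# Hypothetical miniature page: the table sits inside an HTML comment,
# mimicking the structure of the page in the question.
page <- read_html('
<html><body>
  <div id="all_misc_stats">
    <!--
    <table id="misc_stats">
      <tr><th>Team</th><th>Age</th></tr>
      <tr><td>Team A</td><td>27.4</td></tr>
      <tr><td>Team B</td><td>30.3</td></tr>
    </table>
    -->
  </div>
</body></html>')

# The parser finds no <table> element, only a comment node,
# so selecting the table directly returns nothing.
n_tables_visible <- length(html_nodes(page, "table"))

# Extract the comment text and parse it as HTML in its own right
inner <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html()

# Now the table is an ordinary node and can be parsed as usual
demo <- inner %>%
  html_node("table#misc_stats") %>%
  html_table()
```

Here `n_tables_visible` is zero, while `demo` holds the two data rows. That is the same situation as the empty data frame in the question.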

Answer 2

Here is another messy solution. Read the page, save it, re-read it, remove the comment markers, and then process the page:

library(rvest)
library(xml2)    # for write_xml

gameUrl <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
gameHtml <- gameUrl %>% read_html()
#gameHtml %>% html_nodes("tbody")

#Only save and work with the body
body<-html_node(gameHtml,"body")
write_xml(body, "nba.xml")

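#Find and remove the lines holding comment markers
#(grepl is used because x[-grep(...)] returns nothing when there is no match)
lines<-readLines("nba.xml")
lines<-lines[!grepl("<!--", lines)]
lines<-lines[!grepl("-->", lines)]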
writeLines(lines, "nba2.xml")

#Read the file back in and process normally
body<-read_html("nba2.xml")

#Table 10 was found by looking at all of the tables and picking the one of interest
tableofinterest<-(html_nodes(body, "tbody")[10])

rows<-html_nodes(tableofinterest, "tr")
tableOfResults<-t(sapply(rows, function(x) {html_text(html_nodes(x, "td"))}))
#find titles from the first record's attributes
titles<-html_attrs(html_nodes(rows[1], "td"))
dfnames<-unlist(titles)[seq(2, 2*length(titles), by=2)]

#Final results are stored in data frame "df"
df<-as.data.frame(tableOfResults)
names(df)<-dfnames

This code works, but it should be simplified! It is based on a similar solution I posted here: How to get table using rvest()
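
One possible simplification, offered only as a sketch, is to strip the comment markers in memory instead of going through temporary files. The helper `uncomment_html()`, the inline sample, and the `data-stat` attribute name are assumptions made for illustration; check them against the real page source before relying on them.

```r
library(rvest)

# Hypothetical helper: remove comment delimiters from raw HTML text,
# then parse what is left as a normal document.
uncomment_html <- function(src) {
  stripped <- gsub("<!--|-->", "", src)
  read_html(stripped)
}

# Made-up sample standing in for the downloaded page source
raw <- '<html><body>
<div id="all_misc_stats">
<!--
<table id="misc_stats">
<tbody>
<tr><td data-stat="team_name">Team A</td><td data-stat="age">27.4</td></tr>
<tr><td data-stat="team_name">Team B</td><td data-stat="age">30.3</td></tr>
</tbody>
</table>
-->
</div>
</body></html>'

doc  <- uncomment_html(raw)
rows <- html_nodes(doc, "table#misc_stats tbody tr")

# Column names come from an attribute on the first row's cells (assumed name)
col_names <- html_attr(html_nodes(rows[1], "td"), "data-stat")

# One character vector per row, stacked into a matrix
values <- t(sapply(rows, function(r) html_text(html_nodes(r, "td"))))

df2 <- as.data.frame(values, stringsAsFactors = FALSE)
names(df2) <- col_names
```

Selecting the table by its id avoids having to count through every `tbody` on the page to find the tenth one. All columns come back as character, so a type conversion step is still needed afterwards.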
