R: Web scraping: "XML content does not seem to be XML" when using htmlParse

Question

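I am trying to scrape several years of data from the web, where each year is a separate page. My 2019 data works exactly the way I want, but I get an error when I try to prepare the 2016 data the same way.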

url19 <- 'https://www.pro-football-reference.com/draft/2019-combine.htm'

get_pfr_HTML_file19 <- GET(url19)

combine.parsed19 <- htmlParse(get_pfr_HTML_file19)

page.tables19 <- readHTMLTable(combine.parsed19, stringsAsFactors = FALSE)

data19 <- data.frame(page.tables19[1])

cleanData19 <- data19[!rowSums(data19 == "")> 0,]

cleanData19 <- filter(cleanData19, cleanData19$combine.Pos == 'CB' | cleanData19$combine.Pos == 'S')

cleanData19 is exactly what I want, but when I run the same steps on the 2016 data I get the error: XML content does not seem to be XML: ''

url16 <- 'https://www.pro-football-reference.com/draft/2016-combine.htm'

get_pfr_HTML_file16 <- GET(url16)

combine.parsed16 <- htmlParse(get_pfr_HTML_file16)

page.tables16 <- readHTMLTable(combine.parsed16, stringsAsFactors = FALSE)

data16 <- data.frame(page.tables16[1])

cleanData16 <- data16[!rowSums(data16 == "")> 0,]

cleanData16 <- filter(cleanData16, cleanData16$combine.Pos == 'CB' | cleanData16$combine.Pos == 'S')

The error occurs when I run combine.parsed16 <- htmlParse(get_pfr_HTML_file16).

Answer 1

I'm not 100% sure what output you want, and you did not include your library calls in your example. In any case, this code gets you the table:

library(rvest)
library(dplyr)

url <- 'https://www.pro-football-reference.com/draft/2016-combine.htm'

read_html(url) %>% 
  html_nodes(".stats_table") %>% 
  html_table() %>% 
  as.data.frame() %>% 
  filter(Pos == 'CB' | Pos == "S")
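
If you would rather keep the httr/XML approach from the question, here is a minimal sketch for narrowing down the failure. The thread does not say why the 2016 request fails; the empty string in the error message suggests the parser received no content, and that is an assumption rather than a confirmed diagnosis. The helper names fetch_html_text and tables_from_text are hypothetical, introduced only for illustration.

```r
library(httr)
library(XML)

# Download a page and return its body as text.
# stop_for_status() raises an R error for 4xx/5xx responses, so a blocked
# or failed request is reported here instead of inside the parser.
fetch_html_text <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)
  content(resp, as = "text", encoding = "UTF-8")
}

# Parse HTML text into a list of data frames, refusing empty input.
tables_from_text <- function(html_text) {
  if (length(html_text) != 1 || is.na(html_text) || !nzchar(html_text)) {
    stop("Response body is empty, so there is nothing to parse.")
  }
  # asText = TRUE tells htmlParse the argument is markup, not a file name
  doc <- htmlParse(html_text, asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}
```

With these in place, page.tables16 <- tables_from_text(fetch_html_text(url16)) either returns the tables or stops with a message that says whether the request or the empty body was the problem.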

For several years at once:

library(rvest)
library(magrittr)
library(dplyr)
library(purrr)

years <- 2013:2019
urls <- paste0(
  'https://www.pro-football-reference.com/draft/',
  years,
  '-combine.htm')

map(
  urls,
  ~read_html(.x) %>% 
    html_nodes(".stats_table") %>% 
    html_table() %>% 
    as.data.frame()
) %>%
  set_names(years) %>% 
  bind_rows(.id = "year") %>% 
  filter(Pos == 'CB' | Pos == "S")
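
A slightly more defensive variant of the multi-year loop is sketched below. It rests on two assumptions that the thread does not confirm: that a single page might fail to load, and that html_table() might guess different column types for different years, which would make bind_rows() complain. The function names parse_combine, safe_parse and scrape_years are hypothetical.

```r
library(rvest)
library(dplyr)
library(purrr)

# Parse one page (a URL or a literal HTML string) and convert every column
# to character so that tables from different years can always be stacked.
parse_combine <- function(src) {
  read_html(src) %>%
    html_node(".stats_table") %>%
    html_table() %>%
    mutate(across(everything(), as.character))
}

# Return NULL instead of stopping when a single page fails.
safe_parse <- possibly(parse_combine, otherwise = NULL)

# `sources` must be a named list or vector; the names become the year column.
scrape_years <- function(sources, pause = 0) {
  tables <- map(sources, function(src) {
    Sys.sleep(pause)  # optional delay between requests
    safe_parse(src)
  }) %>%
    compact()         # drop the pages that failed

  if (length(tables) == 0) return(data.frame())

  tables %>%
    bind_rows(.id = "year") %>%
    filter(Pos %in% c("CB", "S"))
}
```

It could be called as scrape_years(set_names(urls, as.character(years)), pause = 2). Because every column comes back as text, numeric columns would need converting, for example with as.numeric(), before any calculation.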

Answer 1 was accepted on 2020-11-14 01:32:48.
