
HTML Rvest scrape not bringing up tables

I am unable to scrape tables from this website. When I go after a data table, all I get back is one line of code. The website is https://mc.championdata.com/anz_premiership/index.html?competitionid=11035&matchid=110350101 and my code is below.

library(xml2)
library(rvest)
library(XML)

datalist = list()

web<- render_html(url = 'https://mc.championdata.com/anz_premiership/index.html?competitionid=10574&matchid=105740101')


#xpath =  '//*[@id="cd6364_SHELL_grids"]/div[1]/table'
#print(xpath)
  
#tables<- html_nodes(web, 'table')
track<- web %>%
  html_nodes(xpath = '//*[@id="cd6364_SHELL_grids"]/div[1]/table') %>%
  html_table()

As with most modern data-rich web pages, the data you are looking for is not in the html document sent in response to an http request to that url. Instead, your browser receives html that contains javascript code. Your browser can run this javascript code, which prompts it to send further http requests to get the actual serialized data (usually in json format) that populates the page. When you are web scraping with rvest or other static web-scraping tools, the original html is received as plain text, and there is no javascript engine that will automatically work on it to generate the requests for json.

So the reason you can't get the data from that page is that the data is not on the page you downloaded.
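To illustrate the point without touching the network, here is a minimal sketch that parses a small, made-up html string with rvest. The string and the `grids` id are hypothetical stand-ins for what a static download of a javascript-driven page might look like; they are not copied from the real site.

```r
library(rvest)

# Hypothetical stand-in for the html as downloaded: the container is empty,
# and only a browser running the script would ever fill it with a table.
static_html <- '<html><body>
  <div id="grids"></div>
  <script>/* in a browser, this would request json and build the table */</script>
</body></html>'

page <- read_html(static_html)

# Look for table nodes in the parsed document
tables <- html_nodes(page, "table")

# 0 here: no <table> exists in the text that was parsed
length(tables)
```

Because no script is executed, the parsed document contains only the empty container, which is consistent with getting nothing useful back from `html_table()`.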

To get round this, you have to open the developer tools in your web browser (via F12) and find the url at which the json is located by watching for XHR requests made by your browser (or by finding direct links embedded in the html text itself). In your case, the json address is https://mc.championdata.com/data/11035/fixture.json?_=9.930
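For the second option (links embedded in the html text), a rough sketch is to search the raw page source for anything that looks like a json url. The `page_source` string and the example.com address below are invented for illustration; a real page may assemble its urls dynamically in javascript, in which case watching the XHR requests is the more reliable route.

```r
# Hypothetical page source containing an embedded json link
page_source <- '<script>var cfg = {fixture: "https://example.com/data/11035/fixture.json"};</script>'

# Match http(s) urls ending in .json, stopping at quotes or spaces
pattern <- "https?://[^\"' ]+\\.json"
json_urls <- unlist(regmatches(page_source, gregexpr(pattern, page_source)))

json_urls
```

On a real page you would obtain the source text first, for example with `readLines()` on the url, and then apply the same pattern.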

You can parse the json directly and shape it into a data frame like this:

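# json endpoint found via the browser's developer tools
url <- "https://mc.championdata.com/data/11035/fixture.json?_=1593081934709"
# Each element of `fixture` is one match, parsed as a named list
fixture <- jsonlite::read_json(url)$fixture$match

# Keep only the fields that also appear in match 20 (presumably so that all
# rows share the same columns), then bind the one-row data frames together
df <- do.call(rbind, lapply(fixture, function(x) 
  as.data.frame(x[names(x) %in% names(fixture[[20]])])))

dplyr::as_tibble(df)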
#> # A tibble: 45 x 22
#>    awaySquadName matchType homeSquadId homeSquadShortC~ homeSquadNickna~
#>    <fct>         <fct>           <int> <fct>            <fct>           
#>  1 Central Pulse H                 802 TAC              Tactix          
#>  2 Northern Mys~ H                8120 NS               Stars           
#>  3 WBOP Magic    H                 808 STE              Steel           
#>  4 Northern Mys~ H                 809 WBM              Magic           
#>  5 Mainland Tac~ H                 808 STE              Steel           
#>  6 Central Pulse H                8120 NS               Stars           
#>  7 Mainland Tac~ H                8120 NS               Stars           
#>  8 WBOP Magic    H                 802 TAC              Tactix          
#>  9 Southern Ste~ H                 805 MYS              Mystics         
#> 10 Southern Ste~ H                8120 NS               Stars           
#> # ... with 35 more rows, and 17 more variables: matchStatus <fct>,
#> #   roundNumber <int>, homeSquadName <fct>, awaySquadNickname <fct>,
#> #   venueId <int>, awaySquadId <int>, venueCode <fct>, localStartTime <fct>,
#> #   matchId <int>, finalCode <fct>, finalShortCode <fct>, venueName <fct>,
#> #   utcStartTime <fct>, awaySquadCode <fct>, homeSquadCode <fct>,
#> #   awaySquadShortCode <fct>, matchNumber <int>

Created on 2020-06-25 by the reprex package (v0.3.0)
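The code above picks the field names of one particular match (`fixture[[20]]`) as the set of columns to keep. A possible variation, sketched here on a small made-up list rather than the real json, is to keep the fields common to every record instead. The records and the `extra` field are hypothetical; only `matchId` and `homeSquadName` are names taken from the output above.

```r
# Hypothetical records standing in for the parsed `fixture` list;
# the first one carries an additional field the second one lacks
fixture <- list(
  list(matchId = 1L, homeSquadName = "Stars", extra = "x"),
  list(matchId = 2L, homeSquadName = "Steel")
)

# Field names present in every record
common <- Reduce(intersect, lapply(fixture, names))

# One single-row data frame per record, restricted to the common fields,
# then stacked row-wise
df <- do.call(rbind, lapply(fixture, function(x)
  as.data.frame(x[common], stringsAsFactors = FALSE)))

df
```

This avoids depending on the position of any single record, at the cost of silently dropping fields that only some matches have.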
