
rvest HTML table scraping techniques return empty lists

I have had success with rvest when scraping data from HTML tables. However, for this particular website, http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/ , when I run the code

library(rvest)

url <- "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/"
rankings <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()

All that is returned is an empty list. What might be wrong?

The "problem" with this site is that it dynamically loads a JavaScript file and then executes it via a callback mechanism (JSONP) to create the data from which it builds the tables/visualizations.

One way to get the data is RSelenium, but that's problematic for many folks.

Another way is to use your browser's Developer Tools to find the JS request, run "Copy as cURL" (right-click, usually), and then use some R-fu to get what you need. Since the response is going to be JavaScript, we'll need to do some mangling before ultimately converting the JSON.

library(jsonlite)
library(curlconverter)
library(httr)

# this is the `Copy as cURL` result, but you can leave it in your clipboard 
# and not do this in production. Read the `curlconverter` help for more info

CURL <- "curl 'http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD&jsoncallback=RU3_205_2016' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54' -H 'Accept: */*' -H 'Referer: http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 11 May 2016 14:47:09 GMT' -H 'Cache-Control: max-age=0' --compressed"

req <- make_req(straighten(CURL))[[1]]
req

# that makes:

# httr::VERB(verb = "GET", url = "http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD&jsoncallback=RU3_205_2016", 
#     httr::add_headers(DNT = "1", `Accept-Encoding` = "gzip, deflate, sdch", 
#         `Accept-Language` = "en-US,en;q=0.8", `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54", 
#         Accept = "*/*", Referer = "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/", 
#         Connection = "keep-alive", `If-Modified-Since` = "Wed, 11 May 2016 14:47:09 GMT", 
#         `Cache-Control` = "max-age=0"))

# which we can transform into the following after experimenting

URL <- "http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD&jsoncallback=RU3_205_2016"

pg <- GET(URL,
          add_headers(
            `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54", 
            Referer = "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/"))

# now all we need to do is remove the callback

dat_from_json <- fromJSON(gsub(")$", "", gsub("^RU3_205_2016\\(", "", content(pg, as="text"))), flatten=FALSE)
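Since the real endpoint requires valid `user`/`psw` credentials, here is a self-contained sketch of just the callback-stripping step on a toy JSONP payload (the `RU3_205_2016` wrapper name comes from the `jsoncallback` parameter above; the toy field names are made up for illustration):

```r
library(jsonlite)

# toy JSONP payload mimicking the shape of the real response
jsonp <- 'RU3_205_2016({"teams":[{"name":"Hurricanes","points":58}]})'

# strip the leading "RU3_205_2016(" and the trailing ")" to leave bare JSON
bare_json <- gsub("\\)$", "", gsub("^RU3_205_2016\\(", "", jsonp))

dat <- fromJSON(bare_json, flatten = FALSE)
dat$teams
```

The same two nested `gsub()` calls are what the line above applies to `content(pg, as="text")`.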


# we can also try removing the `jsoncallback` parameter from the URL,
# in which case the endpoint returns XML instead of JSON,
# which is fine since we can parse that easily

URL <- "http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD"

pg <- GET(URL,
          add_headers(
            `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54", 
            Referer = "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/"))

xml_doc <- content(pg, as="parsed", encoding="UTF-8")

# but then you have to transform the XML, which I'll leave as an exercise for the OP :-)
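As a starting point for that transformation, here is a hedged sketch using xml2, assuming a hypothetical structure in which each team is a `<TeamRecord>` node carrying its stats as attributes. The real Opta feed's element and attribute names will likely differ, so inspect the document first (e.g. with `xml2::xml_structure(xml_doc)`):

```r
library(xml2)

# a toy document standing in for the real feed; the element and
# attribute names here are assumptions, not the actual Opta schema
toy <- read_xml('
<Response>
  <TeamRecord TeamName="Hurricanes" Played="15" Points="58"/>
  <TeamRecord TeamName="Lions" Played="15" Points="55"/>
</Response>')

recs <- xml_find_all(toy, ".//TeamRecord")

# pull each attribute into a column of a data frame
rankings <- data.frame(
  team   = xml_attr(recs, "TeamName"),
  played = as.integer(xml_attr(recs, "Played")),
  points = as.integer(xml_attr(recs, "Points")),
  stringsAsFactors = FALSE
)
rankings
```

Swap `toy` for `xml_doc` and adjust the XPath and attribute names once you've seen the feed's real structure.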
