Web scraping HTML in R

I want to get the list of URLs by scraping http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm, like this:

[1] "P-Obama-Inaugural-Speech-Inauguration.htm"
[2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"

and this is my code:

library(XML)

url = "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc = htmlTreeParse(url, useInternalNodes = T)
url.list = xpathSApply(doc, "//a[contains(@href, 'htm')]")

The problem is that I want to unlist() url.list so I can strsplit it, but it doesn't unlist.

One more step ought to do it (you just need to get the href attribute):

library(XML)

url <- "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc <- htmlTreeParse(url, useInternalNodes=TRUE)

url.list <- xpathSApply(doc, "//a[contains(@href, 'htm')]")
hrefs <- gsub("^/", "", sapply(url.list, xmlGetAttr, "href"))

head(hrefs, 6)

## [1] "P-Obama-Inaugural-Speech-Inauguration.htm"                                                                                                 
## [2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"                                                    
## [3] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"                                                    
## [4] "E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm"                                                                       
## [5] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
## [6] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"

free(doc)
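The reason unlist() seemed to do nothing is that xpathSApply() here returns a list of internal XML node pointers, not character strings; once the href attributes are extracted, as above, you have an ordinary character vector that strsplit() handles fine. A minimal base-R sketch, using a made-up pair of hrefs standing in for the scraped values:

```r
# Hypothetical sample standing in for the scraped `hrefs` vector above
hrefs <- c("/P-Obama-Inaugural-Speech-Inauguration.htm",
           "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm")

# Strip a leading slash, as in the answer
hrefs <- gsub("^/", "", hrefs)

# A plain character vector splits without any unlist() gymnastics
parts <- strsplit(hrefs, "-")
parts[[1]][1]
## [1] "P"
```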

UPDATE Obligatory rvest + dplyr way:

library(rvest)
library(dplyr)

# read_html() replaces the now-removed html() in current rvest
speeches <- read_html("http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm")
speeches %>% html_nodes("a[href*='htm']") %>% html_attr("href") %>% head(6)

## same output as above
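As a side note, if neither XML nor rvest is available, a rough base-R approximation with regmatches() can pull the hrefs out of a simple page like this one. Regex-based HTML parsing is fragile, so treat this as a fallback rather than a replacement for a real parser; the snippet below is a made-up stand-in for the downloaded page:

```r
# A small HTML snippet standing in for the downloaded page
html <- '<a href="P-Obama-Inaugural-Speech-Inauguration.htm">Inaugural</a>
<a href="E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm">Manassas</a>'

# Pull every href value containing "htm" (mirrors the XPath/CSS filters above)
m <- regmatches(html, gregexpr('href="[^"]*htm[^"]*"', html))[[1]]
hrefs <- gsub('^href="|"$', "", m)
hrefs
## [1] "P-Obama-Inaugural-Speech-Inauguration.htm"
## [2] "E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm"
```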


