
Web scraping html with rvest and R

I would like to scrape this web site: https://www.askramar.com/Ponuda . First, I need to collect all the links that lead to each car's page. The links look like this in the HTML structure:

[screenshot of the page's HTML structure]

I tried the following code, but I get an empty object in R:

library(rvest)

url <- "https://www.askramar.com/Ponuda"
html_document <- read_html(url)

links <- html_document %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "vozilo", " "))]') %>%
  html_attr(name = "href")

Is the content loaded by JavaScript on the page? Please help! Thanks!

Yes, the page uses JavaScript to load the content you are interested in. However, it does so simply by sending an XHR GET request to https://www.askramar.com/Ajax/GetResults.cshtml . You can do the same:

library(rvest)
library(httr)

url <- "https://www.askramar.com/Ajax/GetResults.cshtml?stranica="

links <- list()
for (i in 1:45) {
  # Pages are zero-indexed, hence i - 1
  links[[i]] <- GET(paste0(url, i - 1)) %>%
    read_html() %>%
    html_nodes(xpath = '//a[contains(@href, "Vozilo")]') %>%
    html_attr(name = "href")
}

links <- do.call("c", links)

print(links)


# [1] "Vozilo?id=17117" "Vozilo?id=17414" "Vozilo?id=17877" "Vozilo?id=17834"
# [5] "Vozilo?id=17999" "Vozilo?id=18395" "Vozilo?id=17878" "Vozilo?id=16256"
# [9] "Vozilo?id=17465" "Vozilo?id=17560" "Vozilo?id=17912" "Vozilo?id=18150"
#[13] "Vozilo?id=18131" "Vozilo?id=17397" "Vozilo?id=18222" "Vozilo?id=17908"
#[17] "Vozilo?id=18333" "Vozilo?id=17270" "Vozilo?id=18105" "Vozilo?id=16803"
#[21] "Vozilo?id=16804" "Vozilo?id=17278" "Vozilo?id=17887" "Vozilo?id=17939"
# ...plus 1037 further elements
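Note that these hrefs are relative, so before requesting the individual pages you would resolve them against the site root. A minimal sketch using `xml2::url_absolute()`; the two sample ids are taken from the output above:

```r
library(xml2)

# Two of the relative hrefs returned above
links <- c("Vozilo?id=17117", "Vozilo?id=17414")

# Resolve each relative href against the site root
full_urls <- url_absolute(links, "https://www.askramar.com/")
print(full_urls)
# [1] "https://www.askramar.com/Vozilo?id=17117"
# [2] "https://www.askramar.com/Vozilo?id=17414"
```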

If you inspect the network traffic on the page, you will see it sends GET requests with many query parameters, the most important being 'stranica'. Using that information, I did the following:

library(rvest)

stranice <- 1:3

askramar_scrap <- function(stranica) {
  url <- paste0("https://www.askramar.com/Ajax/GetResults.cshtml?stanje=&filter=&lokacija=&", 
                "pojam=&marka=&model=&godinaOd=&godinaDo=&cijenaOd=&cijenaDo=&snagaOd=&snagaDo=&", 
                "karoserija=&mjenjac=&boja=&pogon4x4=&sifra=&stranica=", stranica, "&sort=")
  html_document <- read_html(url)
  html_document %>%
    html_nodes(xpath = '//a[contains(@href, "Vozilo")]') %>%
    html_attr(name = "href")
}

links <- lapply(stranice, askramar_scrap)
links <- unlist(links)
links <- unique(links)
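With the deduplicated links in hand, the next step would be to visit each vehicle page and pull out the fields you want. The markup of the real pages isn't shown in the question, so the class names below (`naslov`, `cijena`) and the sample values are made-up placeholders; the point is only the `read_html()` → `html_node()` → `html_text()` pattern, demonstrated on an inline snippet via `rvest::minimal_html()`:

```r
library(rvest)

# Stand-in for read_html(paste0("https://www.askramar.com/", links[1]));
# the selectors and values are hypothetical and must be adapted to the real markup.
page <- minimal_html('
  <h1 class="naslov">Audi A4 2.0 TDI</h1>
  <span class="cijena">15.990 EUR</span>')

title <- page %>% html_node("h1.naslov") %>% html_text()
price <- page %>% html_node("span.cijena") %>% html_text()
cat(title, "-", price, "\n")
```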

Hope that is what you need.
