Confusion Regarding HTML Code For Web Scraping With R

I am struggling to use the rvest package in R, most likely because of my lack of knowledge about CSS and HTML. Here is an example (my guess is that the ".quote-header-info" selector is what is wrong; I also tried ".Trsdu..." but had no luck either):

library(rvest)
url="https://finance.yahoo.com/quote/SPY"

website=read_html(url) %>%
  html_nodes(".quote-header-info") %>%
  html_text() %>% toString()

website

Below is the webpage I am trying to scrape. Specifically, I am looking to grab the value "416.74". I took a peek at the documentation here ( https://cran.r-project.org/web/packages/rvest/rvest.pdf ) but I think the issue is that I don't understand the structure of the webpage I am looking at.

[Image: screenshot of the webpage being scraped]

The tricky part is determining the correct set of attributes to select only this one HTML node.

In this case, it is the span tag with the classes Trsdu(0.3s) and Fz(36px):

library(rvest)
url="https://finance.yahoo.com/quote/SPY"

#read page once
page <- read_html(url)

#now extract information from the page
price <- page %>%  html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
   html_text()

price

Note: "(", ")", and "." are all special characters in a CSS selector, hence the need to escape them. The double backslash "\\" in the R string literal becomes the single backslash that the selector requires.
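To make the escaping concrete, here is a minimal sketch that runs against an invented one-element page rather than the live site. The inline HTML is an assumption made up for illustration; only the class names come from the question. It also shows an attribute selector as one possible way to avoid the escaping altogether.

```r
library(rvest)

# Invented snippet that mimics the class names from the question (not the real page).
page <- read_html('<div id="quote-header-info"><span class="Trsdu(0.3s) Fz(36px)">416.74</span></div>')

# "\\" in an R string literal is one backslash, which is what the CSS parser receives.
selector <- "span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)"
cat(selector, "\n")  # shows the selector with single backslashes

price_escaped <- page %>% html_node(selector) %>% html_text()

# Alternative without escaping: match a substring of the class attribute.
price_attr <- page %>% html_node("span[class*='Fz(36px)']") %>% html_text()

print(price_escaped)
print(price_attr)
```

Both selectors should pick out the same span here; on the live page the class names may differ or change over time.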

Those classes are dynamic and change much more frequently than other parts of the HTML, so they should be avoided. You have at least two more robust options:

  1. Extract the JavaScript object housing that data (plus a lot more) from a script tag, then parse it with jsonlite
  2. Use positional matching against other, more stable, HTML elements

I show both below. The advantage of the first is that you can extract lots of other page data from the resulting JSON object.


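library(magrittr)
library(rvest)
library(stringr)
library(jsonlite)

# Read the page once
page <- read_html('https://finance.yahoo.com/quote/SPY')

# Option 1: capture the JavaScript object assigned to root.App.main in a script tag
data <- page %>%
  toString() %>%
  stringr::str_match('root\\.App\\.main = (.*?[\\s\\S]+)(?=;[\\s\\S]+\\(th)') %>% .[2]

# Parse the captured string as JSON and drill down to the price
json <- jsonlite::parse_json(data)
print(json$context$dispatcher$stores$StreamDataStore$quoteData$SPY$regularMarketPrice$raw)

# Option 2: positional matching relative to the more stable #quote-header-info id
print(page %>% html_node('#quote-header-info div:nth-of-type(2) ~ div div:nth-child(1) span') %>% html_text() %>% as.numeric())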
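Since the live page's markup can change at any time, both techniques can also be tried offline. The sketch below uses a small invented document: the `appData` variable, its JSON layout, and the header markup are assumptions made up for illustration and do not reflect Yahoo's actual page.

```r
library(rvest)
library(stringr)
library(jsonlite)

# Invented page: a header block plus a script tag assigning a JSON object.
page_source <- '<html><body>
<div id="quote-header-info"><div>SPY</div><div><span>416.74</span><span>+1.20</span></div></div>
<script>var appData = {"quote": {"symbol": "SPY", "price": {"raw": 416.74, "fmt": "416.74"}}};</script>
</body></html>'
page <- read_html(page_source)

# Option 1: take the script text, cut out the JSON literal, and parse it.
script_text <- page %>% html_node("script") %>% html_text()
json_string <- str_match(script_text, "var appData = (\\{.*\\});")[, 2]
app_data <- parse_json(json_string)
price_json <- app_data$quote$price$raw

# Option 2: locate the value by its position relative to a stable id.
price_pos <- page %>%
  html_node("#quote-header-info div:nth-of-type(2) span:nth-child(1)") %>%
  html_text() %>%
  as.numeric()

print(price_json)
print(price_pos)
```

The JSON route tends to survive cosmetic redesigns as long as the embedded object keeps its shape, while the positional route depends only on the element order under the id.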
