
Extracting text/number from tspan class tag HTML with R

I am trying to extract the current production number from this website, http://okg.se/sv/Produktionsinformation/ (it is shown in the blue area of the page).

Here is the part of the HTML I need to use:

<tspan dy="0" style="-webkit-tap-highlight-color: rgba(0, 0, 0, 0);">518</tspan>

Here is the code I used:

url <- "http://okg.se/sv/Produktionsinformation//"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
content %>% html_nodes(".content__info__item__value")

But the result I get shows that there are no nodes available:

{xml_nodeset (0)}

Do you have any ideas on how to solve this issue?

Thanks in advance!

I am not entirely sure which value you need, but this works:

library(rvest)

# page url
url <- "http://okg.se/sv/Produktionsinformation/"

# current value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-current")

# Max value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-max")

The HTML you see in your browser has been processed by JavaScript, so it isn't the same as the raw HTML that rvest reads.
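To illustrate the difference without hitting the network, here is a minimal sketch. The `raw_html` string is a hypothetical stand-in for what the server might send before any JavaScript runs; the real page's markup may differ, and the attribute values are made up for the example.

```r
library(rvest)

# Hypothetical raw HTML (an assumption, not copied from the real site):
# the gauge div carries the numbers as data attributes, and no <tspan> exists yet.
raw_html <- '<html><body>
  <div id="gauge" class="footer__gauge" data-current="518" data-max="1450"></div>
</body></html>'

page <- read_html(raw_html)

# The <tspan> would only be created later by JavaScript in a browser,
# so selecting it in the raw HTML finds nothing.
n_tspan <- length(html_nodes(page, "tspan"))

# The value is still reachable as an attribute of the gauge div.
current <- html_attr(html_node(page, "#gauge"), "data-current")
```

In this sketch `n_tspan` is zero while `current` holds the attribute value as a string, which mirrors why the original selector returned an empty node set.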

The raw data you are looking for is actually stored in the attributes of a div with the id "gauge", so you can get it like this:

library(rvest)
#> Loading required package: xml2

"http://okg.se/sv/Produktionsinformation//" %>%
read_html()                                 %>%
html_node("#gauge")                         %>% 
html_attrs()                                %>%
`[`(c("data-current", "data-max"))
#> data-current     data-max 
#>        "553"       "1450"

Note that you don't need to save the HTML to your local drive to process it. You can read it directly from the internet by passing the URL to read_html.
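Since the attributes come back as character strings, you may want to convert them before doing any arithmetic. A small sketch, assuming a named character vector shaped like the output above; the variable names `gauge`, `current_value`, `max_value` and `share` are illustrative only.

```r
# Hypothetical input, shaped like the named character vector shown above.
gauge <- c("data-current" = "553", "data-max" = "1450")

# Convert the strings to numbers before computing with them.
current_value <- as.numeric(gauge[["data-current"]])
max_value     <- as.numeric(gauge[["data-max"]])

# Current value as a fraction of the maximum.
share <- current_value / max_value
```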

Created on 2020-02-20 by the reprex package (v0.3.0)
