
Extracting the date from an HTML element using R

I'm trying to extract dates from an online community using R. I'm a newcomer and am not having much luck with the R package: it seems to pull a huge list rather than any specific date or time.

I've tried using the rvest package to read the URL and then select the HTML element I want to extract the date from. I just can't find the date anywhere within it.

This is what I've tried so far.

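  library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

  # Read the thread page, then take the text of the "latest post time" nodes
  discussion <- read_html("https://en.community.sonos.com/wireless-speakers-228992/bass-cutting-out-on-play-5-will-come-back-intermittently-when-volume-is-turned-up-5568948")
  local.date <- discussion %>%
    html_nodes(".qa-latest-post-time") %>%
    html_text()
  discussion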

Is there a better way?

Ideally I'd get a specific date (and time) from this. If not, at least a specific date would be useful.

You're selecting the nodes' text, but the date information is stored in an attribute (you can find this out by printing the HTML nodes themselves):

  discussion %>% html_nodes('.qa-latest-post-time') %>% html_attr('datetime')
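To make the difference between a node's text and its attribute concrete, here is a minimal offline sketch. The inline HTML is a made-up stand-in for the forum page: the `time` tag, the visible label, and the `datetime` value are all assumptions for illustration, and the real markup may differ. Only the `.qa-latest-post-time` class and the `datetime` attribute name come from the code above.

```r
library(rvest)

# Hypothetical markup standing in for the forum page (an assumption, not the
# real page source): the visible text is a relative label, while the actual
# date sits in the `datetime` attribute.
html <- '<div>
  <time class="qa-latest-post-time" datetime="2019-04-02">2 months ago</time>
</div>'
page <- read_html(html)

nodes <- html_nodes(page, ".qa-latest-post-time")

# Printing the nodes shows the attributes, which is how you spot `datetime`
print(nodes)

# html_text() returns only the visible label, not the date
label <- html_text(nodes)

# html_attr() returns the attribute value as a character vector
raw_dates <- html_attr(nodes, "datetime")

# Convert to Date; the format string assumes the value starts with YYYY-MM-DD
post_dates <- as.Date(raw_dates, format = "%Y-%m-%d")
print(post_dates)
```

If the attribute on the real page uses a different layout, the `format` argument would need adjusting accordingly.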

As for "Ideally I'd get a specific date (and time) from this": the site's source code does not seem to contain post times, at least not in your example.
