Retrieving text (html node) from Javascript using R

I am trying to retrieve the quote "I understood at a very early age...spirit of the universe." and the author's name "Alice Walker" from the following HTML:

<div id="qpos_4_3" class="m-brick grid-item boxy bqQt" style="position: absolute; left: 0px; top: 33815px;">
  <div class="">
    <a href="/quotes/quotes/a/alicewalke625815.html?src=t_age" class="b-qt qt_625815 oncl_q" title="view quote">I understood at a very early age that
    in nature, I felt everything I should feel in church but never did.
    Walking in the woods, I felt in touch with the universe and with the
    spirit of the universe.
    </a>
    <a href="/quotes/authors/a/alice_walker.html" class="bq-aut qa_625815 oncl_a" title="view author">Alice Walker</a>
  </div>
  <div class="kw-box">
    <a href="/quotes/topics/topic_nature.html" class="oncl_k" data-idx="0">Nature</a>,
  </div>

I used Chrome's developer tools to get the XPath. The following code is intended to extract the quote, but it outputs character(0). What am I doing wrong?

link <-  "https://www.brainyquote.com/quotes/topics/topic_age.html"
quote <- read_html(link)

quote %>%
  html_nodes(xpath = '//*[@id="qpos_4_3"]/div[1]/a[1]') %>% 
  html_attr('view quote')

You were nearly there with your attempt. html_attr looks up an attribute by its name, and 'view quote' is the value of the title attribute rather than an attribute name, so it cannot return the link text. You could extend your XPath expression to include the title you were trying to isolate with html_attr, but what you really wanted was xml_contents. I've added magrittr only for piping and readability; it is not otherwise required. I have also coerced the results to character, assuming you will use them as such further on.

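get_contents <- function(link, id, title) {

  # xml2 does the parsing; magrittr only supplies the %>% pipe
  require(xml2)
  require(magrittr)

  # Match the <a> with the given title inside the <div> with the given id
  xpath <- paste0(".//div[@id='", id, "']//a[@title='", title, "']")

  read_html(link) %>%
    xml_find_first(xpath) %>%  # first node matching the XPath
    xml_contents() %>%         # its child nodes (here, the link text)
    as.character()             # coerce to a character vector

}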

link <-  "https://www.brainyquote.com/quotes/topics/topic_age.html"
id <- "qpos_1_10"

quote <- get_contents(link, id, "view quote")

# [1] "In our age there is no such thing as 'keeping out of politics.' All
# issues are political issues, and politics itself is a mass of lies,
# evasions, folly, hatred and schizophrenia."

author <- get_contents(link, id, "view author")

# [1] "George Orwell"
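
To see the difference between an attribute lookup and the node text without hitting the site, here is a minimal offline sketch. It assumes only that xml2 is installed; the `html` string is a trimmed copy of the markup from the question, and the variable names (`quote_node`, `wrong`, and so on) are just illustrative. It uses xml_text as an alternative to xml_contents plus as.character, which is a substitution on my part rather than what the function above does.

```r
library(xml2)

# Trimmed inline copy of the markup from the question (no network access needed)
html <- '<div id="qpos_4_3" class="m-brick grid-item boxy bqQt">
  <div class="">
    <a href="/quotes/quotes/a/alicewalke625815.html?src=t_age" class="b-qt qt_625815 oncl_q" title="view quote">I understood at a very early age that in nature, I felt everything I should feel in church but never did. Walking in the woods, I felt in touch with the universe and with the spirit of the universe.</a>
    <a href="/quotes/authors/a/alice_walker.html" class="bq-aut qa_625815 oncl_a" title="view author">Alice Walker</a>
  </div>
</div>'

doc <- read_html(html)

# Same XPath pattern as in get_contents(): <a> with a given title inside the <div>
quote_node  <- xml_find_first(doc, ".//div[@id='qpos_4_3']//a[@title='view quote']")
author_node <- xml_find_first(doc, ".//div[@id='qpos_4_3']//a[@title='view author']")

# "view quote" is an attribute value, not an attribute name, so this lookup is NA
wrong <- xml_attr(quote_node, "view quote")

# Asking for the attribute that actually exists returns its value
title_value <- xml_attr(quote_node, "title")

# xml_text() returns the text inside the node; trim = TRUE strips outer whitespace
quote_text  <- xml_text(quote_node, trim = TRUE)
author_text <- xml_text(author_node, trim = TRUE)

print(quote_text)
print(author_text)
```

Note that on a matched node the wrong attribute name gives NA rather than character(0). Getting character(0) back, as in the question, means the XPath matched no nodes at all, so it is worth checking the id against the HTML that read_html actually receives and not only what the browser inspector shows.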
