
Extract text from a JavaScript webpage

I want to use R to extract some text from a website, but I am not able to access the text using rvest. The area I am interested in is the 'Principal Investment Strategies' section. If I can extract that section, I can use grep to analyze the text further. Obtaining the section in an extractable format is proving to be a challenge.

The link to the site is: http://quote.morningstar.com/etf-filing/Summary-Prospectus/2017/8/28/t.aspx?t=AGG&ft=497K&d=c6995d020ec0f1b3592873780a199bd1

You can use rvest to extract the complete text of that part (an iframe), and then perhaps use a regex or a tokenizer to pull the part you want out of the text:

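library(rvest)
library(magrittr)  # provides extract() for use in the pipe

link <- 'http://quote.morningstar.com/etf-filing/Summary-Prospectus/2017/8/28/t.aspx?t=AGG&ft=497K&d=c6995d020ec0f1b3592873780a199bd1'

link %>%
  read_html() %>%            # fetch the outer page
  html_nodes("iframe") %>%   # collect every iframe on it
  extract(4) %>%             # keep the fourth iframe
  html_attr("src") %>%       # read its source URL
  read_html() %>%            # fetch the iframe document
  html_text()                # return its full text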
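
The regex step mentioned above could look roughly like the following. This is a minimal sketch in base R, not the answerer's code: `txt` is a toy stand-in for the `html_text()` output, and `end_heading` is a hypothetical placeholder for whichever heading follows the section in the real filing.

```r
# Toy stand-in for the html_text() output; the real filing text will differ.
txt <- paste(
  "Investment Objective The fund seeks to track an index.",
  "Principal Investment Strategies The fund invests mainly in bonds.",
  "Summary of Principal Risks The fund is subject to market risk.",
  sep = "\n"
)

# The start heading comes from the question; the end heading is an
# assumption and must be replaced with the heading that really follows.
start_heading <- "Principal Investment Strategies"
end_heading   <- "Summary of Principal Risks"

# (?s) lets "." match newlines; ".*?" stops at the first end heading.
pattern <- paste0("(?s)", start_heading, "(.*?)", end_heading)
m <- regmatches(txt, regexec(pattern, txt, perl = TRUE))[[1]]

# m[1] is the full match, m[2] the captured section body (if found).
section <- if (length(m) >= 2) trimws(m[2]) else NA_character_
section
```

If the pattern does not match, `section` is `NA`, which makes a wrong end heading easy to spot before running grep on the result.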

I'm taking it on faith that you've done what you said (it's hard to be sure with no code sample).

This grabs that text with precise targeting: start with the original URL, find that iframe, then select just the <div> that contains the text.

library(rvest)

read_html("http://quote.morningstar.com/etf-filing/Summary-Prospectus/2017/8/28/t.aspx?t=AGG&ft=497K&d=c6995d020ec0f1b3592873780a199bd1") %>% 
  html_node("iframe.sec_frame") %>% 
  html_attr("src") %>% 
  read_html() -> pg

html_node(pg, xpath=".//div[contains(., 'Principal Investment Strategies')]") %>% 
  html_text()
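
One thing to watch with `contains(., ...)`: it tests a node's whole text, so every ancestor <div> of the target also matches, and `html_node()` returns the first match in document order. The sketch below uses a made-up toy page (the ids `outer`, `other` and `target` are hypothetical, and the real iframe markup is not known here) to show one way of keeping only the innermost matching <div>.

```r
library(rvest)

# Toy page standing in for the iframe document; the real markup will differ.
pg <- read_html('<html><body>
  <div id="outer">
    <div id="other">Investment Objective ...</div>
    <div id="target">Principal Investment Strategies The fund invests in bonds.</div>
  </div>
</body></html>')

heading <- "Principal Investment Strategies"

# Every div whose text (including descendants) contains the heading.
all_hits <- html_nodes(pg, xpath = sprintf("//div[contains(., '%s')]", heading))
html_attr(all_hits, "id")

# Keep only a matching div that has no matching div beneath it.
innermost <- html_node(pg, xpath = sprintf(
  "//div[contains(., '%1$s') and not(.//div[contains(., '%1$s')])]", heading))
html_attr(innermost, "id")
```

Whether this extra condition is needed depends on how the real page nests its <div> elements; if the section sits in a top-level <div>, the simpler XPath is enough.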
