
Scraping HTML with Rvest: no text

I am trying to scrape information from HTML webpages. I have the direct links but for some reason cannot get to the relevant text.

These are two examples of the webpages:

http://151.12.58.148:8080/CPC/CPC.detail.html?A00002
http://151.12.58.148:8080/CPC/CPC.detail.html?A00003

After I read the HTML, I am left with all the source code except the relevant text (which should change from page to page).

For example, the first link gives a page with this:

data di nascita 1872

which is coded, when I inspect it in my browser, as:

<p y:role="datasubset" y:arg="DATA_NASCITA" class="smalltitle">
     <span class="celllabel">data di nascita</span>
&nbsp;
<span y:role="multivaluedcontent" y:arg="DATA_NASCITA">1872</span>
        </p>

However, when I read it with my code:

library(rvest)

link <- 'http://151.12.58.148:8080/CPC/CPC.detail.html?A00002'
page <- read_html(link)  # fetches the static HTML only; no JavaScript is executed
write.table(as.character(page), "page.txt")

and print "page" to check what I am getting, the same part of the code is:

 <p y:role=\"datasubset\" y:arg=\"NASCITA\" class=\"smalltitle\">
     <span class=\"celllabel\">luogo di nascita</span> 
<span y:role=\"multivaluedcontent\" y:arg=\"NASCITA\"></span>
        </p>

without 1872, which is the piece of information I am interested in (and also without &nbsp;, not sure if that is indicative of anything).

I can't seem to get around it. Would anyone have suggestions? Thank you very much!

To expand a bit further, the site's HTML loads a bunch of JavaScript and then has a template which is filled in after the document loads; it also uses the query parameter as some type of value that gets computed. I tried to just read in the target JavaScript file and parse it with V8, but there are too many external dependencies.
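One way to confirm that the value is injected by JavaScript is to check whether the placeholder node is empty in the static markup. The sketch below parses the fragment quoted in the question from an inline string, so it does not contact the site; the variable names are illustrative only.

```r
library(rvest)

# Static markup as returned by read_html() in the question:
# the label span has text, the value span is an empty placeholder.
static_html <- '<html><body>
<p y:role="datasubset" y:arg="NASCITA" class="smalltitle">
  <span class="celllabel">luogo di nascita</span>
  <span y:role="multivaluedcontent" y:arg="NASCITA"></span>
</p>
</body></html>'

doc <- read_html(static_html)

# Collect the text of every span inside the record paragraph
span_text <- html_text(html_nodes(doc, "p.smalltitle span"), trim = TRUE)
print(span_text)

# TRUE when at least one span carries no text, i.e. the template is unfilled
has_empty_placeholder <- any(span_text == "")
print(has_empty_placeholder)
```

If the same check on the real response also reports an empty placeholder, the text is not in the HTML that rvest receives and a rendering step is needed.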

To read this, you'll need to use something like splashr or seleniumPipes. I'm partial to the former as I wrote it 😎.

Using either requires running an external program. I will not go into how to install Splash or Selenium in this answer. That's legwork you have to do, but splashr makes it pretty easy to use Splash if you are comfortable with Docker.

This bit sets up the necessary packages and starts the Splash server (it will auto-download it first if Docker is available on your system):

library(rvest)
library(splashr)
library(purrr)

start_splash()

This next bit tells Splash to fetch & render the page and then retrieves the page content after JavaScript has done its work:

splash_local %>% 
  splash_response_body(TRUE) %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go("http://151.12.58.148:8080/CPC/CPC.detail.html?A00002") %>%
  splash_wait(2) %>% 
  splash_html() -> pg
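Since the question involves several pages that differ only in the query string, the same pipeline could be wrapped in a small function. This is a sketch, not part of splashr: `detail_url()` and `fetch_detail()` are hypothetical helper names, the calls inside are the same ones used above written without the pipe, and calling `fetch_detail()` assumes a Splash server is already running.

```r
# Build the detail-page URL for a record id such as "A00002"
detail_url <- function(id) {
  paste0("http://151.12.58.148:8080/CPC/CPC.detail.html?", id)
}

# Render one page through Splash and return the parsed document.
# Hypothetical wrapper: requires splashr and a running Splash instance.
fetch_detail <- function(id, wait = 2) {
  req <- splashr::splash_response_body(splashr::splash_local, TRUE)
  req <- splashr::splash_user_agent(req, splashr::ua_macos_chrome)
  req <- splashr::splash_go(req, detail_url(id))
  req <- splashr::splash_wait(req, wait)  # give the JavaScript time to fill the template
  splashr::splash_html(req)
}

ids <- c("A00002", "A00003")
urls <- vapply(ids, detail_url, character(1))
print(urls)

# With Splash running, the pages could then be fetched with:
# pages <- lapply(ids, fetch_detail)
```

The wait time is kept as a parameter because the two seconds used above may not suit every page.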

Unfortunately, it's still a mess. They used namespaces, which are fine in XML docs but somewhat problematic the way they've used them here. But we can work around that with some clever XPath:

html_nodes(pg, "body") %>% 
  html_nodes(xpath=".//*[local-name()='h4' or local-name()='p' or local-name()='span']/text()") %>% 
  html_text(trim=TRUE) %>% 
  discard(`==`, "")
##  [1] "Abachisti Vittorio"                        "data di nascita"                           "1872"                                     
##  [4] "luogo di nascita"                          "Mirandola, Modena, Emilia Romagna, Italia" "luogo di residenza"                       
##  [7] "Mirandola, Modena, Emilia Romagna, Italia" "colore politico"                           "socialista"                               
## [10] "condizione/mestiere/professione"           "falegname"                                 "annotazioni riportate sul fascicolo"      
## [13] "radiato"                                   "Unità archivistica"                       "busta"                                    
## [16] "1"                                         "estremi cronologici"                       "1905-1942"                                
## [19] "nel fascicolo è presente"                 "scheda biografica"                         "A00002"                                   
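The result is a flat character vector in which each label is followed by its value. A minimal base-R sketch for pulling out specific fields is shown here, assuming the labels appear verbatim and the value is always the next element; `txt` holds the first thirteen entries of the output above, and `labels` and `record` are illustrative names.

```r
# First thirteen entries of the character vector printed above
txt <- c(
  "Abachisti Vittorio",
  "data di nascita", "1872",
  "luogo di nascita", "Mirandola, Modena, Emilia Romagna, Italia",
  "luogo di residenza", "Mirandola, Modena, Emilia Romagna, Italia",
  "colore politico", "socialista",
  "condizione/mestiere/professione", "falegname",
  "annotazioni riportate sul fascicolo", "radiato"
)

# Fields of interest; each is assumed to be immediately followed by its value
labels <- c("data di nascita", "luogo di nascita", "colore politico")

# Locate each label and take the element right after it
idx <- match(labels, txt)
record <- setNames(txt[idx + 1], labels)

print(record)
record[["data di nascita"]]  # "1872"
```

Looking fields up by label rather than by position avoids depending on the unpaired entries in the vector, such as the name at the start.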

Do this after you're all done with Splash/splashr to remove the running Docker container:

killall_splash()
