Web scraping in R using rvest

I am trying to extract the text from a paragraph under the heading "Operaciones de seguro". I have located it in the source code, but I can't figure out what to put in the html_node. I have so far:

arg_taxr <- html("http://www.afip.gob.ar/futCont/otros/sistemaTributarioArgentino/")
arg_taxr %>%
  html_node("strong.(u)") %>%
  html_text()

The source code is:

<br />
<a name="u" id="u"></a><br />
<strong>Operaciones de Seguro.</strong><br />
<br />
Son de fuente argentina las ganancias provenientes de operaciones de seguros que cubran riesgos sobre bienes situados en Argentina o en relación con personas residentes de Argentina. <br />
<br />
Las sumas originadas como indemnizaciones de primas de seguro generalmente se consideran como compensación por una pérdida de capital del beneficiario y no se encuentran sujetas al gravamen. No obstante, los excedentes de las primas de seguro sobre el costo de los activos perdidos (menos el valor del bien recuperado) se consideran ganancia imponible<br />



The text you are looking for is not wrapped in an HTML element of its own; it sits as bare text between <br /> tags, which makes it hard to select. I wasn't able to do it with rvest, but here is what I came up with: read the page as lines, find the header wrapped in strong tags, and keep everything up to the next "volver" (back) link, which marks the end of the section. Then collapse the lines with paste0 and strip the remaining <br /> tags.

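a <- readLines("http://www.afip.gob.ar/futCont/otros/sistemaTributarioArgentino")
# Line holding the section header (fixed = TRUE so "." is matched literally)
where_strong <- grep("<strong>Operaciones de Seguro.</strong>", a, fixed = TRUE)[1]
# Keep everything after the header; the parentheses matter because ":" binds tighter than "+"
a <- a[(where_strong + 1):length(a)]
# The first "volver" (back) link marks the end of the section
where_volver <- grep("volver", a, fixed = TRUE)
a <- a[1:(where_volver[1] - 1)]
# Drop lines that are only a line break, then join and strip the remaining <br /> tags
a <- a[a != "<br />"]
a <- paste0(a, collapse = "\n")
a <- gsub("<br />", "", a, fixed = TRUE)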
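
As a rough, self-contained sketch of the same line-slicing idea, the snippet below wraps it in a function and runs it on a small made-up character vector instead of the live page. The function name extract_section, its arguments, and the sample lines are illustrative assumptions, not part of the original answer; only the overall approach (from the header line to the first "volver" line) comes from it.

```r
# Hypothetical stand-in for the output of readLines(); not the live page.
page <- c(
  "<strong>Otra seccion.</strong><br />",
  "Texto anterior.<br />",
  "<a href=\"#top\">[ volver ]</a>",
  "<strong>Operaciones de Seguro.</strong><br />",
  "<br />",
  "Primer parrafo. <br />",
  "<br />",
  "Segundo parrafo<br />",
  "<a href=\"#top\">[ volver ]</a>",
  "<strong>Siguiente seccion.</strong><br />"
)

# Hypothetical helper: return the text between the line containing `heading`
# and the next line containing `end_marker`, with <br /> tags removed.
extract_section <- function(lines, heading, end_marker = "volver") {
  start <- grep(heading, lines, fixed = TRUE)[1]
  if (is.na(start)) return(NA_character_)
  # Lines strictly after the heading (empty if the heading is the last line)
  rest <- lines[seq_len(length(lines) - start) + start]
  # Cut at the first end marker, if there is one
  end <- grep(end_marker, rest, fixed = TRUE)[1]
  if (!is.na(end)) rest <- rest[seq_len(end - 1)]
  # Strip the line-break tags and drop lines that end up empty
  rest <- trimws(gsub("<br />", "", rest, fixed = TRUE))
  paste(rest[nzchar(rest)], collapse = "\n")
}

section <- extract_section(page, "<strong>Operaciones de Seguro.</strong>")
cat(section, "\n")
```

Using seq_len() for the index ranges avoids the backwards sequences that 1:0 style ranges produce when the heading or the marker sits at the very edge of the input.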

Your frustration is understandable. The HTML on this page is quite challenging to parse. I spent a while on it but only got this far. The node that looks like the obvious target is greyed out in my Firebug view, so I don't know how to reach it.

If you can count the span elements with class 'contenido' and find the right one, you may be able to get the text you need by selecting the text nodes that are preceded by a line break (see the second xpathSApply call below).

library(XML)
doc <- htmlTreeParse("http://www.afip.gob.ar/futCont/otros/sistemaTributarioArgentino/", useInternalNodes = TRUE)

xpathSApply(doc, "//span[@class = 'contenido']", xmlValue) 
 [1] "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]"
[10] "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]" "[ volver ]"

xpathSApply(doc, "//span[@class = 'contenido']//text()[preceding-sibling::br]", xmlValue)

The XPath above came from looking for a way to select text between <br /> tags.

For what it's worth, I also tried the following, to no avail:

xpathSApply(doc, "//span[@class = 'contenido']//strong[contains(text(), 'Operaciones de Seguro.')]", xmlValue) 
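
Another route that might work, sketched here under assumptions: since the paragraphs appear to be bare text nodes that are siblings of the strong heading, XPath can select the text siblings that follow the heading and whose nearest preceding strong is still that heading. The HTML string below is a made-up fragment modeled on the source excerpt in the question, so this has not been checked against the live page, whose real nesting may differ. It uses xml2, the parser that rvest builds on.

```r
library(xml2)

# Hypothetical fragment shaped like the excerpt in the question; not the live page.
html <- '<html><body><div>
<strong>Otra seccion.</strong><br />
Texto anterior.<br />
<span class="contenido"><a href="#top">[ volver ]</a></span><br />
<a name="u" id="u"></a><br />
<strong>Operaciones de Seguro.</strong><br />
<br />
Primer parrafo. <br />
<br />
Segundo parrafo<br />
<span class="contenido"><a href="#top">[ volver ]</a></span><br />
<strong>Siguiente seccion.</strong><br />
Texto posterior.<br />
</div></body></html>'

doc <- read_html(html)

# Text siblings after the target heading, kept only while the closest
# preceding <strong> sibling is still that heading (so the next section is excluded).
xpath <- paste0(
  "//strong[contains(., 'Operaciones de Seguro')]",
  "/following-sibling::text()",
  "[preceding-sibling::strong[1][contains(., 'Operaciones de Seguro')]]"
)

nodes <- xml_find_all(doc, xpath)
txt <- trimws(xml_text(nodes))
txt <- txt[nzchar(txt)]  # drop whitespace-only text nodes between tags
print(txt)
```

The "[ volver ]" label is skipped without any special handling because it lives inside the span and link elements rather than being a direct text sibling of the heading.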
