Extract “style” information when scraping data from HTML using XML in R

I used the script below to try to extract the data from an HTML file that was converted from a PDF.

temp.html <- scan(file=filename,what="character")
pagetree <- htmlTreeParse(temp.html, error=function(...){}, useInternalNodes = TRUE)
tx.raw <- getNodeSet(pagetree,"//div")

tx.raw is a list of nodes; one of its elements is shown below:

tx.raw[[170]]

[[170]]
<div style="position:absolute;top:985;left:748">
  <nobr>
    <span class="ft03"> 




971.72
 </span>
  </nobr>
</div> 

The information I need is inside the span (i.e. 971.72), but I also need the style attribute of the div, because it tells me exactly where the data in the span is located in the PDF file. How can I extract the style information as well? Thanks.

I would do that with a simple regexp:

sub('.*style="([0-9a-z;:]*)".*', '\\1', t)

Where t holds the corresponding HTML part as text.


Lengthy example based on your demo HTML part:

## loading your demo HTML part to one line
t <- paste(readLines(textConnection('<div style="position:absolute;top:985;left:748">
  <nobr>
    <span class="ft03">




971.72
 </span>
  </nobr>
</div>')), collapse = '')

## let us extract some parts!
library(XML)
t.html <- htmlTreeParse(t, useInternalNodes = TRUE)
t.val <- xpathApply(t.html, '//div', xmlValue)
t.val <- gsub('\\s', '', t.val)
t.style <- sub('.*style="([0-9a-z;:]*)".*', '\\1', t)

Depending on how you parsed the HTML beforehand, most of the lines above can of course be dropped.

Results:

> t.val
[1] "971.72"
> t.style
[1] "position:absolute;top:985;left:748"
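If the divs are already available as parsed nodes (as with getNodeSet in the question), the attribute can probably be read straight from each node instead of going through a regexp on the raw text. The following is only a sketch under that assumption: it uses xmlGetAttr from the XML package, and the names html, doc, divs, styles, values and result are made up for illustration. The demo HTML is the fragment from the question, collapsed to one line.

```r
library(XML)

# Demo fragment from the question, collapsed to a single line
html <- '<div style="position:absolute;top:985;left:748"><nobr><span class="ft03"> 971.72 </span></nobr></div>'

# Parse the text and collect every <div> node
doc  <- htmlParse(html, asText = TRUE)
divs <- getNodeSet(doc, "//div")

# Read the style attribute of each div; divs without one give NA
styles <- sapply(divs, xmlGetAttr, "style", default = NA_character_)

# Text content of each div, with surrounding whitespace stripped
values <- trimws(sapply(divs, xmlValue))

# Keep position and value side by side (one row per div)
result <- data.frame(style = styles, value = values, stringsAsFactors = FALSE)
print(result)
```

Because styles and values are built from the same node list, the i-th style always belongs to the i-th value, which is harder to guarantee when the style is pulled out of the raw text separately.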

Extracting top and left could be handled with a similar regexp; I have not dealt with it here, as I am not sure whether e.g. left and top are static strings.
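One way to get at top and left without hard-coding their order is to split the style string into property/value pairs. This is a minimal base R sketch, not part of the original answer: parse_style is a hypothetical helper, and it assumes simple "name:value" declarations separated by semicolons, with no colons inside values and plain numbers without units (as in the demo string).

```r
# Hypothetical helper: turn "a:1;b:2" into a named character vector
parse_style <- function(style) {
  # Split into declarations and drop empty ones (e.g. after a trailing ";")
  decls <- strsplit(style, ";", fixed = TRUE)[[1]]
  decls <- decls[nzchar(trimws(decls))]

  # Split each declaration into property name and value
  kv   <- strsplit(decls, ":", fixed = TRUE)
  keys <- trimws(vapply(kv, `[`, character(1), 1))
  vals <- trimws(vapply(kv, `[`, character(1), 2))

  setNames(vals, keys)
}

pos  <- parse_style("position:absolute;top:985;left:748")

# Look the coordinates up by name, then convert to numbers
top  <- as.numeric(pos[["top"]])
left <- as.numeric(pos[["left"]])
```

Looking the values up by name means the result does not depend on whether top comes before left in the attribute. If the converter ever emits units such as "985px", as.numeric would return NA and the unit would have to be stripped first.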
