Extract “style” information when scraping data in HTML using XML in R

Question

I used the script below to try to extract data from an HTML file that was converted from a PDF.

temp.html <- scan(file=filename,what="character")
pagetree <- htmlTreeParse(temp.html, error=function(...){}, useInternalNodes = TRUE)
tx.raw <- getNodeSet(pagetree,"//div")

tx.raw is a list of nodes; one of its elements is shown below:

tx.raw[[170]]

[[170]]
<div style="position:absolute;top:985;left:748">
  <nobr>
    <span class="ft03"> 




971.72
 </span>
  </nobr>
</div>

The information I need is inside the span (i.e. 971.72), but I also need the style attribute of the div, which tells me exactly where that piece of data is located in the PDF file. How can I extract the style information as well? Thanks.

Answer 1

I would do that with a simple regexp:

sub('.*style="([0-9a-z;:]*)".*', '\\1', t)

Where t holds the corresponding HTML part as text.

Lengthy example based on your demo HTML part:

## loading your demo HTML part to one line
t <- paste(readLines(textConnection('<div style="position:absolute;top:985;left:748">
  <nobr>
    <span class="ft03">




971.72
 </span>
  </nobr>
</div>')), collapse = '')

## let us extract some parts!
library(XML)
t.html <- htmlTreeParse(t, useInternalNodes = TRUE)
t.val <- xpathApply(t.html, '//div', xmlValue)
t.val <- gsub('\\s', '', t.val)
t.style <- sub('.*style="([0-9a-z;:]*)".*', '\\1', t)

Depending on how you have already parsed the HTML, most of the lines above can of course be dropped.

Results:

> t.val
[1] "971.72"
> t.style
[1] "position:absolute;top:985;left:748"

Extracting top and left could be handled similarly; I have not dealt with that here, as I am not sure whether e.g. left and top are static strings.
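
As a rough sketch of that last step, and as an alternative to running a regexp over the raw HTML text, the style attribute can be read straight from the parsed nodes with xmlGetAttr and then split into top and left. This assumes the style strings always look like the one in the question (semicolon-separated name:value pairs with unitless numbers). The helper style_prop and the names doc, divs, style and result are made up for this illustration and are not part of the XML package.

```r
library(XML)

# Demo fragment copied from the question
html <- '<div style="position:absolute;top:985;left:748">
  <nobr>
    <span class="ft03">
971.72
 </span>
  </nobr>
</div>'

doc  <- htmlParse(html, asText = TRUE)
divs <- getNodeSet(doc, "//div[@style]")

# Read the style attribute directly from each div node
style <- sapply(divs, xmlGetAttr, "style")

# Hypothetical helper: pull one numeric property (e.g. "top") out of
# style strings; returns NA where the property is missing
style_prop <- function(style, prop) {
  pattern <- paste0("(^|;)[[:space:]]*", prop, "[[:space:]]*:[[:space:]]*(-?[0-9.]+)")
  m <- regmatches(style, regexec(pattern, style))
  as.numeric(sapply(m, function(x) if (length(x) >= 3) x[3] else NA_character_))
}

# One row per div: the text value plus its position on the page
result <- data.frame(
  value = as.numeric(trimws(sapply(divs, xmlValue))),
  top   = style_prop(style, "top"),
  left  = style_prop(style, "left")
)
result
```

Anchoring the pattern on the start of the string or a semicolon keeps "top" from matching inside a longer property name such as margin-top. If the real file uses units (e.g. 985px) or other properties, the pattern would need adjusting.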
