I used the script below to try to extract data from an HTML file converted from a PDF.
library(XML)  # provides htmlTreeParse() and getNodeSet()
temp.html <- scan(file=filename,what="character")
pagetree <- htmlTreeParse(temp.html, error=function(...){}, useInternalNodes = TRUE)
tx.raw <- getNodeSet(pagetree,"//div")
tx.raw is a list; one of its elements is shown below:
tx.raw[[170]]
[[170]]
<div style="position:absolute;top:985;left:748">
<nobr>
<span class="ft03">
971.72
</span>
</nobr>
</div>
The information I need is inside the span (i.e. 971.72), but I also need the style attribute of the div, which tells me exactly where that piece of data is located in the PDF file. How can I extract the style information as well? Thanks.
I would do that with a simple regexp:
sub('.*style="([0-9a-z;:]*)".*', '\\1', t)
where t holds the corresponding HTML part as text.
Lengthy example based on your demo HTML part:
## loading your demo HTML part to one line
t <- paste(readLines(textConnection('<div style="position:absolute;top:985;left:748">
<nobr>
<span class="ft03">
971.72
</span>
</nobr>
</div>')), collapse = '')
## let us extract some parts!
library(XML)
t.html <- htmlTreeParse(t, useInternalNodes = TRUE)
t.val <- xpathApply(t.html, '//div', xmlValue)
t.val <- gsub('\\s', '', t.val)
t.style <- sub('.*style="([0-9a-z;:]*)".*', '\\1', t)
Depending on how you parsed the HTML beforehand, most of the lines above can of course be dropped.
Results:
> t.val
[1] "971.72"
> t.style
[1] "position:absolute;top:985;left:748"
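If the document is already parsed with the XML package, the attribute can also be read straight from the node instead of going through a regexp. The following is only a minimal sketch under that assumption: it uses xmlGetAttr() and xmlValue() from the XML package, and the names html, divs, styles, values and result are made up for illustration. The demo fragment is the one from the question, written on one line.

```r
library(XML)

# Demo fragment from the question, collapsed to a single line
html <- '<div style="position:absolute;top:985;left:748"><nobr><span class="ft03">971.72</span></nobr></div>'
doc <- htmlParse(html, asText = TRUE)

# All div nodes, as in the question's getNodeSet() call
divs <- getNodeSet(doc, "//div")

# Read the style attribute of each div; NA when a div has no style
styles <- sapply(divs, function(d) xmlGetAttr(d, "style", default = NA_character_))

# Text content of each div, with surrounding whitespace removed
values <- trimws(sapply(divs, xmlValue))

# One row per div: where it sits on the page and what it contains
result <- data.frame(style = styles, value = values, stringsAsFactors = FALSE)
print(result)
```

This keeps each style paired with its value, which may be easier to work with when there are many div elements than running the regexp over the raw text.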
Extracting top and left could be handled similarly; I have not dealt with that here, as I am not sure whether e.g. left and top are static strings.
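For what it is worth, here is one possible way to pull top and left out of the style string using base R only. It is a sketch that assumes the style always has the form key:value;key:value with plain numeric positions (no units such as px); parse_style is a hypothetical helper name, not something from the XML package.

```r
# Hypothetical helper: turn "key:value;key:value" into a named character vector
parse_style <- function(style) {
  # Split into "key:value" pieces, then split each piece at the colon
  parts <- strsplit(strsplit(style, ";", fixed = TRUE)[[1]], ":", fixed = TRUE)
  # Keep only well-formed pairs
  parts <- parts[lengths(parts) == 2]
  vals <- trimws(vapply(parts, `[`, character(1), 2))
  names(vals) <- trimws(vapply(parts, `[`, character(1), 1))
  vals
}

st <- parse_style("position:absolute;top:985;left:748")

# Assumes top and left are plain numbers; a value like "985px" would become NA
pos <- as.numeric(st[c("top", "left")])
print(st)
print(pos)
```

Because the keys are looked up by name, the order of top and left inside the style attribute does not matter.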