R XML package weird bug while parsing xml and html files

I am using R's XML package to extract all the text from a wide variety of HTML and XML files. These files are mostly documentation, build properties, or readme files. Here is one example:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE chapter PUBLIC '-//OASIS//DTD DocBook XML V4.1.2//EN'
                      'http://www.oasis-open.org/docbook/xml/4.0 docbookx.dtd'>

<chapter lang="en">
<chapterinfo>
<author>
<firstname>Jirka</firstname>
<surname>Kosek</surname>
</author>
<copyright>
<year>2001</year>
<holder>Ji&rcaron;&iacute; Kosek</holder>
</copyright>
<releaseinfo>$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp $</releaseinfo>
</chapterinfo>
<title>Using XSL stylesheets to generate HTML Help</title>
<?dbhtml filename="htmlhelp.html"?>

<para>HTML Help (HH) is help-format used in newer versions of MS
Windows and applications written for this platform. This format allows
to pack several HTML files together with images, table of contents and
index into single file. Windows contains browser for this file-format
and full-text search is also supported on HH files. If you want know
more about HH and its capabilities look at <ulink
url="http://msdn.microsoft.com/library/tools/htmlhelp/chm/HH1Start.htm">HTML
Help pages</ulink>.</para>

<section>
<title>How to generate first HTML Help file from DocBook sources</title>

<para>Working with HH stylesheets is same as with other XSL DocBook
stylesheets. Simply run your favorite XSLT processor on your document
with stylesheet suited for HH:</para>

</section>

</chapter>

My goal is simply to call xmlValue on the tree after parsing it with htmlTreeParse or xmlTreeParse, using something like this (for XML files):

Text = xmlValue(xmlRoot(xmlTreeParse(XMLFileName)))

However, I hit the same problem with both XML and HTML files. When a node has child nodes nested two or more levels deep, their text fields are pasted together with no space between them.

For example, in the above example

xmlValue(chapterInfo) is

JirkaKosek2001JiKosek$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp 

The xmlValue of each child node is (recursively) pasted together without any space in between. How can I get xmlValue to add whitespace between the pieces when extracting this data?

Thanks a lot for your help in advance,

Shivani

According to the documentation, xmlValue is meant for single text nodes, or for "XML nodes that contain a single text node". Whitespace that sits directly inside non-text nodes is apparently not kept.

However, even in the case of a single text node, your code would strip the whitespace.

library(XML)
doc <- xmlTreeParse("<a> </a>")
xmlValue(xmlRoot(doc))
# [1] ""

You can pass the ignoreBlanks = FALSE and useInternalNodes = TRUE arguments to xmlTreeParse to keep the whitespace inside text nodes.

doc <- xmlTreeParse(
  "<a> </a>", 
  ignoreBlanks = FALSE, 
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] " "

# Spaces inside text nodes are preserved
doc <- xmlTreeParse(
  "<a>foo <b>bar</b></a>", 
  ignoreBlanks = FALSE, 
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foo bar"

# Spaces between text nodes (inside non-text nodes) are not preserved
doc <- xmlTreeParse(
  "<a><b>foo</b> <b>bar</b></a>", 
  ignoreBlanks = FALSE, 
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foobar"
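
If the goal is just to get a space between the pieces, one possible workaround is to skip the whitespace question entirely: select every text node with XPath and join the values yourself. This is only a sketch. text_with_spaces is a hypothetical helper name, not something the XML package provides, and it assumes that a single space is an acceptable separator and that leading and trailing whitespace inside each text node can be trimmed.

```r
library(XML)

# Hypothetical helper (not part of the XML package): collect every text node
# below `node`, trim each piece, drop the empty ones, and join with one space.
text_with_spaces <- function(node) {
  pieces <- xpathSApply(node, ".//text()", xmlValue)
  pieces <- trimws(unlist(pieces))
  paste(pieces[nzchar(pieces)], collapse = " ")
}

# A cut-down version of the <chapterinfo> element from the question
doc <- xmlParse(
  "<chapterinfo><author><firstname>Jirka</firstname><surname>Kosek</surname></author><copyright><year>2001</year></copyright></chapterinfo>",
  asText = TRUE
)

text_with_spaces(xmlRoot(doc))
# [1] "Jirka Kosek 2001"
```

The same helper should work on a tree built with htmlParse, since xpathSApply only needs internal nodes. It will not work on the R-level tree that xmlTreeParse returns by default, so parse with xmlParse or with useInternalNodes = TRUE.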
