简体   繁体   中英

XPath and namespace specification for XML documents with an explicit default namespace

I'm struggling to get the correct combination of an XPath expression and the namespace specification as required by package XML (argument namespaces ) for a XML document that has an explicit xmlns namespace defined at the top element.

UPDATE

Thanks to har07 I was able to put it together:

Once you query the namespaces, the first entry of ns has no name yet and that's the problem:

nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))

> ns
                                             omegahat                          r 
    "http://something.org"  "http://www.omegahat.org" "http://www.r-project.org" 

So we'll just assign a name that serves as a prefix (this can be any valid R name):

names(ns)[1] <- "xmlns"

Now all we have to do is using that default namespace prefix everywhere in our XPath expressions:

getNodeSet(doc, "/xmlns:doc//xmlns:b[@omegahat:status='foo']", ns)

For those interested in alternative solutions based on name() and namespace-uri() (amongst others) might find this post helpful.


Just for the sake of reference: this was the trial-and-error code before we came to the solution:

Consider the example from ?xmlParse :

require("XML")

doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))

> doc
<?xml version="1.0"?>
<doc>
  <!-- A comment -->
  <a xmlns:omegahat="http://www.omegahat.org" xmlns:r="http://www.r-project.org">
    <b>
      <c>
        <b/>
      </c>
    </b>
    <b omegahat:status="foo">
      <r:d>
        <a status="xyz"/>
        <a/>
        <a status="1"/>
      </r:d>
    </b>
  </a>
</doc>
nsDefs <- xmlNamespaceDefinitions(getNodeSet(doc, "/doc/a")[[1]])
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))
getNodeSet(doc, "/doc//b[@omegahat:status='foo']", ns)[[1]]

In my document, however, the namespaces are already defined in <doc> tag, so I adapted the example XML code accordingly:

xml_source <- c(
  "<?xml version=\"1.0\"?>",
  "<doc xmlns:omegahat=\"http://www.omegahat.org\" xmlns:r=\"http://www.r-project.org\">",
  "<!-- A comment -->",
  "<a>",
  "<b>",
  "<c>",
  "<b/>",
  "</c>",
  "</b>",
  "<b omegahat:status=\"foo\">",
  "<r:d>",
  "<a status=\"xyz\"/>",
  "<a/>",
  "<a status=\"1\"/>",
  "</r:d>",
  "</b>",
  "</a>",
  "</doc>"
)
write(xml_source, file="exampleData_2.xml")  
doc <- xmlParse("exampleData_2.xml")
nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))    
getNodeSet(doc, "/doc", namespaces = ns)
getNodeSet(doc, "/doc//b[@omegahat:status='foo']", namespaces = ns)[[1]]  

Everything still works fine. What's more, though, is that my XML code additionally has an explicit definition of the default namespace ( xmlns ):

xml_source <- c(
  "<?xml version=\"1.0\"?>",
  "<doc xmlns=\"http://something.org\" xmlns:omegahat=\"http://www.omegahat.org\" xmlns:r=\"http://www.r-project.org\">",
  "<!-- A comment -->",
  "<a>",
  "<b>",
  "<c>",
  "<b/>",
  "</c>",
  "</b>",
  "<b omegahat:status=\"foo\">",
  "<r:d>",
  "<a status=\"xyz\"/>",
  "<a/>",
  "<a status=\"1\"/>",
  "</r:d>",
  "</b>",
  "</a>",
  "</doc>"  
)
write(xml_source, file="exampleData_3.xml")  
doc <- xmlParse("exampleData_3.xml")
nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))

What used to work fails now:

> getNodeSet(doc, "/doc", namespaces = ns)
list()
attr(,"class")
[1] "XMLNodeSet"
Warning message:
using http://something.org as prefix for default namespace http://something.org 

> getNodeSet(doc, "/xmlns:doc", namespaces = ns)
XPath error : Undefined namespace prefix
XPath error : Invalid expression
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression /xmlns:doc
In addition: Warning message:
using http://something.org as prefix for default namespace http://something.org 
getNodeSet(doc, "/xmlns:doc", 
  namespaces = matchNamespaces(doc, namespaces="xmlns", nsDefs = nsDefs)
)

This seems to get me closer:

> getNodeSet(doc, "/xmlns:doc",
+ namespaces = matchNamespaces(doc, namespaces="xmlns", nsDefs = nsDefs)
+ )[[1]]
<doc xmlns="http://something.org" xmlns:omegahat="http://www.omegahat.org" xmlns:r="http://www.r-project.org">
  <!-- A comment -->
  <a>
    <b>
      <c>
        <b/>
      </c>
    </b>
    <b omegahat:status="foo">
      <r:d>
        <a status="xyz"/>
        <a/>
        <a status="1"/>
      </r:d>
    </b>
  </a>
</doc> 

attr(,"class")
[1] "XMLNodeSet"

Yet, now I don't know how to proceed in order to get to the children nodes:

> getNodeSet(doc, "/xmlns:doc//b[@omegahat:status='foo']", ns)[[1]]
XPath error : Undefined namespace prefix
XPath error : Invalid expression
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression /xmlns:doc//b[@omegahat:status='foo']
In addition: Warning message:
using http://something.org as prefix for default namespace http://something.org 

> getNodeSet(doc, "/xmlns:doc//b[@omegahat:status='foo']",
+ namespaces = c(
+ matchNamespaces(doc, namespaces="xmlns", nsDefs = nsDefs),
+ matchNamespaces(doc, namespaces="omegahat", nsDefs = nsDefs)
+ )
+ )
list()
attr(,"class")
[1] "XMLNodeSet"

Namespace definition without prefix ( xmlns="..." ) is default namespace. In case of XML document having default namespace, the element where default namespace declared and all of it's descendant without prefix and without different default namespace declaration are considered in that aforementioned default namespace.

Therefore, in your case you need to use prefix registered for default namespace at the beginning of all elements in the XPath, for example :

/xmlns:doc//xmlns:b[@omegahat:status='foo']

UPDATE :

Actually I'm not a user of r , but looking at some references on net something like this may work :

getNodeSet(doc, "/ns:doc//ns:b[@omegahat:status='foo']", c(ns="http://something.org"))

I think @HansHarhoff provides a very good solution.

For anybody else still searching for a solution, in my experience I think the following works more generally since a single XML document can have multiple namespaces.

doc <- xmlInternalTreeParse(xml_source)

ns <- getDefaultNamespace(doc)[[1]]$uri
names(ns)[1] <- "xmlns"

getNodeSet(doc, "//xmlns:Parameter", namespaces = ns)

I had a similar issue, but in my case I did not care about the namespace and would like a solution that ignored the namespace.

Assume that we have the following XML in the variable myxml:

<root xmlns="uri:someuri.com:schema">
<Parameter>Test
</Parameter>
</root>

In R we want to read this so we run:

myxml <- '
<root xmlns="uri:someuri.com:schema">
  <Parameter>Test
</Parameter>
</root>
'
myxmltop <- xmlParse(myxml)
ns <- xmlNamespaceDefinitions(myxmltop, simplify =  TRUE)

Here I have simplified Rappster's code by using the simplify=TRUE parameter. Now we can add the name/prefix of the namespace as in Rappster's code:

names(ns)[1] <- "xmlns"

Now we can refer to this namespace by:

getNodeSet(myxmltop, "//xmlns:Parameter", namespaces =ns)

Simpler solution (ignoring namespaces)

We can also be more flexible by matching on any namespace by doing:

myxmltop <- xmlParse(myxml)
getNodeSet(myxmltop, "//*[local-name() = 'Parameter']")

This solution was inspired by this SO answer .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM