R xpath getnodeset “matches” command

I have an xml file.

<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <!-- A comment -->
  <a xmlns="http://www.tei-c.org/ns/1.0">
    <w>word
    </w>
    <w>wording
    </w>
</a>
</doc>

I would like to return nodes containing "word" but not "wording".

library(XML) # I have nothing against using library(xml2) or library(xml2r) instead
doc <- xmlParse("file.xml", encoding="UTF-8")
x <- c(x="http://www.tei-c.org/ns/1.0")

# starts-with seems to find the words just fine
test1 <- getNodeSet(doc, "//x:w[starts-with(., 'word')]", x)
# but R doesn't seem to allow "matches" to be included
# in the xpath query, hence none of the following work:
test1 <- getNodeSet(doc, "//x:w[[matches(., 'word')]]", x)
test1 <- getNodeSet(doc, "//x:w[@*[matches(., 'word')]]", x)
test1 <- getNodeSet(doc, "//x:w[matches(., '^word$')]", x)
test1 <- getNodeSet(doc, "//x:w[@*[matches(., '^word$')]]", x)

Update: If I use the term matches with any combination I get the following error and an empty list as result.

xmlXPathCompOpEval: function matches not found
XPath error : Unregistered function
XPath error : Invalid expression
XPath error : Stack usage error
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression //x:w[matches(., '^word$')]

If I look for "//x:w[@*[contains(., '^word$')]]" based on advice below, I get the following warning and empty list as result:

Warning message:
In xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  :
  the XPath query has no namespace, but the target document has a default namespace. 
 This is often an error and may explain why you obtained no results

I imagine I am just using the wrong commands. What should I change to make it work? Thanks!

Thanks for updating your question to include the error message. It's like going to a doctor and asking for treatment to solve your problem -- you definitely want to let him know what specific symptoms you've noticed!

And this error message confirms that the matches() function is missing. That indicates that R (at least, the version you're using) uses XPath 1.0, which does not have matches() or any other regular expression features; the XML package evaluates XPath through libxml2, which implements XPath 1.0 only. BaseX, on the other hand, supports XPath 2.0 (in fact it supports XPath 3.0, IIRC), so it can handle matches().
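If you really need a regular expression, one option is to select all the <w> elements with plain XPath 1.0 and do the matching in R. The following is only a sketch: it embeds a copy of the sample document as a string so that it runs without file.xml, and the names xml_text, ns, all_w, w_text and hits are made up for illustration.

```r
library(XML)

# Inline copy of the sample document (assumption: same structure as file.xml)
xml_text <- '<doc>
  <!-- A comment -->
  <a xmlns="http://www.tei-c.org/ns/1.0">
    <w>word
    </w>
    <w>wording
    </w>
  </a>
</doc>'

doc <- xmlParse(xml_text, asText = TRUE)
ns <- c(x = "http://www.tei-c.org/ns/1.0")

# Step 1: plain XPath 1.0, no regular expression involved
all_w <- getNodeSet(doc, "//x:w", ns)

# Step 2: trim the text of each node and apply the regex on the R side
w_text <- trimws(sapply(all_w, xmlValue))
hits <- all_w[grepl("^word$", w_text)]
```

This keeps the XPath trivial and leaves the pattern matching to grepl(), at the cost of pulling every <w> node into R first.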

Regarding how to do what you want in XPath 1.0, it's not entirely clear what you'd like to do. You mentioned using word boundary markers, so you could try something like

getNodeSet(doc, "//x:w[contains(concat(' ', normalize-space(.), ' '),
                                ' word ')]", x)

This will select <w> elements whose content includes word at the beginning and/or end of the text, or preceded/followed by whitespace. If you want to treat certain non-whitespace characters as word boundaries, you could translate them to whitespace using translate() .
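Here is a minimal, self-contained sketch of the whole-word test, assuming the sample document from the question (embedded as a string so it runs without file.xml; xml_text, ns, whole_word and exact are illustrative names). Note that it calls normalize-space() on the node text first and pads with spaces afterwards, because normalize-space() also strips leading and trailing whitespace and would otherwise remove the padding again.

```r
library(XML)

# Inline copy of the sample document (assumption: same structure as file.xml)
xml_text <- '<doc>
  <a xmlns="http://www.tei-c.org/ns/1.0">
    <w>word
    </w>
    <w>wording
    </w>
  </a>
</doc>'

doc <- xmlParse(xml_text, asText = TRUE)
ns <- c(x = "http://www.tei-c.org/ns/1.0")

# Whole-word test: collapse whitespace, pad with spaces, look for ' word '
whole_word <- getNodeSet(
  doc,
  "//x:w[contains(concat(' ', normalize-space(.), ' '), ' word ')]",
  ns
)

# Stricter variant: the trimmed text must be exactly 'word'
exact <- getNodeSet(doc, "//x:w[normalize-space(.) = 'word']", ns)

# Text of the selected nodes, with the trailing newline and indent removed
trimws(sapply(whole_word, xmlValue))
```

The first query also accepts elements such as <w>a word here</w>, while the second only accepts elements whose entire text is word, so pick whichever matches what you mean by "matches".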
