
Parsing a tag containing special characters with “xml2” in R

I'm using the xml2 package in R to parse my XML file. Everything works perfectly except for one tag, which has a dash in its name.

XML Sample:

<?xml version="1.0" encoding="UTF-8"?>
<abstracts-retrieval-response xmlns="http://www.elsevier.com/xml/svapi/abstract/dtd" xmlns:ait="http://www.elsevier.com/xml/ani/ait" xmlns:ce="http://www.elsevier.com/xml/ani/common" xmlns:cto="http://www.elsevier.com/xml/cto/dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/" xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <coredata>
    <prism:url>http://api.elsevier.com/content/abstract/scopus_id/85011891272</prism:url>
    <dc:identifier>SCOPUS_ID:85011891272</dc:identifier>
    <eid>2-s2.0-85011891272</eid>
    <prism:doi>10.1186/s13638-017-0812-8</prism:doi>
    <article-number>29</article-number>
    <dc:title>Performance of emerging multi-carrier waveforms for 5G asynchronous communications</dc:title>
    <prism:aggregationType>Journal</prism:aggregationType>
    <srctype>j</srctype>
    <citedby-count>0</citedby-count>
    <prism:publicationName>Eurasip Journal on Wireless Communications and Networking</prism:publicationName>
    <dc:publisher> Springer International Publishing </dc:publisher>
    <source-id>18202</source-id>
    <prism:issn>16871499</prism:issn>
    <prism:volume>2017</prism:volume>
    <prism:issueIdentifier>1</prism:issueIdentifier>
    <prism:coverDate>2017-12-01</prism:coverDate>
 </coredata>
</abstracts-retrieval-response>

I'm using this line of code to extract the text inside the prism:doi node (works as intended):

xml2::xml_text(xml2::xml_find_first(intermediateXML,"//prism:doi"))

The same code applied to "citedby-count", however, returns NA instead of the real value:

xml2::xml_text(xml2::xml_find_first(intermediateXML,"//citedby-count"))

My guess is that the parser is confused by the "-" inside the tag name. Is there a way to avoid this problem?

Did you try updating xml2? On my Mac, with xml2 version 1.1.1, it works:

doc <- read_xml(txt) %>% 
  xml_find_first("/coredata")

doc %>% xml_find_first("citedby-count") %>% xml_text # "0"
doc %>% xml_find_first("//citedby-count") %>% xml_text # "0"

If this doesn't work, you might try passing the namespaces explicitly through the ns argument:

doc %>% xml_find_first("citedby-count", ns = character()) %>% xml_text

Data and Packages

require(xml2)
require(magrittr)
txt <- '<coredata>
    <prism:url>http://api.elsevier.com/content/abstract/scopus_id/85011891272</prism:url>
<dc:identifier>SCOPUS_ID:85011891272</dc:identifier>
<eid>2-s2.0-85011891272</eid>
<prism:doi>10.1186/s13638-017-0812-8</prism:doi>
<article-number>29</article-number>
<dc:title>Performance of emerging multi-carrier waveforms for 5G asynchronous communications</dc:title>
<prism:aggregationType>Journal</prism:aggregationType>
<srctype>j</srctype>
<citedby-count>0</citedby-count>
<prism:publicationName>Eurasip Journal on Wireless Communications and Networking</prism:publicationName>
<dc:publisher> Springer International Publishing </dc:publisher>
<source-id>18202</source-id>
<prism:issn>16871499</prism:issn>
<prism:volume>2017</prism:volume>
<prism:issueIdentifier>1</prism:issueIdentifier>
<prism:coverDate>2017-12-01</prism:coverDate></coredata>'
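
Note that the txt snippet above has no xmlns declaration on its root, whereas the document in the question does. One possible explanation for the NA, offered as an assumption since the thread never confirms it, is that the default namespace is what hides the tag rather than the dash. The sketch below uses a shortened copy of the question's XML (the variable names are made up for illustration) and relies on xml2's convention of exposing a default namespace under the prefix d1.

```r
library(xml2)

# Shortened copy of the question's document; the root declares a default
# namespace, so unprefixed children such as <citedby-count> belong to it.
xml_txt <- '<abstracts-retrieval-response xmlns="http://www.elsevier.com/xml/svapi/abstract/dtd" xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
  <coredata>
    <prism:doi>10.1186/s13638-017-0812-8</prism:doi>
    <citedby-count>0</citedby-count>
  </coredata>
</abstracts-retrieval-response>'
doc <- read_xml(xml_txt)

# List the namespaces xml2 found; the default one is exposed as "d1".
ns <- xml_ns(doc)
print(ns)

# An unprefixed name in XPath means "no namespace", so nothing matches here.
unqualified <- xml_text(xml_find_first(doc, "//citedby-count"))

# Qualifying the name with the d1 prefix selects the element.
qualified <- xml_text(xml_find_first(doc, "//d1:citedby-count", ns))

# Alternative: strip the default namespace from a fresh copy (modified in
# place), after which the plain name can be used.
doc2 <- read_xml(xml_txt)
xml_ns_strip(doc2)
after_strip <- xml_text(xml_find_first(doc2, "//citedby-count"))
```

If this assumption holds for your file, prefixed tags such as prism:doi keep working either way, because their prefix is already declared in the document.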

I could not solve the problem the way I intended. In the end I worked around it by converting the document with xml2::as_list and selecting the element from the resulting list:

intermediateXML <- xml2::read_xml(serverResponse)
listXML <- xml2::as_list(intermediateXML)

listXML$coredata$`citedby-count`[[1]]
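
For readers on a newer xml2, a hedged variant of this workaround: to my knowledge, as_list() keeps the root element as the top-level list entry from xml2 1.2.0 onward, so coredata may sit one level deeper than shown above. The sketch below handles both layouts; xml_txt and core are illustrative names, not part of the original code.

```r
library(xml2)

# Shortened stand-in for the server response used in the question.
xml_txt <- '<abstracts-retrieval-response xmlns="http://www.elsevier.com/xml/svapi/abstract/dtd">
  <coredata>
    <eid>2-s2.0-85011891272</eid>
    <citedby-count>0</citedby-count>
  </coredata>
</abstracts-retrieval-response>'

intermediateXML <- read_xml(xml_txt)
listXML <- as_list(intermediateXML)

# Older xml2 versions start the list at the root's children; newer ones wrap
# everything in an entry named after the root element.
core <- if ("coredata" %in% names(listXML)) {
  listXML$coredata
} else {
  listXML[[1]]$coredata
}

# Backticks or [[ ]] are needed because the name contains a dash.
cited_by <- core[["citedby-count"]][[1]]
print(cited_by)
```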

Thanks a lot @Floo0

Arriving late to this thread. Here is a solution I found that may be helpful to others:

doc %>% xml_find_all( "//*[name()='my-dash-tag']" )
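
Applied to the tag from the question, the same idea might look like the sketch below. It is a minimal example on a shortened copy of the question's XML, with illustrative variable names. The XPath functions name() and local-name() compare the tag name as text, so the match does not depend on how namespaces are registered.

```r
library(xml2)

# Shortened copy of the question's document, including its default namespace.
xml_txt <- '<abstracts-retrieval-response xmlns="http://www.elsevier.com/xml/svapi/abstract/dtd" xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
  <coredata>
    <prism:doi>10.1186/s13638-017-0812-8</prism:doi>
    <citedby-count>0</citedby-count>
  </coredata>
</abstracts-retrieval-response>'
doc <- read_xml(xml_txt)

# name() compares against the tag name as written in the file.
by_name <- xml_text(xml_find_first(doc, "//*[name()='citedby-count']"))

# local-name() compares only the part after any prefix, so it also finds
# prefixed tags such as <prism:doi>.
by_local <- xml_text(xml_find_first(doc, "//*[local-name()='citedby-count']"))
doi <- xml_text(xml_find_first(doc, "//*[local-name()='doi']"))
```

The trade-off is that local-name() ignores namespaces entirely, so it can match same-named tags from different vocabularies.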
