Get elements by tag name in XML parsing, excluding children of some parents

I have an XML file that I am parsing. Some tag names occur multiple times, under different parents. I know which parent's children I want to ignore. How can I do that?

 <sub-article id="S01" article-type="translation" xml:lang="pt">
  <front-stub>
     <article-categories>
        <subj-group subj-group-type="heading">
           <subject>Artigos Originais</subject>
        </subj-group>
     </article-categories>
     <title-group>
        <article-title>
           Prevalência de deficiência nutricional em pacientes com
            tuberculose pulmonar
           <xref ref-type="fn" rid="fn02">*</xref>
        </article-title>
     </title-group>
   </front-stub>
 </sub-article>
    .....
    .....
 <article-meta>
     <article-id pub-id-type="pmid">24068270</article-id>
     <article-id pub-id-type="pmc">4075858</article-id>
     <article-id pub-id-type="publisher-id">S1806-37132013000400012</article-id>
     <article-id pub-id-type="doi">10.1590/S1806-37132013000400012</article-id>
     <article-categories>
        <subj-group subj-group-type="heading">
           <subject>Original Articles</subject>
        </subj-group>
     </article-categories>
     <title-group>
        <article-title>
           Prevalence of nutritional deficiency in patients with
           pulmonary tuberculosis
           <xref ref-type="fn" rid="fn01">*</xref>
        </article-title>
     </title-group>
 </article-meta>

In this example, I don't want to process the children of the sub-article tag. So "article-title" should be processed only for "Prevalence of nutritional deficiency in patients with pulmonary tuberculosis", not for "Prevalência de deficiência nutricional em pacientes com tuberculose pulmonar".

I am currently using the following code, which returns all the nodes named "title-group". How can I make it more specific, so that I don't get nodes from a certain parent?

NodeList titleNodeList = document.getElementsByTagName("title-group");

Search for the "title-group" nodes under "sub-article" nodes. That gives you exactly the nodes to leave out of the full list:

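// Needs java.util.{ArrayList, List} and org.w3c.dom.{Element, Node, NodeList}.
List<Node> allTitleGroupNodes = new ArrayList<>();
NodeList subArticleNodes = document.getElementsByTagName("sub-article");
for (int i = 0; i < subArticleNodes.getLength(); i++) {
    // getElementsByTagName is declared on Element, not Node, so a cast is needed.
    Element subArticle = (Element) subArticleNodes.item(i);
    NodeList titleNodes = subArticle.getElementsByTagName("title-group");
    for (int j = 0; j < titleNodes.getLength(); j++) {
        allTitleGroupNodes.add(titleNodes.item(j));
    }
}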
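
The loop above gathers the "title-group" nodes that sit under "sub-article". To go the other way and keep only the ones outside it, one option is to walk each candidate's parent chain and drop it if a "sub-article" ancestor turns up. The following is a minimal sketch of that idea, not a definitive implementation; the class and method names (TitleGroupFilter, hasAncestor, elementsOutside, parse) and the shortened sample XML are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TitleGroupFilter {

    // Returns true if any ancestor of the node has the given tag name.
    static boolean hasAncestor(Node node, String ancestorName) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (ancestorName.equals(p.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    // Collects every <tagName> element that is not nested inside <excludedAncestor>.
    static List<Node> elementsOutside(Document document, String tagName, String excludedAncestor) {
        List<Node> kept = new ArrayList<>();
        NodeList all = document.getElementsByTagName(tagName);
        for (int i = 0; i < all.getLength(); i++) {
            Node candidate = all.item(i);
            if (!hasAncestor(candidate, excludedAncestor)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Hypothetical helper for the demo: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the XML in the question (assumed root: <article>).
        String xml = "<article>"
                + "<article-meta><title-group><article-title>English title</article-title></title-group></article-meta>"
                + "<sub-article><front-stub><title-group><article-title>Titulo</article-title></title-group></front-stub></sub-article>"
                + "</article>";
        for (Node n : elementsOutside(parse(xml), "title-group", "sub-article")) {
            System.out.println(n.getTextContent());
        }
    }
}
```

With the shortened sample, only the "title-group" under "article-meta" is kept. The check runs once per candidate and walks up to the document root, which should be fine for documents of ordinary depth.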

(Aside: The horrible interface of NodeList is one of the things I hate most about processing XML in standard Java.)

There are two ways to achieve this using XPath:

  1. Inclusion: select only the elements under <article-meta>
  2. Exclusion: skip the elements under <sub-article>

Personally, I prefer the first one, since it is more explicit and keeps working when other XML files contain additional elements.

Solution 1 Inclusion

Use XPath to select elements only if they're under <article-meta>:

//article-meta//title-group

Java:

XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xPath.compile("//article-meta//title-group");
NodeList titleNodes = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
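
For anyone who wants to try the inclusion expression end to end, here is a small self-contained sketch that parses an XML string and evaluates an XPath expression against it. The class name InclusionDemo, the helper select, and the shortened sample XML are assumptions for the demo; only the XPath expression comes from the answer.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InclusionDemo {

    // Hypothetical helper: parses the XML string and returns the nodes matched by the expression.
    static NodeList select(String xml, String expression) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, document, XPathConstants.NODESET);
        } catch (Exception e) {
            // Wraps parser and XPath checked exceptions to keep the demo short.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the XML in the question (assumed root: <article>).
        String xml = "<article>"
                + "<sub-article><front-stub><title-group><article-title>Titulo</article-title></title-group></front-stub></sub-article>"
                + "<article-meta><title-group><article-title>English title</article-title></title-group></article-meta>"
                + "</article>";
        NodeList titleNodes = select(xml, "//article-meta//title-group");
        for (int i = 0; i < titleNodes.getLength(); i++) {
            System.out.println(titleNodes.item(i).getTextContent());
        }
    }
}
```

Because the expression names the wanted parent, the "title-group" under "sub-article" never enters the result, wherever "sub-article" appears in the document.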

Solution 2 Exclusion

Use XPath to exclude elements if they're under <sub-article>. I assume that the XML root element is <article> (please adjust the code if that's not the case):

/article/*[not(self::sub-article)]//title-group

Java:

XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xPath.compile("/article/*[not(self::sub-article)]//title-group");
NodeList titleNodes = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
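
A variant of the exclusion approach that does not depend on the root element name uses the XPath ancestor axis: //title-group[not(ancestor::sub-article)]. This expression is not part of the answer above; it is offered as a sketch. The class AncestorAxisDemo, the helper count, and the sample layout (with a hypothetical <back> wrapper around <sub-article>) are assumptions chosen to show where the two expressions can differ.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AncestorAxisDemo {

    // Hypothetical helper: counts the nodes an XPath expression selects in the XML string.
    static int count(String xml, String expression) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed layout: <sub-article> is nested one level down, inside <back>.
        String xml = "<article>"
                + "<front><article-meta><title-group/></article-meta></front>"
                + "<back><sub-article><front-stub><title-group/></front-stub></sub-article></back>"
                + "</article>";

        // Filters on any sub-article ancestor, at any depth.
        System.out.println(count(xml, "//title-group[not(ancestor::sub-article)]"));

        // Filters only direct children of <article>, so the nested sub-article slips through.
        System.out.println(count(xml, "/article/*[not(self::sub-article)]//title-group"));
    }
}
```

When <sub-article> is a direct child of <article>, as the answer assumes, both expressions select the same nodes. The ancestor-axis form only matters if the nesting differs from that assumption.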
