Parsing XML getting comment(s) and date value(s) only

Question

Hey all, I am trying to read an XML file and gather only the elements whose value is a date formatted like YYYY-MM-DD.

Here is an online example: https://repl.it/repls/MedicalIgnorantEfficiency

Here is an example of my xml to parse:

<?xml version="1.0" encoding="UTF-8"?>
<ncc:Message xmlns:ncc="http://blank/1.0.6" 
xmlns:cs="http://blank/1.0.0" 
xmlns:jx="http://blank/1.0.0"
xmlns:jm="http://blank/1.0.0"
xmlns:n-p="http://blank/1.0.0"
xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://blank/1.0.6/person person.xsd">
    <ncc:DataSection>
        <ncc:PersonResponse>
            <!-- Message -->
            <cs:CText cs:type="No">NO WANT</cs:CText>
            <jm:CaseID>
                <!-- OEA -->
                <jm:ID>ABC123</jm:ID>
            </jm:CaseID>
            <jx:PersonName>
                <!-- NAM -->
                <jx:GivenName>Arugula</jx:GivenName>
                <jx:MiddleName>Pibb</jx:MiddleName>
                <jx:SurName>Atari</jx:SurName>
            </jx:PersonName>
            <!-- DOB -->
            <ncc:PersonBirthDateText>1948-05-11</ncc:PersonBirthDateText>
            <jx:PersonDetails>
                <!-- SXC -->
                <jx:PersonSSN>
                    <jx:ID/>
                </jx:PersonSSN>
            </jx:PersonDetails>
            <n-p:Activity>
                <!--DOZ-->
                <jx:ActivityDate>1996-04-04</jx:ActivityDate>
                <jx:HomeAgency xsi:type="cs:Organization">
                    <!-- ART -->
                    <jx:Organization>
                        <jx:ID>ZR5981034</jx:ID>
                    </jx:Organization>
                </jx:HomeAgency>
            </n-p:Activity>
            <jx:PersonName>
                <!-- DOB Newest -->
                <ncc:BirthDateText>1993-05-12</ncc:BirthDateText>
                <ncc:BirthDateText>1993-05-13</ncc:BirthDateText>
                <ncc:BirthDateText>1993-05-14</ncc:BirthDateText>
                <jx:IDDetails xsi:type="cs:IDDetails">
                    <!-- SMC Checker -->
                    <jx:SSNID>
                        <jx:ID/>
                    </jx:SSNID>
                </jx:IDDetails>
            </jx:PersonName>
        </ncc:PersonResponse>
    </ncc:DataSection>
</ncc:Message>

I want to get the date value(s) and the comment above each of them. For the example XML above, the result would look something like this:

Comment: <!-- DOB --> (ncc:DataSection/ncc:PersonResponse)

Date: 1948-05-11 (ncc:DataSection/ncc:PersonResponse/ncc:PersonBirthDateText)

.

Comment: <!--DOZ--> (ncc:DataSection/ncc:PersonResponse/n-p:Activity)

Date: 1996-04-04 (ncc:DataSection/ncc:PersonResponse/n-p:Activity/jx:ActivityDate)

.

Comment: <!-- DOB Newest --> (ncc:DataSection/ncc:PersonResponse/jx:PersonName)

Date:

1993-05-12 (ncc:DataSection/ncc:PersonResponse/jx:PersonName/ncc:BirthDateText)
1993-05-13 (ncc:DataSection/ncc:PersonResponse/jx:PersonName/ncc:BirthDateText)
1993-05-14 (ncc:DataSection/ncc:PersonResponse/jx:PersonName/ncc:BirthDateText)

The code I am trying to do this with is:

public static void xpathNodes() throws ParserConfigurationException, SAXException, IOException, XPathExpressionException {
    File file = new File(base_);
    XPath xPath = XPathFactory.newInstance().newXPath();
    //String expression = "//*[not(*)]";
    String expression = "([0-9]{4})-([0-9]{2})-([0-9]{2})";
    DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = builderFactory.newDocumentBuilder();
    Document document = builder.parse(file);
    document.getDocumentElement().normalize();
    NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(document, XPathConstants.NODESET);

    for (int i = 0; i < nodeList.getLength(); i++) {
        System.out.println(getXPath(nodeList.item(i)));
    }
}

private static String getXPath(Node node) {
    Node parent = node.getParentNode();

    if (parent == null) {
        return node.getNodeName();
    }

    return getXPath(parent) + "/" + node.getNodeName();
}

public static void main(String[] args) throws Exception {
    xpathNodes();
}

I know the regex (([0-9]{4})-([0-9]{2})-([0-9]{2})) works, as I have used it in Notepad++ and it finds the dates in the opened XML file just fine.

I am currently getting the error:

Exception in thread "main" javax.xml.transform.TransformerException: A location path was expected, but the following token was encountered: [

This doesn't even take the comments into consideration yet.

Any help would be great!

Answer 1

You have supplied a regular expression to an API that expects an XPath expression.

You can use regular expressions with XPath, but you will need a processor that supports XPath 2.0 or later (for example Saxon). The XPath processor that comes with the JDK still only supports the ancient XPath 1.0 standard, which has no regex support.

Even with such a processor you can't supply a regex directly to xPath.compile(), but you can supply an XPath expression of the form //*[matches(., '--my regex--')].

If you do decide to go down the Saxon route, I would recommend using Saxon's internal tree model rather than DOM, as this executes XPath typically five to ten times faster than DOM.
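
If adding Saxon is not an option, one alternative (not what the answer above proposes, just a sketch) is to keep the JDK's XPath 1.0 engine for selecting candidate elements and apply the regex in plain Java. The class name DateElementFinder, its helper methods, and the sample XML below are made up for illustration, and treating "the nearest preceding sibling comment" as the comment for a date is an assumption about the document layout.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: XPath 1.0 picks leaf elements, java.util.regex filters them.
final class DateElementFinder {
    // Same shape as the regex in the question; matches() anchors it to the whole value.
    private static final Pattern DATE = Pattern.compile("[0-9]{4}-[0-9]{2}-[0-9]{2}");

    /** Returns one "comment | date | path" line per leaf element whose text is a date. */
    static List<String> findDates(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        // Valid XPath 1.0: every element that has no child elements.
        NodeList leaves = (NodeList) xPath.evaluate("//*[not(*)]", document, XPathConstants.NODESET);

        List<String> results = new ArrayList<>();
        for (int i = 0; i < leaves.getLength(); i++) {
            Node leaf = leaves.item(i);
            String text = leaf.getTextContent().trim();
            if (DATE.matcher(text).matches()) {
                results.add(precedingComment(leaf) + " | " + text + " | " + path(leaf));
            }
        }
        return results;
    }

    /** Walks back over the siblings and returns the nearest comment's text, or "" if none. */
    private static String precedingComment(Node node) {
        for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == Node.COMMENT_NODE) {
                return s.getNodeValue().trim();
            }
        }
        return "";
    }

    /** Builds a slash-separated path of node names, stopping below the document node. */
    private static String path(Node node) {
        Node parent = node.getParentNode();
        if (parent == null || parent.getNodeType() == Node.DOCUMENT_NODE) {
            return node.getNodeName();
        }
        return path(parent) + "/" + node.getNodeName();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample, much smaller than the XML in the question.
        String xml = "<r><!-- DOB --><b>1948-05-11</b><id>ABC123</id></r>";
        for (String line : findDates(xml)) {
            System.out.println(line);
        }
    }
}
```

Because the comment lookup keeps walking backwards past other elements, consecutive date elements under one comment (like the three ncc:BirthDateText entries) all report that same comment.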

Answer 2

For an XPath 1.0 expression without regex you could use:

//*[string-length()=10]
   [number(substring(.,1,4))=substring(.,1,4)]
   [substring(.,5,1)='-']
   [number(substring(.,6,2))=substring(.,6,2)]
   [substring(.,8,1)='-']
   [number(substring(.,9,2))=substring(.,9,2)]
|
//*[string-length()=10]
   [number(substring(.,1,4))=substring(.,1,4)]
   [substring(.,5,1)='-']
   [number(substring(.,6,2))=substring(.,6,2)]
   [substring(.,8,1)='-']
   [number(substring(.,9,2))=substring(.,9,2)]
   /preceding-sibling::node()[normalize-space()][1][self::comment()]

Do note: part of the expression is duplicated because you want to select both the elements and the comment nodes. The expression uses the well-known idiom for number testing, number(x) = x, which is false whenever x is not numeric because NaN never equals anything. Finally, because there is no guarantee about how the parser treats whitespace-only text nodes, the normalize-space() predicate is applied before the positional predicate [1].

Edit: enforcing string length.
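
As a rough sketch of how the XPath 1.0 expression above could be run with the XPath API that ships with the JDK: the class name DateXPathDemo, the select method, and the sample XML are assumptions made for illustration. The sample deliberately contains other text so that no container element ends up with a 10-character string value of its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: evaluates the XPath 1.0 expression with the JDK's engine.
final class DateXPathDemo {
    // Elements whose string value looks like NNNN-NN-NN (10 characters).
    private static final String DATE_ELEMENTS =
            "//*[string-length()=10]"
            + "[number(substring(.,1,4))=substring(.,1,4)]"
            + "[substring(.,5,1)='-']"
            + "[number(substring(.,6,2))=substring(.,6,2)]"
            + "[substring(.,8,1)='-']"
            + "[number(substring(.,9,2))=substring(.,9,2)]";

    // Union of those elements and the comment directly preceding each of them.
    private static final String EXPRESSION =
            DATE_ELEMENTS + " | " + DATE_ELEMENTS
            + "/preceding-sibling::node()[normalize-space()][1][self::comment()]";

    /** Returns "Comment: ..." or "Date: ..." for every node the expression selects. */
    static List<String> select(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(EXPRESSION, document, XPathConstants.NODESET);

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node.getNodeType() == Node.COMMENT_NODE) {
                lines.add("Comment: " + node.getNodeValue().trim());
            } else {
                lines.add("Date: " + node.getTextContent());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample; the real input would be the XML from the question.
        String xml = "<r><id>ABC123</id><!-- DOB --><b>1948-05-11</b></r>";
        for (String line : select(String.valueOf(xml))) {
            System.out.println(line);
        }
    }
}
```

Unlike a backwards walk over all siblings, this expression only returns a comment when it is the nearest non-whitespace node before a date element, so the second and third of several consecutive date elements contribute no extra comment.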
