
Validate and remove any extraneous closing tags in XML in Java

Example:

<Module name="IOWData">
                *</VERSION>*
                <ACQ>           PAR     </ACQ>
                <RECON>         PUP     </RECON>
            <Group name="PAR">
                <HEALTHSTATUS>          OK      </HEALTHSTATUS>
            </Group>
</Module>

I want to remove any extraneous closing tag, i.e. a closing tag that was never opened in the XML (as shown in the example: the VERSION tag).

Note: it can be any tag, anywhere in the XML. Also, the XML is huge, so I don't want to load the whole document into memory.

I have the following ideas:

  1. Regex: maybe a regular expression can solve this, but I need help with how to check a closing tag's name against the matching opening tag.

  2. Using an XSD. But how?

I hope that's clear. I'm looking for an efficient solution. Thanks!

First, don't call it XML. It isn't XML. If you start by calling it non-XML, that will help to establish the mindset that tools designed for processing XML aren't going to be any use to you.

Given that you have to parse a language that isn't XML, and that no parser for that language currently exists, you're going to have to learn about writing parsers[*]. It's a topic that is covered in every computer science course and in any compiler textbook, but it's not something to attempt until you have read a bit about the theory.

Once you know how to start writing a parser, the best thing to do is to write down the BNF of the grammar you want to parse, which is basically the XML grammar plus the option of stray end-tags. You will have a lexical analyzer which identifies the tags (including the strays) and pushes them across to a syntax analyzer, which can do the job of matching tag names (although this is technically, in the jargon of compiler writing, semantics rather than syntax). Then you just have to identify the strays and drop them from the event stream passed to the next stage of processing, which can be a standard SAX ContentHandler.
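As a rough illustration of the lexer-plus-stack idea, here is a minimal Java sketch. The class name StrayEndTagFilter and its helper methods are hypothetical, made up for this example. It is not a full parser for the grammar described above: it assumes that stray end tags are the only defect in the input (a missing end tag would cause later, legitimate end tags to be dropped too), it does not handle a DOCTYPE with an internal subset, and it writes the cleaned text to a Writer rather than sending events to a SAX ContentHandler. It reads one character at a time, so only the current tag, comment or CDATA section is held in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayDeque;
import java.util.Deque;

class StrayEndTagFilter {

    /** Copies in to out, dropping every end tag that does not match the innermost open element. */
    static void filter(Reader in, Writer out) throws IOException {
        Deque<String> open = new ArrayDeque<>(); // names of currently open elements
        int c;
        while ((c = in.read()) != -1) {
            if (c != '<') {            // character data: pass through unchanged
                out.write(c);
                continue;
            }
            String markup = readMarkup(in);
            if (!markup.endsWith(">")) {       // input ended inside markup: copy as-is
                out.write(markup);
            } else if (markup.startsWith("</")) {
                String name = markup.substring(2, markup.length() - 1).trim();
                if (name.equals(open.peek())) { // matches the innermost open element
                    open.pop();
                    out.write(markup);
                }                               // otherwise it is a stray: drop it
            } else {
                boolean startTag = !markup.startsWith("<!") && !markup.startsWith("<?")
                        && !markup.endsWith("/>");
                if (startTag) {
                    open.push(tagName(markup));
                }
                out.write(markup);
            }
        }
    }

    /** Reads one tag, comment, CDATA section or processing instruction; the '<' is already consumed. */
    private static String readMarkup(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder("<");
        int quote = 0; // quote character of the attribute value we are inside, or 0
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
            if (c == '"' || c == '\'') {
                if (!isSpecial(sb)) {          // quotes only matter inside ordinary tags
                    if (quote == 0) quote = c; else if (quote == c) quote = 0;
                }
            } else if (c == '>' && quote == 0 && isComplete(sb)) {
                break;
            }
        }
        return sb.toString();
    }

    private static boolean isSpecial(StringBuilder sb) {
        return starts(sb, "<!--") || starts(sb, "<![CDATA[") || starts(sb, "<?");
    }

    /** Called when sb ends with '>': true if that '>' really terminates the construct. */
    private static boolean isComplete(StringBuilder sb) {
        if (starts(sb, "<!--")) return sb.length() >= 7 && ends(sb, "-->");
        if (starts(sb, "<![CDATA[")) return ends(sb, "]]>");
        if (starts(sb, "<?")) return sb.length() >= 4 && ends(sb, "?>");
        return true;
    }

    private static boolean starts(StringBuilder sb, String p) {
        return sb.length() >= p.length() && sb.substring(0, p.length()).equals(p);
    }

    private static boolean ends(StringBuilder sb, String s) {
        return sb.length() >= s.length() && sb.substring(sb.length() - s.length()).equals(s);
    }

    /** Extracts the element name from a start tag such as <Group name="PAR">. */
    private static String tagName(String startTag) {
        int end = 1;
        while (end < startTag.length() && !Character.isWhitespace(startTag.charAt(end))
                && startTag.charAt(end) != '>' && startTag.charAt(end) != '/') {
            end++;
        }
        return startTag.substring(1, end);
    }

    public static void main(String[] args) throws IOException {
        String input = "<Module name=\"IOWData\"></VERSION><ACQ> PAR </ACQ></Module>";
        StringWriter out = new StringWriter();
        filter(new StringReader(input), out);
        System.out.println(out); // the stray </VERSION> is gone, everything else is unchanged
    }
}
```

For a large file you would wrap a FileReader and FileWriter in BufferedReader and BufferedWriter before calling filter, or pipe the cleaned output into a real XML parser. Treat this as a starting point only; the matching rule (compare each end tag with the top of the stack) is the simplest possible policy and says nothing about entities, namespaces or encodings.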

I hope that gives you an accurate feel for the size of the mountain you are wanting to climb.

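[*] I guessed that you don't know much about this from the fact that you thought regular expressions might do the job; matching end tags against arbitrarily nested start tags is beyond what a regular expression can express.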
