
Correcting XML closing tags

I have multiple XML docs that have been accidentally malformed by leaving out the "/" from the closing tags. The tags are all matched pairs, so we have <tagname> content <tagname> and so on in each doc. There is a hierarchy in the docs, so we do have tags inside other tags (all opened and closed the same way). The documents would be well-formed if the "/" were in the closing tags.

The question: What would be a reliable and 'easy' way to insert the "/" into the closing tags?

I'm comfortable working with Python (3), VB, VBA, C#, SQL, REGEX and so on. I'm hoping someone might already have encountered this scenario and has a REGEX that could be used.

There are approximately 2000 XML docs, all stored in a LONGTEXT field in a MySQL (8) database (InnoDB tables).

Any help or guidance greatly appreciated.

The Frog

There is no reliable and easy way of doing this in the general case. It needs a full recursive parser (one capable of handling a recursive grammar) and is beyond the capabilities of regular expressions.

If you know that the <tagname> element cannot be nested inside another <tagname>, then when you encounter a <tagname> within another <tagname> you know it must be wrong, and should have been </tagname>. It might be possible to use a SAX parser, detect the nested startElement event, and substitute an endElement event. The SAX parser will eventually fail when it hits end of document, but by then you might have all the information you need.
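For the non-nested case, here is a minimal Python sketch of that idea. It swaps the SAX route for a plain regex tokenizer plus a stack of open element names, which is easier to run against broken input. The names `START_TAG` and `repair_closing_tags` are made up for illustration, and the sketch assumes that no element ever directly contains a child of the same name, that closing tags carry no attributes, and that no ">" appears inside attribute values, comments or CDATA sections.

```python
import re

# Start-tag-like tokens only: <name> or <name attr="...">.
# Real end tags, comments, CDATA and <?xml ...?> begin with other characters
# after "<", so this pattern never matches them.
START_TAG = re.compile(r"<([A-Za-z_][\w.:-]*)([^<>]*)>")


def repair_closing_tags(text):
    """Turn the second <name> of each pair into </name> (assumption-based)."""
    open_names = []  # stack of element names that are currently open

    def fix(match):
        name, rest = match.group(1), match.group(2)
        if rest.rstrip().endswith("/"):
            return match.group(0)  # self-closing tag such as <br/>: leave alone
        # A bare <name> matching the innermost open element is taken to be
        # the malformed closing tag for that element.
        if not rest.strip() and open_names and open_names[-1] == name:
            open_names.pop()
            return f"</{name}>"
        open_names.append(name)
        return match.group(0)

    repaired = START_TAG.sub(fix, text)
    if open_names:
        # Something did not pair up; refuse to guess.
        raise ValueError(f"unbalanced tags left open: {open_names}")
    return repaired


print(repair_closing_tags("<a><b>x<b><c>y<c><a>"))
```

It would be prudent to parse each repaired document with a real XML parser (for example `xml.etree.ElementTree.fromstring`) before writing it back to the database, and to set aside any document that raises an error for manual review.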

If tags can be nested, then the problem becomes an order of magnitude harder, because you now need lookahead to know which start tags should have been end tags -- and even then, there's going to be an element of guesswork involved.
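To see where the guesswork comes from, the following sketch brute-forces every way of labelling each tag as opening or closing and keeps only the labellings that a real parser accepts. It is an illustration of the ambiguity, not a practical repair tool: the cost doubles with every tag, and the helper name `candidate_repairs` is hypothetical. It assumes tags without attributes.

```python
import itertools
import re
import xml.etree.ElementTree as ET

# Attribute-free tags only, to keep the illustration small.
TAG = re.compile(r"<([A-Za-z_][\w.:-]*)>")


def candidate_repairs(text):
    """Return every open/close labelling of the tags that parses as XML."""
    matches = list(TAG.finditer(text))
    results = []
    # Try all 2**n combinations: False = keep as start tag, True = end tag.
    for flags in itertools.product((False, True), repeat=len(matches)):
        parts, pos = [], 0
        for m, closing in zip(matches, flags):
            parts.append(text[pos:m.start()])
            parts.append(f"</{m.group(1)}>" if closing else m.group(0))
            pos = m.end()
        parts.append(text[pos:])
        candidate = "".join(parts)
        try:
            ET.fromstring(candidate)  # accepts only well-formed documents
        except ET.ParseError:
            continue
        results.append(candidate)
    return results


for repair in candidate_repairs("<a><a><a><a><a><a>"):
    print(repair)
```

For six self-nestable <a> tags, two different documents survive: one with three levels of nesting and one with two sibling children. Nothing in the damaged text says which was intended, so any automatic choice between them is a guess.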
