
Correcting XML closing tags

I have multiple XML docs that have been accidentally malformed by leaving out the "/" from the closing tags. The tags are all matched pairs, so we have <tagname> content <tagname> and so on in each doc. There is a hierarchy in the docs, so we do have tags inside other tags (all opened and closed the same way). The documents would be well-formed if the "/" were in the closing tags.

The question: What would be a reliable and 'easy' way to insert the "/" into the closing tags?

I'm comfortable working with Python (3), VB, VBA, C#, SQL, REGEX and so on. I'm hoping someone might already have encountered this scenario and has a REGEX that could be used.

There are approximately 2000 XML docs, all stored in a LONGTEXT field in a MySQL (8) database (InnoDB tables).

Any help or guidance greatly appreciated.

The Frog

There is no reliable and easy way of doing this in the general case. It needs a full recursive parser (one capable of handling a recursive grammar) and is beyond the capabilities of regular expressions.

If you know that the <tagname> element cannot be nested inside itself, then when you encounter a <tagname> within another <tagname> you know it must be wrong, and should have been </tagname>. It might be possible to use a SAX parser, detect the nested startElement event, and substitute an endElement event. The SAX parser will eventually fail when it hits end of document, but by then you might have all the information you need.
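For the non-nested case, here is a minimal Python sketch. It is not the SAX approach described above but a simpler text scan built on the same idea: keep a stack of open element names, and when a tag repeats the name of the innermost open element, treat it as the missing closing tag. The function name `repair_closing_tags` and the sample document are made up for illustration. It assumes that no element ever has a direct child with its own name, that closing tags carry no attributes, and that tag-like text does not appear inside comments, CDATA sections, or attribute values.

```python
import re
import xml.etree.ElementTree as ET

# Matches start-style tags such as <name> or <name attr="v">.
# It does not match </name>, <name/>, <?xml ...?>, <!-- ... --> or <!DOCTYPE ...>.
TAG_RE = re.compile(r"<([A-Za-z_][\w.:\-]*)(\s[^<>]*)?>")


def repair_closing_tags(text):
    """Turn <name> into </name> when it repeats the innermost open element's name.

    Assumption (hypothetical helper): no element has a direct child of the same name.
    """
    open_names = []  # stack of currently open element names
    pieces = []
    pos = 0
    for match in TAG_RE.finditer(text):
        name = match.group(1)
        attrs = match.group(2) or ""
        pieces.append(text[pos:match.start()])
        pos = match.end()
        if attrs.rstrip().endswith("/"):
            # Self-closing tag such as <br />: copy it through unchanged.
            pieces.append(match.group(0))
        elif open_names and open_names[-1] == name and not attrs.strip():
            # Same name as the innermost open element: this is the broken closing tag.
            open_names.pop()
            pieces.append("</" + name + ">")
        else:
            # Anything else is taken to be a genuine opening tag.
            open_names.append(name)
            pieces.append(match.group(0))
    pieces.append(text[pos:])
    return "".join(pieces)


broken = "<doc><item>first<item><item>second<item><doc>"
repaired = repair_closing_tags(broken)
# ET.fromstring raises ParseError if the repaired text is still not well-formed.
root = ET.fromstring(repaired)
print(repaired)
```

Parsing every repaired document with a real XML parser afterwards, as the last lines do, is a cheap way to catch documents that break the assumptions. A document that parses is well-formed, but that alone does not prove the structure is the one originally intended.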

If tags can be nested, then the problem becomes an order of magnitude harder, because you now need lookahead to know which start tags should have been end tags, and even then there is going to be an element of guesswork involved.
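The guesswork can be seen in a small constructed example (the element names `r` and `a` and the helper `drop_slashes` are hypothetical). Two different well-formed documents collapse to exactly the same broken text once the "/" is removed from their closing tags, so no algorithm can tell from the broken text alone which one was meant:

```python
import xml.etree.ElementTree as ET


def drop_slashes(xml_text):
    """Simulate the damage: remove the '/' from every closing tag."""
    return xml_text.replace("</", "<")


nested = "<r><a><a></a></a></r>"    # one <a> containing another <a>
siblings = "<r><a></a><a></a></r>"  # two sibling <a> elements

# Both originals are well-formed XML with different structure...
nested_root = ET.fromstring(nested)
siblings_root = ET.fromstring(siblings)

# ...yet they produce the same broken document.
broken_from_nested = drop_slashes(nested)
broken_from_siblings = drop_slashes(siblings)
print(broken_from_nested == broken_from_siblings)
```

If a schema or other knowledge of the documents rules out one of the readings (for example, that an element never contains itself), the ambiguity disappears, which is why the non-nested case is so much easier.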
