Automatically correct invalid XML?
I am currently using SSIS on a project where I need to verify that XML files have the correct structure. In particular, I have to check whether a tag is missing from the XML file and, if so, retrieve the line that has no tag. Here is an example to make this clearer:
<?xml version="1.0"?>
<catalog>
<DATA>0000000061E82D821590010000409525CD</DATA>
<DATA>0000000061E82D8C163001000140AD0DF6</DATA>
<DATA>0000000061E82D9616E301000240776CAB</DATA>
<DATA> 0000000061E82DA0178001000340C56B6</DATA>
<DATA>0000000061E82DAA188001000440C0C7CB</DATA>
0000000061E82DDAEA4001000540BB9A276
</catalog>
For example, in the above XML there is a <DATA> tag missing. I have no influence on the creation of the XML. How can I detect that a <DATA> tag is missing (the number of data lines is not fixed), and then retrieve the line that has no tag?
The solution can be a set of SSIS components or a C# script.
It is impossible to automatically correct invalid XML in the general case.
Terminology correction
"For example in the above XML there is a <DATA> tag missing."
There is not a <DATA> tag missing. You probably mean that there are supposed to be begin and end DATA tags surrounding 0000000061E82DDAEA4001000540BB9A276. The difference is significant because if there were only a single tag missing, the "XML" would not be well-formed. With both tags absent, the document is still well-formed; but if a schema says that a catalog element may only have DATA children, then the XML is not valid.
See Well-formed vs Valid XML for a detailed description of this important distinction.
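The distinction can be illustrated with a minimal sketch. It is written in Python purely for brevity (the question asks about SSIS or C#, where the same two checks would be done with that platform's XML parser), and the helper names is_well_formed and stray_text are hypothetical. It assumes the only rule being checked is "a catalog element may contain DATA children and nothing else"; it detects the problem but deliberately does not try to repair it.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample from the question: the last data line
# has neither a start nor an end DATA tag.
SAMPLE = """<?xml version="1.0"?>
<catalog>
<DATA>0000000061E82D821590010000409525CD</DATA>
<DATA>0000000061E82DAA188001000440C0C7CB</DATA>
0000000061E82DDAEA4001000540BB9A276
</catalog>"""

# Same document with only ONE tag removed (the first </DATA>).
ONE_TAG_MISSING = SAMPLE.replace("</DATA>", "", 1)


def is_well_formed(xml_text):
    """Return True if the text parses as XML at all."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def stray_text(xml_text):
    """Return non-whitespace text that sits directly inside the root
    element rather than inside one of its child elements."""
    root = ET.fromstring(xml_text)
    # Text directly under the root is stored either in root.text
    # (before the first child) or in a child's .tail (after that child).
    chunks = [root.text] + [child.tail for child in root]
    return [c.strip() for c in chunks if c and c.strip()]


if __name__ == "__main__":
    print(is_well_formed(ONE_TAG_MISSING))  # a single missing tag breaks parsing
    print(is_well_formed(SAMPLE))           # both tags missing still parses
    print(stray_text(SAMPLE))               # but leaves text outside any DATA element
```

A non-empty result from stray_text means the document violates the assumed "DATA children only" rule, which is a reason to reject the file rather than a cue to patch it.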
Don't try to automatically correct invalid XML
Best practice is to reject the input and force the sender/creator to fix the document. The entire raison d'être for a schema is to express the invariants that can be relied upon to process the data. Violating those invariants means all bets are off.
Don't be seduced by the superficial simplicity of peep-hole repair ideas
Every repair idea implies an assumption about the data that is not expressed in the schema, which would be bad because: