
Encoding Special Characters Before Processing Invalid XML

I have some invalid XML from a vendor that I need to process. Here is an example:

<a>foo</a>
<b>bar</b>
<c>foobar is < $15</c>

So, we have a few problems. First, there is no root element. I overcome that by adding one. No problem. The second, and more difficult, problem is the less-than symbol. I can just encode the whole thing, but that will also encode the XML tags. Is there a library or simple method out there for handling this? I really don't want to reinvent the wheel, as I'm sure hundreds of people have dealt with "quasi-XML" like this. I'd appreciate any help.

I would read the file line by line and use a regex to get the values between the nodes. Your example doesn't have nested elements, so this is pretty easy. While reading line by line, you can encode the inner values. The named capture group (?<xml>.*?) captures everything between the tags into the group named xml.

// Requires: using System.Text.RegularExpressions;
var regex = "<.*?>(?<xml>.*?)</.*?>";
var badXML = Regex.Match(line, regex, RegexOptions.IgnoreCase).Groups["xml"].Value;
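
To make the whole round trip concrete, here is a minimal sketch of the line-by-line approach, assuming the input stays as flat as the example (one non-nested element per line, no attributes). The class name QuasiXml, the method names, and the wrapper element name "root" are made up for illustration; SecurityElement.Escape and XDocument.Parse are the standard .NET calls. It is one possible way to do this, not a general repair tool.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class QuasiXml
{
    // One flat element per line: <name>value</name>. The backreference \k<tag>
    // requires the closing tag to carry the same name as the opening tag.
    private static readonly Regex Element =
        new Regex(@"^\s*<(?<tag>[A-Za-z_][\w.\-]*)>(?<xml>.*)</\k<tag>>\s*$");

    // Escapes the text between the tags and leaves the tags themselves alone.
    public static string EncodeLine(string line)
    {
        Match m = Element.Match(line);
        if (!m.Success)
        {
            return line; // not a simple element line; pass it through unchanged
        }

        string tag = m.Groups["tag"].Value;
        // Assumption: the vendor never pre-escapes values. An existing "&amp;"
        // would be escaped a second time here.
        string value = SecurityElement.Escape(m.Groups["xml"].Value);
        return "<" + tag + ">" + value + "</" + tag + ">";
    }

    // Wraps the repaired lines in a root element (hypothetical name "root")
    // and hands the result to a real XML parser.
    public static XDocument Parse(IEnumerable<string> lines)
    {
        string body = string.Join("\n", lines.Select(EncodeLine));
        return XDocument.Parse("<root>\n" + body + "\n</root>");
    }
}
```

With the sample input, EncodeLine should turn the third line into <c>foobar is &lt; $15</c>, and reading the c element back from the parsed document should give the original text with the bare less-than sign. Anchoring the pattern to the whole line and matching the closing tag by name makes the sketch stricter than the looser pattern above, so a line that does not fit is left untouched instead of being half-encoded.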
