
Encoding Special Characters Before Processing Invalid XML

I have some invalid XML from a vendor that I need to process. Here is an example:

<a>foo</a>
<b>bar</b>
<c>foobar is < $15</c>

So, we have a few problems. First, there is no root element. I overcome that by wrapping everything in one. No problem. The second, and more difficult, problem is the less-than symbol. I could just encode the whole thing, but that would encode the XML tags as well. Is there a library or simple method out there somewhere for handling this? I really don't want to reinvent the wheel, as I'm sure hundreds of people have dealt with "quasi-XML" like this. Appreciate any help.

I would read the file line by line and use a regex to get the values between the nodes. Your example doesn't have nested elements, so this is pretty easy. While reading line by line, you can replace the inner values with their encoded form. The named capture group (?<xml>.*?) captures everything between the opening and closing tags into the group named xml.

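using System.Text.RegularExpressions;
// "line" is the current line of the file; the group named "xml" holds the text between the tags.
var regex = "<.*?>(?<xml>.*?)</.*?>";
var badXML = Regex.Match(line, regex, RegexOptions.IgnoreCase).Groups["xml"].Value;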
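
To show how the line-by-line idea could fit together end to end, here is a minimal sketch. It is not a complete solution: it assumes every non-empty line holds exactly one flat element with no attributes and no nested elements, as in the example in the question. The names QuasiXml, Repair and rootName are made up for illustration. The sketch escapes the inner text with SecurityElement.Escape, which is my choice rather than something from the answer above, and it will double-escape any text that the vendor has already encoded (for example, an existing &amp; becomes &amp;amp;).

```csharp
using System;
using System.Security;
using System.Text;
using System.Text.RegularExpressions;

public static class QuasiXml
{
    // Assumption: one flat element per line, no attributes, no nesting.
    // The closing tag must repeat the opening tag's name (\k<tag>).
    private static readonly Regex LinePattern = new Regex(
        @"^\s*<(?<tag>[A-Za-z_][\w.\-]*)>(?<xml>.*)</\k<tag>>\s*$");

    // Escapes the text inside each element and wraps all lines in a root element.
    public static string Repair(string input, string rootName = "root")
    {
        var sb = new StringBuilder();
        sb.Append('<').Append(rootName).Append(">\n");

        foreach (var rawLine in input.Split('\n'))
        {
            var line = rawLine.TrimEnd('\r');
            if (line.Trim().Length == 0)
            {
                continue; // skip blank lines
            }

            var match = LinePattern.Match(line);
            if (!match.Success)
            {
                // Fail loudly instead of guessing at lines that break the assumption.
                throw new FormatException("Unexpected line: " + line);
            }

            var tag = match.Groups["tag"].Value;
            // Escapes <, >, &, " and ' in the inner text only; the tags stay intact.
            var escaped = SecurityElement.Escape(match.Groups["xml"].Value);

            sb.Append('<').Append(tag).Append('>')
              .Append(escaped)
              .Append("</").Append(tag).Append(">\n");
        }

        sb.Append("</").Append(rootName).Append('>');
        return sb.ToString();
    }
}
```

Passing the three example lines from the question to QuasiXml.Repair should turn the third line into <c>foobar is &lt; $15</c>, and the result can then be handed to a regular parser such as XDocument.Parse. Anchoring the pattern to the whole line and matching the closing tag by name makes the match stricter than the lazy <.*?> pattern above, so a stray < inside the text is not mistaken for the start of a tag.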
