
Parsing XML in VB.Net is failing due to a special character

I have some VB.Net code which is parsing an XML string.

The XML string comes from a third-party TCP stream, so we have to take the data as we get it and deal with it. The issue is that the data in one of the elements can sometimes contain special characters, e.g. &, $ or <, and when "XMLDoc.LoadXml(XML)" is executed it fails. Note that XMLDoc is declared as "Dim XMLDoc As XmlDocument = New XmlDocument()".

I have tried to Google answers for this but I am really struggling to find a solution. I have looked at using a regex but realised this has some limitations; or I just don't understand it well enough.

If it helps, here is an example of the XML that would be streamed to us (for info, the Message element comes from an SMS message). The only part that will ever have an error, and all I have to check, is the <Message>O&N</Message> section; in this case the message has come in with an &.

<IncomingMessage><DeviceSendTime>19/02/2013 14:00:50</DeviceSendTime>
 <Sender>0000111111</Sender>
 <Status>New</Status>
 <Transport>Sms</Transport>
 <Id>-1</Id>
 <Message>O&N</Message>
 <Timestamp>19/02/2013 14:00:50</Timestamp>
 <ReadTimestamp>19/02/2013 14:00:50</ReadTimestamp>
</IncomingMessage>

If we're looking specifically within Message elements, and assuming there are no nested elements within the Message element:

Dim url = "put url here"
Dim s As String

Dim characterMappings = New Dictionary(Of String, String) From {
    {"&", "&amp;"},
    {"<", "&lt;"},
    {">", "&gt;"},
    {"""", "&quot;"}
}

Using client As New WebClient
    s = client.DownloadString(url)
End Using
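' Rewrite only the text between the Message tags; the tags themselves are kept.
' The inner replace is a single pass, so the "&" of an inserted "&amp;" is never escaped again.
s = Regex.Replace(s,
    "(<Message>)(.*?)(</Message>)",
    Function(match) match.Groups(1).Value &
        Regex.Replace(match.Groups(2).Value,
            String.Join("|", characterMappings.Keys),
            Function(c) characterMappings(c.Value)) &
        match.Groups(3).Value,
    RegexOptions.Singleline
)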
Dim x = XDocument.Parse(s)

$ should not be an issue with XML, but if it is you can add it to the dictionary.

The use of WebClient comes from here.

Updated

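Since $ has special meaning in regular expressions, it cannot simply be added to the dictionary and joined into the pattern; it has to be escaped as \$ (VB.NET string literals do not treat the backslash as an escape character, so a single backslash is enough). The simplest way to do this is to write the pattern by hand instead of joining the dictionary keys:

' Assumes characterMappings also has an entry for "$", e.g. {"$", "&#36;"}; without it the lookup throws KeyNotFoundException.
s = Regex.Replace(s,
    "(<Message>)(.*?)(</Message>)",
    Function(match) match.Groups(1).Value &
        Regex.Replace(match.Groups(2).Value, "&|<|>|\$", Function(c) characterMappings(c.Value)) &
        match.Groups(3).Value,
    RegexOptions.Singleline
)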
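
As an alternative that avoids a hand-maintained dictionary, here is a minimal sketch that hands the escaping to SecurityElement.Escape, which replaces <, >, &, " and ' with their XML entities. The module name MessageEscaper and the function EscapeMessageContent are made up for illustration. The sketch assumes the text inside Message is never already escaped and never contains the literal string "</Message>".

```vbnet
Imports System
Imports System.Security
Imports System.Text.RegularExpressions
Imports System.Xml

Module MessageEscaper

    ' Hypothetical helper: escapes the raw text inside every <Message> element.
    ' Assumes the text between the tags is not already escaped and does not
    ' contain the literal string "</Message>".
    Public Function EscapeMessageContent(rawXml As String) As String
        Return Regex.Replace(rawXml,
            "(<Message>)(.*?)(</Message>)",
            Function(m As Match) m.Groups(1).Value &
                SecurityElement.Escape(m.Groups(2).Value) &
                m.Groups(3).Value,
            RegexOptions.Singleline)
    End Function

    Sub Main()
        ' Shortened version of the sample from the question.
        Dim raw = "<IncomingMessage><Sender>0000111111</Sender><Message>O&N</Message></IncomingMessage>"

        Dim xmlDoc As New XmlDocument()
        xmlDoc.LoadXml(EscapeMessageContent(raw))

        ' InnerText returns the unescaped value again: prints O&N
        Console.WriteLine(xmlDoc.SelectSingleNode("/IncomingMessage/Message").InnerText)
    End Sub

End Module
```

RegexOptions.Singleline is there so that an SMS containing line breaks is still matched as one Message element.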

Also, I highly recommend Expresso for working with regular expressions.

Your XML is invalid, and hence it is not XML. Either fix the code that generates the XML (the correct approach), or treat the data as a plain text file and accept all the problems that come with parsing unstructured text.

As you've stated in the question, <Message>O&N</Message> is not valid XML. The most likely reason for such "XML" is that it was built with string concatenation instead of proper XML manipulation methods. Unless you use some arcane language, practically every language in use has built-in or library support for XML creation, so it should not be too hard to create the XML correctly.
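
To illustrate what "proper XML manipulation methods" buys you, here is a rough sketch of how a sender written in .NET could build the message with XElement, which escapes text content on its own. The module name MessageBuilder and the function BuildIncomingMessage are hypothetical, and only a few of the elements from the question's sample are included; the real third-party sender may not be .NET at all.

```vbnet
Imports System
Imports System.Xml.Linq

Module MessageBuilder

    ' Hypothetical sender-side helper; element names follow the sample in the question.
    Public Function BuildIncomingMessage(sender As String, message As String) As String
        Dim doc As New XElement("IncomingMessage",
            New XElement("Sender", sender),
            New XElement("Transport", "Sms"),
            New XElement("Message", message))

        ' DisableFormatting keeps the output on a single line.
        Return doc.ToString(SaveOptions.DisableFormatting)
    End Function

    Sub Main()
        ' The "&" in the message text is written out as "&amp;" automatically.
        Console.WriteLine(BuildIncomingMessage("0000111111", "O&N"))
    End Sub

End Module
```

Because the escaping happens inside XElement, the receiving side can call LoadXml or XDocument.Parse on the result without any pre-processing.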
