XML Parser gets stuck on special characters, despite encoding

Question

This is the situation as it is:

I'm receiving data from an XML API. This data sometimes contains a special apostrophe character, which causes my parser to crash. This crash only occurs when I read the data from a local file. When I read the data from the stream there is no crash, but I don't get a DOM tree either: it exits without notifying me.

Below you will find a list of attempts we've made to make things work:

// Does not work
var web = new WebClient();
web.Encoding = Encoding.UTF8;
var response = web.DownloadString("http://thetvdb.com/api/apikey/series/" + show.TVDBID + "/");
var tree = XDocument.Parse(response);

// Works
var doc = new XmlDocument();
doc.Load("C:\\Test\\test.xml");
var response = doc.InnerXml;
var tree = XDocument.Parse(response);

// Works
var xmlDoc = XDocument.Parse(File.ReadAllText("c:\\Test\\test.xml", System.Text.Encoding.UTF8));
// Also works (alternative to the line above):
// var xmlDoc = XDocument.Load("C:\\Test\\test.xml");
var tree = xmlDoc;

// Does not work
var web = new WebClient();
web.Encoding = Encoding.UTF8;
web.DownloadFile("http://thetvdb.com/api/apikey/series/" + show.TVDBID + "/", "C:\\test.xml");
var tree = XDocument.Load("C:\\test.xml");

// Does not work
var web = new WebClient();
web.Encoding = Encoding.UTF8;
var data = web.DownloadData("http://thetvdb.com/api/apikey/series/" + show.TVDBID + "/");
var response = Encoding.UTF8.GetString(data);
var tree = XDocument.Parse(response);

I decide whether an attempt works by whether it reaches the breakpoint on the first line of this loop:

if (root != null) { 
     var lastupdate = root.Element("Series").Element("lastupdated").Value;

     foreach (var epi in tree.Descendants("Episode")) {
          var season = epi.Element("SeasonNumber").Value; // Breakpoint here
     }
}

The crashes happen when the parser encounters this apostrophe: [image]

When I replace this character with a manually typed apostrophe or with &#39; , no error is thrown and parsing continues until the next occurrence. When I view the source of the API response in Firefox and Chrome, both report the encoding as UTF-8, and the code examples on the API wiki also show UTF-8 in the header.

This is where I am so far. Any ideas?

I just noticed that, according to the XML/Text/HTML visualizer in the debugger, the result string from my API query contains only a <Series></Series> tag and no <Episode></Episode> tags. However, when I run the same query in my browser, it shows both. Is this possible? When I look at it through Postman, it shows the episodes.

Update:

When I use Unicode as the encoding, I don't receive any warnings and I'm able to parse the local XML file completely! I'm not an encoding expert. Are there any downsides to using Unicode?

When I use Unicode for the stream of data, I get a bunch of Asian characters.

Answer 1

It has to do with the encoding of your data. Downloading the response as raw bytes avoids any implicit decoding, so you can choose the encoding yourself:

WebClient myWebClient = new WebClient();
byte[] data = myWebClient.DownloadData(uri);

string xmlContents = Encoding.UTF8.GetString(data);
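If decoding by hand is a concern, one option is to hand the raw bytes to the XML reader, which detects the encoding from the byte order mark or the XML declaration. This is only a minimal sketch: the XmlBytes helper and the sample payload are made up for illustration and stand in for the bytes returned by DownloadData.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

static class XmlBytes
{
    // Parses raw bytes and leaves encoding detection to the XML reader
    // (byte order mark or the encoding named in the XML declaration).
    public static XDocument Load(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        {
            return XDocument.Load(stream);
        }
    }
}

static class XmlBytesDemo
{
    static void Main()
    {
        // Hypothetical payload standing in for myWebClient.DownloadData(uri).
        byte[] data = Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
            "<Data><Series><SeriesName>It\u2019s</SeriesName></Series></Data>");

        XDocument tree = XmlBytes.Load(data);
        Console.WriteLine(tree.Root.Element("Series").Element("SeriesName").Value);
    }
}
```

With this approach no string is decoded before the parser sees the data, so a mismatch between the declared and the assumed encoding cannot creep in at that step.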

EDIT: Following your most recent developments with Unicode, I would say that the data is actually encoded in UTF-16. Unicode is not an encoding; it is a coded character set, i.e. a set of characters and a mapping between those characters and the integer code points that represent them. In .NET, "encoding something in Unicode" means UTF-16, because Encoding.Unicode is little-endian UTF-16. Anyway, glad that your problem is solved!
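To see why decoding with the wrong encoding produces "a bunch of Asian characters", here is a small sketch. It assumes the payload is really UTF-8 and is decoded as UTF-16 by mistake; the EncodingDemo class and the sample text are hypothetical.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Encodes text as UTF-8, then decodes those bytes with the UTF-16 decoder.
    // Each pair of ASCII bytes collapses into one unrelated UTF-16 code unit.
    public static string DecodeUtf8BytesAsUtf16(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.Unicode.GetString(utf8Bytes);
    }

    static void Main()
    {
        // Encoding.Unicode is the UTF-16 (little-endian) encoding.
        Console.WriteLine(Encoding.Unicode.WebName); // prints "utf-16"

        string garbled = DecodeUtf8BytesAsUtf16("<Data>");
        foreach (char c in garbled)
        {
            // Print code points rather than glyphs so the console font does not matter.
            Console.WriteLine("U+{0:X4}", (int)c);
        }
    }
}
```

The six ASCII bytes of "<Data>" become three code units (U+443C, U+7461, U+3E61), all of which fall in CJK ideograph ranges.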

Answer 2

Try:

var tree = XElement.Parse(response);
foreach(var epi in tree.Descendants("Episode"))
{
   ...
}

If Data is your root node and no Episode elements are nested deeper in the tree, you can replace Descendants with Elements.
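The difference between the two calls can be sketched on a trimmed-down document. The XML below is a hypothetical sample shaped like the response described in the question, not actual API output.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class EpisodeQuery
{
    // Hypothetical sample: a Data root with one Series and two Episode children.
    public static XElement Sample()
    {
        return XElement.Parse(
            "<Data>" +
            "<Series><id>1</id></Series>" +
            "<Episode><SeasonNumber>1</SeasonNumber></Episode>" +
            "<Episode><SeasonNumber>2</SeasonNumber></Episode>" +
            "</Data>");
    }

    static void Main()
    {
        XElement tree = Sample();

        // Elements looks at direct children only.
        Console.WriteLine(tree.Elements("Episode").Count());         // 2
        Console.WriteLine(tree.Elements("SeasonNumber").Count());    // 0

        // Descendants searches the whole subtree.
        Console.WriteLine(tree.Descendants("SeasonNumber").Count()); // 2
    }
}
```

Elements is the cheaper and stricter query. Descendants is more forgiving if the layout of the response is not known in advance.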

Answer 3

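&#39 without a trailing semicolon is an HTML escape that only certain browsers tolerate; it is not a valid XML character reference. Use &apos; instead, which is the predefined XML escape sequence (the numeric form &#39; with the semicolon is also valid).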
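A quick way to check which apostrophe escapes the .NET parser accepts is to feed it small fragments. This is a sketch; the ApostropheEscapes class and the TryParseText helper are made up for illustration.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class ApostropheEscapes
{
    // Returns the element's text, or null if the fragment is not well-formed XML.
    public static string TryParseText(string xml)
    {
        try
        {
            return XElement.Parse(xml).Value;
        }
        catch (XmlException)
        {
            return null;
        }
    }

    static void Main()
    {
        // Predefined entity and numeric reference, both terminated by a semicolon.
        Console.WriteLine(TryParseText("<t>It&apos;s</t>")); // It's
        Console.WriteLine(TryParseText("<t>It&#39;s</t>"));  // It's

        // Without the semicolon the reference is malformed and parsing fails.
        Console.WriteLine(TryParseText("<t>It&#39s</t>") == null); // True
    }
}
```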

It looks likely that you got "smart quoted" by one of those annoying Microsoft products that change all your quotes and apostrophes to curly ones. Such text often claims to be ISO-8859-1/Latin-1 but is really Windows-1252, which puts printable characters (including the curly quotes) where Latin-1 has the C1 control codes. If that is the case, only a Windows-1252 decoder is going to parse that document for you. Alternatively, you can swap the curly apostrophe for a regular one and all will be OK.
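The byte-level picture behind the smart-quote theory can be sketched as follows. The assumption here is that the curly apostrophe arrived as the single Windows-1252 byte 0x92; the SmartQuoteBytes class is hypothetical.

```csharp
using System;
using System.Text;

static class SmartQuoteBytes
{
    // Decodes one byte with the given encoding and returns the resulting code point.
    public static int DecodeSingleByte(byte value, Encoding encoding)
    {
        return encoding.GetString(new[] { value })[0];
    }

    static void Main()
    {
        // Assumption: this is how Windows-1252 stores the right single quote (U+2019).
        const byte curlyApostrophe = 0x92;

        // ISO-8859-1 maps the byte to a C1 control character, not to a quote.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine("U+{0:X4}", DecodeSingleByte(curlyApostrophe, latin1)); // U+0092

        // A lone 0x92 is not valid UTF-8; the default decoder substitutes U+FFFD.
        Console.WriteLine("U+{0:X4}", DecodeSingleByte(curlyApostrophe, Encoding.UTF8)); // U+FFFD

        // Correctly encoded UTF-8 uses three bytes for the same character.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u2019"))); // E2-80-99
    }
}
```

Decoding with Windows-1252 itself would be Encoding.GetEncoding(1252). On .NET Core and later that call is expected to need the code-pages encoding provider registered first, which is why the sketch leaves it out.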

Answer 4

I've found the solution and it's somewhat anticlimactic. The episodes weren't retrieved because my API string was incomplete: it was supposed to end with /all/ , but I must have dropped that somewhere and kept copying the shortened version from that point forward. It was the last place I looked.

After changing the API call I can now retrieve all episodes. There are no more encoding errors (even though I changed nothing about the encoding), and so far it has retrieved 4000 episodes, so I'm assuming the rest will go through without issues as well.
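A small guard would have caught the incomplete request earlier. The sketch below is hypothetical: the TvdbRequest class, its method names, and the literal "apikey" placeholder are assumptions based on the URL shown in the question, not part of any official client.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class TvdbRequest
{
    // Builds the full-record URL, including the trailing /all/ segment
    // that was missing from the original request.
    public static string BuildAllRecordsUrl(string apiKey, string seriesId)
    {
        return string.Format("http://thetvdb.com/api/{0}/series/{1}/all/", apiKey, seriesId);
    }

    // True if the parsed response contains at least one Episode element.
    public static bool HasEpisodes(XDocument tree)
    {
        return tree.Descendants("Episode").Any();
    }

    static void Main()
    {
        Console.WriteLine(BuildAllRecordsUrl("apikey", "12345"));

        // Hypothetical series-only response, as seen with the incomplete URL.
        XDocument seriesOnly = XDocument.Parse("<Data><Series><id>12345</id></Series></Data>");
        if (!HasEpisodes(seriesOnly))
        {
            Console.WriteLine("No Episode elements: check that the request URL ends with /all/.");
        }
    }
}
```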

Someone made this a community wiki: I'm not sure if that status is still warranted, seeing as this was a localized issue. I've learned a lot about XML/APIs from these conversations though, thanks to everyone involved!

solution1
1 ACCEPTED 2013-06-23 11:10:26

solution2
0 2013-06-23 01:42:34

solution3
0

solution4
0