Xml exception due to leading unicode character in REST API response

Question

When I try to parse a response from a certain REST API, I'm getting an XmlException saying "Data at the root level is invalid. Line 1, position 1." Looking at the XML it looks fine, but then examining the first character I see that it is actually a zero-width no-break space (character code 65279 or 0xFEFF).

Is there any good reason for that character to be there? Maybe I'm supposed to be setting a different Encoding when I make my request? Currently I'm using Encoding.UTF8.

I've thought about just removing the character from the string, or asking the developer of the REST API to fix it, but before I do either of those things I wanted to check if there is a valid reason for that character to be there. I'm no Unicode expert. Is there something different I should be doing?

Edit: I suspected that it might be something like that (BOM). So, the question becomes, should I have to deal with this character specially? I've tried loading the XML two ways and both throw the same exception:

public static User GetUser()
{
    WebClient req = new WebClient();
    req.Encoding = Encoding.UTF8;
    string response = req.DownloadString(url);

    XmlSerializer ser = new XmlSerializer(typeof(User));
    User user = ser.Deserialize(new StringReader(response)) as User;

    XElement xUser = XElement.Parse(response);

    ...

    return user;
}

Answer 1

U+FEFF is a byte order mark. It's there at the start of the document to indicate the character encoding (or rather, the byte order of an encoding which could be either way; most specifically UTF-16). It's entirely reasonable for it to be there at the start of an XML document. Its use as a zero-width non-breaking space is deprecated in favour of U+2060.

It would be unreasonable if the byte-order mark were in a different encoding from the one the document declares, e.g. if it were a UTF-16 BOM in a document which claimed to be UTF-8.
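To make the byte-order point concrete, here is a minimal sketch that encodes the single character U+FEFF in three encodings and prints the resulting bytes. The names BomBytes and Hex are made up for this illustration.

```csharp
using System;
using System.Text;

static class BomBytes
{
    // Encodes the single character U+FEFF and returns its bytes as hex.
    public static string Hex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes("\uFEFF"));
    }

    static void Main()
    {
        Console.WriteLine(Hex(Encoding.UTF8));             // EF-BB-BF
        Console.WriteLine(Hex(Encoding.Unicode));          // FF-FE (UTF-16, little-endian)
        Console.WriteLine(Hex(Encoding.BigEndianUnicode)); // FE-FF (UTF-16, big-endian)
    }
}
```

The two UTF-16 results are the same character with the bytes swapped, which is what lets a reader tell the byte order apart. In UTF-8 there is only one possible byte sequence, so there the mark acts purely as an encoding signature.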

How are you loading the document? Perhaps you're specifying an inappropriate encoding somewhere? It's best to let the XML API detect the encoding if at all possible.

EDIT: After you've downloaded it as a string, I can imagine that could cause problems... given that it's used to detect the encoding, which you've already got. Don't download it as a string - download it as binary data (WebClient.DownloadData) and then you should be able to parse it okay, I believe. However, you probably still shouldn't use XElement.Parse as there may well be a document declaration - use XDocument.Parse. I'd be slightly surprised if the result of the call could be fed straight into XmlSerializer, but you can have a go... wrap it in a MemoryStream if necessary.
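A rough sketch of the binary route might look like the following. It uses XDocument.Load on a stream rather than XDocument.Parse, because Parse only accepts a string. The class name BomSafeParse, the method FromBytes, and the sample User/Name document are assumptions made for illustration, and the byte array stands in for whatever WebClient.DownloadData returns.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

static class BomSafeParse
{
    // Parses raw response bytes; the XML reader inspects the leading
    // bytes (including any BOM) to work out the encoding itself.
    public static XDocument FromBytes(byte[] data)
    {
        using (MemoryStream stream = new MemoryStream(data))
        {
            return XDocument.Load(stream);
        }
    }

    static void Main()
    {
        // Stand-in for WebClient.DownloadData(url): UTF-8 BOM followed by XML.
        byte[] bom  = { 0xEF, 0xBB, 0xBF };
        byte[] body = Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><User><Name>Ann</Name></User>");
        byte[] data = new byte[bom.Length + body.Length];
        bom.CopyTo(data, 0);
        body.CopyTo(data, bom.Length);

        XDocument doc = FromBytes(data);
        Console.WriteLine(doc.Root.Name); // User
    }
}
```

Because the reader sees bytes rather than an already-decoded string, the BOM is consumed as part of encoding detection and never reaches the parser as a character.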

Answer 2

That is called a Byte Order Mark. It's not required in UTF-8 though.
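Since the mark is optional in UTF-8, one option the question raises is simply dropping it from an already-decoded string. A minimal sketch, with BomStrip and StripLeadingBom as made-up names:

```csharp
using System;

static class BomStrip
{
    // Removes one leading U+FEFF if present. The check compares chars
    // directly: a culture-sensitive StartsWith("\uFEFF") can treat the
    // character as ignorable and match strings that do not contain it.
    public static string StripLeadingBom(string text)
    {
        if (!string.IsNullOrEmpty(text) && text[0] == '\uFEFF')
        {
            return text.Substring(1);
        }
        return text;
    }

    static void Main()
    {
        string response = "\uFEFF<User />";
        Console.WriteLine(StripLeadingBom(response)); // <User />
    }
}
```

This only treats the symptom. It assumes the string was decoded with the right encoding in the first place, which parsing from the raw bytes does not need to assume.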

Answer 3

Instead of using Encoding.UTF8, create your own UTF-8 encoder, using the constructor overload that lets you specify whether or not the BOM is to be emitted:

req.Encoding = new UTF8Encoding( false ) ; // omit the BOM

I believe that will do the trick for you.
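For reference, here is a small sketch of what the constructor flag changes on the encoding object itself. It only exercises UTF8Encoding, not WebClient, so whether DownloadString ends up stripping the character is not something it can show. Utf8BomFlag and its methods are hypothetical names.

```csharp
using System;
using System.Text;

static class Utf8BomFlag
{
    // Length of the BOM this encoding would emit when writing.
    public static int PreambleLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }

    // True if decoding the bytes leaves U+FEFF as the first character.
    public static bool DecodedStartsWithBom(Encoding encoding, byte[] data)
    {
        string text = encoding.GetString(data);
        return text.Length > 0 && text[0] == '\uFEFF';
    }

    static void Main()
    {
        byte[] data = { 0xEF, 0xBB, 0xBF, 0x3C, 0x61, 0x2F, 0x3E }; // BOM + "<a/>"
        Encoding withBom = new UTF8Encoding(true);
        Encoding noBom   = new UTF8Encoding(false);

        Console.WriteLine(PreambleLength(withBom)); // 3
        Console.WriteLine(PreambleLength(noBom));   // 0
        Console.WriteLine(DecodedStartsWithBom(withBom, data)); // True
        Console.WriteLine(DecodedStartsWithBom(noBom, data));   // True
    }
}
```

The flag controls the preamble that is written out. GetString decodes a BOM already present in the input as U+FEFF under either setting.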

Amended to note: the following will work:

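// Requires: System.IO, System.Net, System.Text, System.Xml, System.Xml.Serialization
public static User GetUser()
{
  WebClient req   = new WebClient();
  req.Encoding    = Encoding.UTF8;           // only used by the string-based methods; DownloadData ignores it
  byte[] response = req.DownloadData(url);   // raw bytes, BOM included

  User instance ;

  // XmlReader reads the BOM from the stream and detects the encoding itself.
  using ( MemoryStream stream = new MemoryStream(response) )
  using ( XmlReader    reader = XmlReader.Create( stream ) )
  {
    XmlSerializer serializer = new XmlSerializer(typeof(User)) ;
    instance = (User) serializer.Deserialize( reader ) ;
  }

  return instance ;
}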

Answer 4

That character at the beginning is the BOM (Byte Order Mark). It's placed as the first character in Unicode text files to specify which encoding was used to create the file.

The BOM should not be part of the response, as the encoding is specified differently for HTTP content.
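For HTTP, the encoding normally travels in the charset parameter of the Content-Type header. A hedged sketch of reading it with System.Net.Mime.ContentType follows; HttpCharset and FromContentType are made-up names, and the header value is a sample rather than anything taken from the API in question.

```csharp
using System;
using System.Net.Mime;
using System.Text;

static class HttpCharset
{
    // Returns the encoding named by the charset parameter of a
    // Content-Type header value, or the fallback if none is given.
    // Encoding.GetEncoding throws ArgumentException for unknown names.
    public static Encoding FromContentType(string headerValue, Encoding fallback)
    {
        ContentType contentType = new ContentType(headerValue);
        return string.IsNullOrEmpty(contentType.CharSet)
            ? fallback
            : Encoding.GetEncoding(contentType.CharSet);
    }

    static void Main()
    {
        Encoding enc = FromContentType("application/xml; charset=utf-8", Encoding.UTF8);
        Console.WriteLine(enc.WebName); // utf-8
    }
}
```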

Typically a BOM in the response comes from sending a text file as the response, where the text file was saved with the BOM signature. Visual Studio, for example, has an option to save a file without the BOM signature so that it can be sent directly as a response.

4 answers

solution1
3 2011-06-03 18:46:00

solution2
2 2011-06-03 18:46:54

solution3
1 ACCEPTED 2011-06-03 19:53:46

solution4
0 2011-06-03 18:46:42
