
Encoding of ASCII string in UTF8 XML document in Byte array

I have the following requirements:

...The document must be encoded in UTF-8... The Lastname field only allows (Extended) ASCII... City only allows ISOLatin1... The message must be put on the (IBM Websphere) MessageQueue as an IBytesMessage

The XML document, for simplicities sake, looks like this:

<?xml version="1.0" encoding="utf-8"?>
<foo>
  <lastname>John ÐØë</lastname>
  <city>John ÐØë</city>
  <other>UTF-8 string</other>
</foo>

The "ÐØë" characters are (or should be) the extended ASCII values 208, 216 and 235 respectively.

I also have an object:

public class foo {
  public string lastname { get; set; }
  public string city { get; set; }
  public string other { get; set; }
}

So I instantiate an object and set the lastname:

var x = new foo() { lastname = "John ÐØë", city = "John ÐØë" };

Now this is where my headache sets in (or the inception if you will...):

  • Visual Studio / source code is in Unicode
  • Hence: the object has a Unicode lastname
  • The XML serializer uses UTF-8 to encode the document
  • Lastname should contain only (Extended) ASCII characters; the characters are valid ASCII chars, but of course in UTF-8 encoded form

I normally don't experience any trouble with my encodings; I am familiar with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) but this one's got me stumped...

I understand that the UTF-8 document will be perfectly able to "contain" both encodings because the codepoints 'overlap'. But where I get lost is when I need to convert the serialized message to a byte array. When doing a dump I see C3 XX C3 XX C3 XX (I don't have the actual dump at hand). It's clear (or I've been staring at this for too long) that the lastname / city strings are put into the serialized document in their Unicode form; the byte array suggests so.
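A quick sketch along these lines reproduces what I'm seeing (the sample string is just illustrative):

using System;
using System.Text;

class EncodingDump
{
    static void Main()
    {
        string name = "ÐØë";

        // UTF-8 encodes each of these characters as two bytes: C3-90 C3-98 C3-AB
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(name)));

        // ISO-8859-1 (Latin-1) encodes them as one byte each: D0-D8-EB, i.e. 208, 216, 235
        Console.WriteLine(BitConverter.ToString(Encoding.GetEncoding("iso-8859-1").GetBytes(name)));
    }
}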

Now what will I have to do, and where, to ensure the Lastname string goes into the XML document and finally the byte array as an ASCII string (i.e. the actual 208, 216, 235 byte sequence), and that City makes it in there as ISOLatin1?

I know the requirements are backwards, but I can't change those (3rd party). I always use UTF-8 for our internal projects, so I have to support the Unicode/UTF-8 => ASCII/ISOLatin1 conversion (of course, only for chars that are in those sets).

My head hurts...

Never mind how the XML document is encoded for transmission. The right way to do what you want to do—encode certain non-ASCII characters so they survive the trip unscathed—is to use XML character references to represent the characters that need to be so preserved. For instance, your

ÐØë

is represented using XML character references as

&#x00D0;&#x00D8;&#x00EB;

The receiving [conformant] XML processor will/should/must convert those numeric character references back to the characters they represent. Here's some code that will do the trick:

public static string ConvertToXmlCharacterReference( this string xml )
{
  StringBuilder sb  = new StringBuilder( s.Length ) ;
  const char    SP  = '\u0020' ; // anything lower than SP is a control character
  const char    DEL = '\u007F' ; // anything above DEL isn't ASCII, per se.

  foreach( char ch in xml )
  {
    bool isPrintableAscii = ch >= SP && ch <= DEL ;

    if ( isPrintableAscii ) { sb.Append(ch)                             ; }
    else                    { sb.AppendFormat( "&#x{0:X4}" , (int) ch ) ; }

  }

  string instance = sb.ToString() ;
  return instance ;
}

You could also use a regular expression to make the replacement or write an XSLT that would do the same thing. But the task is so trivial, it doesn't really warrant that sort of approach. The above code is probably faster and less memory intensive and...it's easier to understand.

You should note though that since you want to preserve two different encodings in the same document, your conversion routine will need to differentiate between the conversion from "extended ASCII" to an XML character reference and the conversion from "ISO Latin 1" to an XML character reference.

In both cases, the character reference specifies a codepoint in the ISO/IEC 10646 character set — essentially Unicode. You'll want to map the characters to the appropriate codepoint. Since strings in the CLR world are UTF-16 encoded, that shouldn't be much of an issue. The above code should work fine, I believe, unless you've got something really oddball that doesn't play very nicely with UTF-16.

I understand this as 2 separate requirements:

1) The XML must be UTF-8 encoded;

2) The City name is limited to ISOLatin1.

This means that when you decode the UTF-8 back to Unicode, the City characters come only from the ISOLatin1 set. In other words, all of the text in the XML can be drawn from the ISOLatin1 code table, yet the document itself is still UTF-8 encoded. ISOLatin1 is a small part of the Unicode table, and UTF-8 is an 8-bit encoding of Unicode.

So... System.Text.Encoding.ASCII.GetBytes(string) might do what you want: it converts a string into an ASCII-encoded byte array.
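For example, a small sketch of what that call actually produces for the sample value (note that characters outside ASCII are replaced rather than preserved):

using System;
using System.Text;

class AsciiGetBytesDemo
{
    static void Main()
    {
        // Ð, Ø and ë fall outside the 0-127 range, so the default ASCII
        // encoder substitutes '?' (0x3F) for each of them.
        byte[] ascii = Encoding.ASCII.GetBytes("John ÐØë");
        Console.WriteLine(BitConverter.ToString(ascii)); // 4A-6F-68-6E-20-3F-3F-3F
    }
}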

You simply can't have the 208, 216, 235 byte sequence in a UTF-8 encoded string/byte array.

I hope you can save the XML as ISO 8859-1, with or without mentioning it in the <?xml version="1.0" encoding="XXXXXXXXXX"?> processing instruction (maybe even declaring UTF-8 in the XML header, even though the bytes wouldn't actually be valid UTF-8).

Otherwise, if your requirements really are as you stated, just ask for the exact expected byte array for a given input and craft your own custom serialization (or maybe a custom encoding; I'm not sure whether that is even possible).

The document must be encoded in UTF-8. The Lastname field only allows ASCII. City only allows ISOLatin1. The message must be put on the (IBM Websphere) MessageQueue as an IBytesMessage.

If that is the precise specification, then I think you might be misunderstanding it. Your task is not one of encoding, but one of validation/fallback. The entire document – including the Lastname and City fields – must be encoded as UTF-8. Quite simply, the XML document would be invalid if it declares its encoding as UTF-8 and then contains byte values that are not valid under that encoding.

Conveniently, ASCII overlaps with the first 128 codepoints of Unicode; Latin1 overlaps with the first 256.

To check whether Lastname can be represented as ASCII, you could check that all its characters have codepoints within the 0–127 range.

bool isLastnameAscii = foo.Lastname.All(c => (int)c < 128);

To conform with your specification, you would have to force invalid characters to fall back to the replacement character (typically ?) by encoding the string as ASCII and then decoding it back:

foo.Lastname = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(foo.Lastname));

Similarly for City:

bool isCityLatin1 = foo.City.All(c => (int)c < 256);

Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
foo.City = latin1.GetString(latin1.GetBytes(foo.City));

Subsequently, you should just save everything as UTF-8.
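As a sketch of that final step (assuming the foo class is XML-serializable as in the question, and that the queue just needs the raw bytes; the FooSerializer/ToUtf8Bytes names are placeholders):

using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public static class FooSerializer
{
    // Serializes the (already validated) foo instance to a UTF-8 encoded
    // byte array, ready to be handed to the bytes-message API of the queue.
    public static byte[] ToUtf8Bytes(foo value)
    {
        var serializer = new XmlSerializer(typeof(foo));
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(false), // UTF-8 without a BOM
            Indent = true
        };

        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                serializer.Serialize(writer, value);
            }
            return stream.ToArray();
        }
    }
}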

My assumption is that your third-party software can correctly decode the XML document using UTF-8; however, it must then extract the Lastname and City fields, and use them somewhere where only ASCII and Latin1 are allowed. It imposes the restrictions on you in order to ensure that it would not be forced to incur data loss (because of the presence of disallowed characters).

Edit: This is the workaround that you're proposing. I'm using Latin1 in place of “Extended ASCII” because the latter term is ambiguous.

var x = new foo() { lastname = "John ÐØë", city = "John ÐØë", other = "—" };

using (var stream = new MemoryStream())
using (var utf8writer = new StreamWriter(stream, Encoding.UTF8))            
using (var latin1writer = new StreamWriter(stream, Encoding.GetEncoding("iso-8859-1")))
{
    utf8writer.WriteLine("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
    utf8writer.WriteLine("<foo>");
    utf8writer.Flush();

    latin1writer.WriteLine("  <lastname>" + SecurityElement.Escape(x.lastname) + "</lastname>");
    latin1writer.WriteLine("  <city>" + SecurityElement.Escape(x.city) + "</city>");
    latin1writer.Flush();

    utf8writer.WriteLine("  <other>" + SecurityElement.Escape(x.other) + "</other>");
    utf8writer.WriteLine("</foo>");
    utf8writer.Flush();

    byte[] bytes = stream.ToArray();
}

SecurityElement.Escape replaces invalid XML characters in a string with their valid XML equivalents (e.g. < to &lt; and & to &amp;).

The accepted answer from Nicholas Carey is OK, but it has errors and the code doesn't work. I don't have enough reputation to comment, so I will post working code here:

public static string ConvertToXmlCharacterReference(string xml)
{
    StringBuilder sb = new StringBuilder();
    const char SP  = '\u0020'; // anything lower than SP is a control character
    const char DEL = '\u007F'; // anything above DEL isn't ASCII, per se.

    foreach (char ch in xml)
    {
        bool isPrintableAscii = ch >= SP && ch <= DEL;
        if (isPrintableAscii)
        {
            sb.Append(ch);
        }
        else
        {
            sb.AppendFormat("&#x{0:X4};", (int) ch);
        }
    }

    return sb.ToString();
}
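For example, calling it on the question's sample value produces the escaped form the accepted answer describes:

string escaped = ConvertToXmlCharacterReference("John ÐØë");
Console.WriteLine(escaped); // John &#x00D0;&#x00D8;&#x00EB;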
