Read the encoding identifier with StreamReader

I am reading a C# book and in the chapter about streams it says:

If you explicitly specify an encoding, StreamWriter will, by default, write a prefix to the start of the stream to identify the encoding. This is usually undesirable and you can prevent it by constructing the encoding as follows:

var encoding = new UTF8Encoding (encoderShouldEmitUTF8Identifier:false, throwOnInvalidBytes:true);

I'd like to actually see how the identifier looks so I came up with this code:

            using (FileStream fs = File.Create ("test.txt"))
            using (TextWriter writer = new StreamWriter (fs,new UTF8Encoding(true,false)))
            {
                writer.WriteLine ("Line1");
            }

            using (FileStream fs = File.OpenRead ("test.txt"))
            using (TextReader reader = new StreamReader (fs))
            {
                for (int b; (b = reader.Read()) > -1;)
                    Console.WriteLine (b + " " + (char)b);  // identifier not printed
            }

To my disappointment, no identifier was printed. How do I read the identifier? Am I missing something?

By default, .NET tries very hard to insulate you from encoding details. If you want to see the byte order mark (BOM, also called the "preamble"), you have to explicitly disable that automatic behavior in two places: read with an encoding that does not declare a preamble, and tell StreamReader not to detect the encoding from the BOM.
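As a quick sanity check that the BOM really is written to the stream, one option is to bypass StreamReader entirely and look at the raw bytes. The following is a minimal sketch; the class name BomBytes and the helper Encode are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class BomBytes
{
    // Hypothetical helper (not from the original post): writes the text
    // through a StreamWriter and returns the raw bytes that were produced.
    public static byte[] Encode(string text, bool emitBom)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            using (TextWriter writer = new StreamWriter(stream, new UTF8Encoding(emitBom)))
            {
                writer.Write(text);
            }
            // MemoryStream.ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }

    static void Main()
    {
        // With the identifier: the three BOM bytes come first.
        Console.WriteLine(BitConverter.ToString(Encode("Line1", emitBom: true)));   // EF-BB-BF-4C-69-6E-65-31
        // Without the identifier: only the text itself.
        Console.WriteLine(BitConverter.ToString(Encode("Line1", emitBom: false)));  // 4C-69-6E-65-31
    }
}
```

The bytes EF BB BF are the UTF-8 encoding of the single character U+FEFF, which is the character a reader hands back if it does not strip the BOM.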

Here is a variation of your original code that will display the BOM:

using (MemoryStream stream = new MemoryStream())
{
    Encoding encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);

    using (TextWriter writer = new StreamWriter(stream, encoding, bufferSize: 8192, leaveOpen: true))
    {
        writer.WriteLine("Line1");
    }

    stream.Position = 0;
    encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

    using (TextReader reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: false))
    {
        for (int b; (b = reader.Read()) > -1;)
            Console.WriteLine(b + " " + (char)b);  // the first value printed is the BOM, U+FEFF (65279)
    }
}

Here, encoderShouldEmitUTF8Identifier: true is passed to the encoding used to write the stream, so that the BOM is emitted when the text is written. When the stream is read back, encoderShouldEmitUTF8Identifier: false is passed to the encoding instead, so that the reader has no preamble to strip and the BOM is treated as a normal character. The detectEncodingFromByteOrderMarks: false argument is passed to the StreamReader constructor as well, so that the reader won't consume the BOM itself while trying to detect the encoding.

This produces the following output, with the BOM (65279, i.e. U+FEFF) on the first line:

65279 ?
76 L
105 i
110 n
101 e
49 1
13
10
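To see the effect of the reader settings side by side, here is a minimal sketch that reads the same bytes twice: once with a default StreamReader and once with the explicit configuration used above. The class BomReaderContrast and its helper methods are hypothetical names introduced for this comparison only.

```csharp
using System;
using System.IO;
using System.Text;

static class BomReaderContrast
{
    // The UTF-8 BOM (EF BB BF) followed by the bytes of "Line1".
    static readonly byte[] Data = { 0xEF, 0xBB, 0xBF, 0x4C, 0x69, 0x6E, 0x65, 0x31 };

    // Hypothetical helper: returns the code of the first character
    // that the supplied reader yields for Data.
    static int FirstChar(Func<Stream, StreamReader> makeReader)
    {
        using (MemoryStream stream = new MemoryStream(Data))
        using (StreamReader reader = makeReader(stream))
        {
            return reader.Read();
        }
    }

    // Default reader: detects the BOM and silently consumes it.
    public static int FirstCharDefault()
    {
        return FirstChar(s => new StreamReader(s));
    }

    // Explicit reader: no preamble on the encoding, no BOM detection.
    public static int FirstCharExplicit()
    {
        return FirstChar(s => new StreamReader(
            s, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false),
            detectEncodingFromByteOrderMarks: false));
    }

    static void Main()
    {
        Console.WriteLine(FirstCharDefault());   // 76, the letter 'L'
        Console.WriteLine(FirstCharExplicit());  // 65279, the BOM as a character
    }
}
```

The default reader is what the code in the question uses, which is why the identifier never showed up there.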

It is worth mentioning that using the BOM to identify UTF-8 is generally discouraged. The BOM mainly exists so that the two variants of UTF-16 can be distinguished (i.e. UTF-16LE and UTF-16BE, "little endian" and "big endian", respectively). It has been co-opted as a means of identifying UTF-8 as well, but really it's better to just know what the encoding is. That is why formats like XML and HTML state the encoding by name, in ASCII characters, in the first part of the file, and why MIME's charset parameter exists. A single character isn't nearly as reliable as these more explicit means.
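The byte-order role of the BOM is easy to see by printing the preambles that .NET's built-in encodings declare. This is only an illustrative sketch; the class Preambles and the helper Hex are hypothetical names.

```csharp
using System;
using System.Text;

static class Preambles
{
    // Hypothetical helper: formats an encoding's preamble as hex bytes.
    public static string Hex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    static void Main()
    {
        // The same character, U+FEFF, serialized in each encoding.
        Console.WriteLine("UTF-16LE: " + Hex(Encoding.Unicode));           // FF-FE
        Console.WriteLine("UTF-16BE: " + Hex(Encoding.BigEndianUnicode));  // FE-FF
        Console.WriteLine("UTF-8:    " + Hex(Encoding.UTF8));              // EF-BB-BF
    }
}
```

For UTF-16 the two byte orders give visibly different preambles, which is what lets a reader tell them apart. UTF-8 has only one byte order, so its preamble carries no such information.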
