
How do you read UTF-8 characters from an infinite byte stream - C#

Normally, to read characters from a byte stream you use a StreamReader. In this example I'm reading records delimited by '\r' from an infinite stream.

using(var reader = new StreamReader(stream, Encoding.UTF8))
{
    var messageBuilder = new StringBuilder();
    var nextChar = 'x';
    while (reader.Peek() >= 0)
    {
        nextChar = (char)reader.Read();
        messageBuilder.Append(nextChar);

        if (nextChar == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }
}

The problem is that the StreamReader has a small internal buffer, so if the code is waiting for an 'end of record' delimiter ('\r' in this case), it has to wait until the StreamReader's internal buffer is flushed (usually because more bytes have arrived).

This alternative implementation works for single-byte UTF-8 characters, but will fail on multi-byte characters.

int byteAsInt = 0;
var messageBuilder = new StringBuilder();
while ((byteAsInt = stream.ReadByte()) != -1)
{
    var nextChar = Encoding.UTF8.GetChars(new[]{(byte) byteAsInt});
    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);

    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}

How can I modify this code so that it works with multi-byte characters?

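Get a Decoder instance and repeatedly call its GetChars member method rather than Encoding.UTF8.GetChars. The Decoder uses its internal buffer to carry a partial multi-byte sequence over from the end of one call to the next.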
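
To make the difference concrete, here is a minimal sketch that feeds one three-byte character to a Decoder one byte at a time and records how many chars each call returns. The class and method names (DecoderDemo, CharsPerByte) are made up for this illustration, and sizing the output buffer with Encoding.UTF8.GetMaxCharCount(1) is my assumption about a safe size, not something stated in the answer.

```csharp
using System;
using System.Text;

static class DecoderDemo
{
    // Hypothetical helper: decodes `bytes` one byte per call and returns
    // the number of chars each call produced. Decoded text goes to `output`.
    public static int[] CharsPerByte(byte[] bytes, StringBuilder output)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        // Assumption: worst-case output size for a single input byte.
        var chars = new char[Encoding.UTF8.GetMaxCharCount(1)];
        var counts = new int[bytes.Length];

        for (int i = 0; i < bytes.Length; i++)
        {
            // The decoder remembers incomplete sequences between calls.
            counts[i] = decoder.GetChars(bytes, i, 1, chars, 0);
            output.Append(chars, 0, counts[i]);
        }
        return counts;
    }

    static void Main()
    {
        // U+65E5 is encoded in UTF-8 as the three bytes E6 97 A5.
        byte[] bytes = { 0xE6, 0x97, 0xA5 };
        var text = new StringBuilder();
        int[] counts = CharsPerByte(bytes, text);

        Console.WriteLine(string.Join(",", counts)); // 0,0,1
    }
}
```

The first two calls return 0 because the sequence is still incomplete; only the third call yields the character.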

Thanks to Richard, I now have a working infinite stream reader. As he explained, the trick is to use a Decoder instance and call its GetChars method. I've tested it with multi-byte Japanese text and it works fine.

int byteAsInt = 0;
var messageBuilder = new StringBuilder();
var decoder = Encoding.UTF8.GetDecoder();
var nextChar = new char[1];

while ((byteAsInt = stream.ReadByte()) != -1)
{
    var charCount = decoder.GetChars(new[] {(byte) byteAsInt}, 0, 1, nextChar, 0);
    if(charCount == 0) continue;

    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);

    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}

I don't understand why you're not using the stream reader's ReadLine method. If there's a good reason not to, however, it nonetheless seems to me that repeatedly calling GetChars on the decoder is inefficient. Why not make use of the fact that the byte representation of '\r' can't be part of a multi-byte sequence? (Bytes in a multi-byte sequence must be greater than 127; that is, they have the highest bit set.)

var messageBuilder = new List<byte>();

int byteAsInt;
while ((byteAsInt = stream.ReadByte()) != -1)
{
    messageBuilder.Add((byte)byteAsInt);

    if (byteAsInt == '\r')
    {
        var messageString = Encoding.UTF8.GetString(messageBuilder.ToArray());
        Console.Write(messageString);
        ProcessBuffer(messageString);
        messageBuilder.Clear();
    }
}
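
The property this relies on can be checked directly. The following is a small sketch, with made-up helper names (DelimiterCheck, AllBytesAbove127, CountCarriageReturnBytes), that encodes some non-ASCII text and confirms that every byte is above 127 and that the only 0x0D bytes in the encoded data come from real '\r' characters.

```csharp
using System;
using System.Linq;
using System.Text;

static class DelimiterCheck
{
    // True when every byte of the UTF-8 encoding of `text` has its high bit set.
    public static bool AllBytesAbove127(string text) =>
        Encoding.UTF8.GetBytes(text).All(b => b > 127);

    // Counts raw 0x0D bytes in the UTF-8 encoding of `text`.
    public static int CountCarriageReturnBytes(string text) =>
        Encoding.UTF8.GetBytes(text).Count(b => b == 0x0D);

    static void Main()
    {
        // Three-byte characters only: no byte falls in the ASCII range.
        Console.WriteLine(AllBytesAbove127("日本語")); // True

        // Two records, one of them containing a four-byte character (U+1F600).
        Console.WriteLine(CountCarriageReturnBytes("日本語\r\U0001F600\r")); // 2
    }
}
```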

Mike, I found your solution perfect for my situation as well. But I noticed that sometimes it takes four GetChars() calls to determine the characters to be returned. This meant that charCount was 2, while my nextChar buffer size was 1, so I got the error "The output character buffer is too small to contain the decoded characters, encoding Unicode fallback System.Text.DecoderReplacementFallback."

I changed my code to:

    // ...
    var nextChar = new char[4];  // 2 might suffice

    for (var i = startPos; i < bytesRead; i++)
    {
        int charCount;
        //...
        charCount = decoder.GetChars(buffer, i, 1, nextChar, 0);

        if (charCount == 0)
        {
            bytesSkipped++;
            continue;
        }

        for (int ic = 0; ic < charCount; ic++)
        {
            char c = nextChar[ic];
            charPos++;

            // Process character here...
        }
    }
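
For reference, here is a self-contained sketch of the same idea applied to the '\r'-delimited reader from the question. A four-byte UTF-8 sequence decodes to a surrogate pair (two chars), which is why a single GetChars call can return 2. The names Utf8RecordReader and ReadRecords are invented for this example, and the buffer size taken from Encoding.UTF8.GetMaxCharCount(1) is an assumption about the worst case for one input byte; treat it as one possible arrangement rather than the code used above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class Utf8RecordReader
{
    // Hypothetical helper: yields one string per '\r'-terminated record,
    // decoding the stream one byte at a time.
    public static IEnumerable<string> ReadRecords(Stream stream)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var oneByte = new byte[1];
        // Assumption: large enough for whatever one byte can produce,
        // including a surrogate pair.
        var chars = new char[Encoding.UTF8.GetMaxCharCount(1)];
        var record = new StringBuilder();

        int byteAsInt;
        while ((byteAsInt = stream.ReadByte()) != -1)
        {
            oneByte[0] = (byte)byteAsInt;
            int charCount = decoder.GetChars(oneByte, 0, 1, chars, 0);

            // charCount is 0 while a multi-byte sequence is still incomplete.
            for (int i = 0; i < charCount; i++)
            {
                record.Append(chars[i]);
                if (chars[i] == '\r')
                {
                    yield return record.ToString();
                    record.Clear();
                }
            }
        }
    }

    static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("日本語\r\U0001F600\r");
        using (var stream = new MemoryStream(data))
        {
            foreach (string record in ReadRecords(stream))
            {
                // Lengths are in UTF-16 chars, delimiter included: 4, then 3.
                Console.WriteLine(record.Length);
            }
        }
    }
}
```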
