Change StreamReader Encoding while reading from NetworkStream

I am trying to read an email from a POP3 server and switch to the correct encoding once I find the charset in the headers.

I use a TCP client to connect to the POP3 server.

Below is my code:

    public string ReadToEnd(POP3Client pop3client, out System.Text.Encoding messageEncoding)
    {
        messageEncoding = TCPStream.CurrentEncoding;
        if (EOF)
            return ("");

        System.Text.StringBuilder sb = new System.Text.StringBuilder(m_bytetotal * 2);
        string st = "";
        string tmp;

        do
        {
            tmp = TCPStream.ReadLine();
            if (tmp == ".")
                EOF = true;
            else
                sb.Append(tmp + "\r\n");

            //st += tmp + "\r\n";

            m_byteread += tmp.Length + 2; // CRLF discarded by read

            FireReceived();

            if (tmp.ToLower().Contains("content-type:") && tmp.ToLower().Contains("charset="))
            {
                try
                {
                    string charSetFound = tmp.Substring(tmp.IndexOf("charset=") + "charset=".Length).Replace("\"", "").Replace(";", "");
                    var realEnc = System.Text.Encoding.GetEncoding(charSetFound);

                    if (realEnc != TCPStream.CurrentEncoding)
                    {
                        TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);
                    }
                }
                catch { }
            }                
        } while (!EOF);

        messageEncoding = TCPStream.CurrentEncoding;

        return (sb.ToString());
    }

If I remove this line:

TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);

Everything works fine, except that when the e-mail contains characters from a different charset I get question marks, because the initial encoding is ASCII.

Any suggestions on how to change the encoding while reading data from the NetworkStream?

You're doing it wrong (tm).

Seriously, though, you are going about solving this problem in completely the wrong way. Don't use a StreamReader for this. And especially don't read 1 byte at a time (as you said you needed to do in a comment on an earlier "solution").

For an explanation of why not to use a StreamReader, besides the obvious "because it isn't designed to switch between encodings during the process of reading", feel free to read over another answer I gave about the inefficiencies of using a StreamReader here: Reading an mbox file in C#

What you need to do is buffer your reads (a 4k buffer should be fine). Then, as you already have to do anyway, scan for the '\n' byte to extract content on a line-by-line basis, combining header lines that were folded.
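As a rough illustration of the buffering and line scanning described above, here is a minimal sketch. The class and method names (`RawLineReader`, `ReadLines`, `UnfoldHeaders`) are made up for this example and are not taken from MimeKit or any other library. It keeps everything as raw bytes so that no charset decision has to be made while splitting lines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only.
static class RawLineReader
{
    // Reads the stream in 4 KB blocks and yields each line as raw bytes,
    // without the trailing CRLF (or bare LF). No character decoding happens here.
    public static IEnumerable<byte[]> ReadLines(Stream stream)
    {
        var buffer = new byte[4096];
        var line = new MemoryStream();
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] != (byte)'\n')
                {
                    line.WriteByte(buffer[i]);
                    continue;
                }
                byte[] bytes = line.ToArray();
                int length = bytes.Length;
                if (length > 0 && bytes[length - 1] == (byte)'\r')
                    length--; // drop the CR of a CRLF pair
                Array.Resize(ref bytes, length);
                yield return bytes;
                line.SetLength(0); // start collecting the next line
            }
        }
        if (line.Length > 0)
            yield return line.ToArray(); // final line without a terminator
    }

    // Joins folded header lines: a line that starts with a space or tab
    // continues the previous header. Stops at the blank line ending the headers.
    public static List<byte[]> UnfoldHeaders(IEnumerable<byte[]> lines)
    {
        var headers = new List<byte[]>();
        foreach (byte[] l in lines)
        {
            if (l.Length == 0)
                break;
            bool continuation = l[0] == (byte)' ' || l[0] == (byte)'\t';
            if (continuation && headers.Count > 0)
            {
                byte[] previous = headers[headers.Count - 1];
                byte[] joined = new byte[previous.Length + l.Length];
                previous.CopyTo(joined, 0);
                l.CopyTo(joined, previous.Length);
                headers[headers.Count - 1] = joined;
            }
            else
            {
                headers.Add(l);
            }
        }
        return headers;
    }
}
```

On a live POP3 connection the caller would also need to stop enumerating at the terminating "." line, since the server keeps the socket open after the message ends; that part is left out of the sketch.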

Each header may have multiple encoded-word tokens, each of which may be in a different charset. That assumes they are properly encoded; otherwise you'll have to deal with undeclared 8-bit data and try to massage that into unicode somehow (probably by having a set of fallback charsets). I'd recommend trying UTF-8 first, followed by a selection of charsets that the user of your library has provided, before finally trying iso-8859-1. Make sure not to try iso-8859-1 until you've tried everything else, because any sequence of 8-bit text will convert to unicode without error using the iso-8859-1 character encoding, whether or not the result is the right text.
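The fallback order suggested above could be sketched roughly as follows. `FallbackDecoder` and its `Decode` method are hypothetical names for this example; the only framework calls used are `Encoding.GetEncoding` with exception fallbacks, so that invalid bytes raise an error instead of silently becoming replacement characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only.
static class FallbackDecoder
{
    // Tries strict UTF-8 first, then each caller-supplied charset, and only
    // then iso-8859-1, which accepts every byte sequence and so must come last.
    public static string Decode(byte[] raw, IEnumerable<string> userCharsets)
    {
        var candidates = new List<string> { "utf-8" };
        candidates.AddRange(userCharsets);

        foreach (string name in candidates)
        {
            try
            {
                // ExceptionFallback makes invalid input throw instead of
                // being replaced with '?' or U+FFFD.
                Encoding strict = Encoding.GetEncoding(
                    name,
                    EncoderFallback.ExceptionFallback,
                    DecoderFallback.ExceptionFallback);
                return strict.GetString(raw);
            }
            catch (DecoderFallbackException)
            {
                // The bytes are not valid in this charset; try the next one.
            }
            catch (ArgumentException)
            {
                // The charset name is unknown on this platform; try the next one.
            }
        }
        return Encoding.GetEncoding("iso-8859-1").GetString(raw);
    }
}
```

Note that a single-byte charset in the user-supplied list will usually "succeed" on any input as well, so the order of that list matters just as much as keeping iso-8859-1 at the end.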

When you get to the text content of the message, you'll want to check the Content-Type header for a charset parameter. If no charset parameter is defined, it should be US-ASCII, but in practice it could be anything. Even if the charset is defined, it might not match the actual character encoding used in the text body of the message, so once again you'll probably want to have a set of fallbacks.
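For the charset lookup itself, a simplified sketch might look like this. `ContentTypeCharset.GetCharset` is a made-up name, and the parsing is deliberately naive: it assumes parameters are separated by plain semicolons with no whitespace around the equals sign, and it ignores RFC 2231 parameter encoding and semicolons inside quoted strings.

```csharp
using System;

// Hypothetical helper for illustration only.
static class ContentTypeCharset
{
    // Returns the charset parameter of a Content-Type header value,
    // or "us-ascii" when none is declared.
    public static string GetCharset(string contentType)
    {
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            // Parameter names are case-insensitive, so compare accordingly.
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                return p.Substring("charset=".Length).Trim().Trim('"');
        }
        return "us-ascii";
    }
}
```

The returned name is only what the message claims; as noted above, the body may still need the fallback treatment if decoding with that charset fails.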

As you've probably guessed by this point, this is very clearly not a trivial task, as it requires the parser to do on-the-fly character conversion as it goes (and the character conversion requires internal parser state about what the expected charset is at any given time).

Since I've already done the work, you should really consider using MimeKit, which will parse the email and properly do charset conversion on the headers and the content using the appropriate charset encoding.

I've also written a Pop3Client class that is included in my MailKit library.

If your goal is to learn and write your own library, I'd still highly recommend reading over my code because it is highly efficient and does things in a proper way.

One way to detect the encoding is to look at the Byte Order Mark (BOM), which is the first few bytes of the stream. If present, it will tell you the encoding. However, the stream might not have a BOM, and in that case it could be ASCII, UTF without a BOM, or something else.
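A BOM check along those lines can be sketched as below. `BomSniffer.Detect` is a hypothetical name; the sketch only recognizes the UTF-8 and UTF-16 marks and does not try to distinguish UTF-32, whose little-endian BOM starts with the same two bytes as UTF-16.

```csharp
using System.Text;

// Hypothetical helper for illustration only.
static class BomSniffer
{
    // Returns the encoding indicated by a leading BOM, or null when there is none.
    public static Encoding Detect(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16, little endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16, big endian
        return null;                          // no BOM: the caller has to guess
    }
}
```

A null result is the common case for data without a BOM, so the caller still needs some other way to pick an encoding.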

You can convert your data from one encoding to another with the Encoding class:

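// Replace the name below with the encoding you detected; "iso-8859-1" is only a placeholder.
Encoding textEncoding = Encoding.GetEncoding("iso-8859-1");
// StreamReader has no GetBuffer(); rawBytes stands for the bytes read from the underlying stream.
byte[] converted = Encoding.UTF8.GetBytes(textEncoding.GetString(rawBytes));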

You may select your preferred encoding when converting.

Hope it answers your question.

Edit: You may use this code to read your stream in blocks.

MemoryStream st = new MemoryStream();
byte[] buffer = new byte[1024]; // reuse one buffer for every read
int reads;
// Read returns 0 once the stream has ended
while ((reads = yourStream.Read(buffer, 0, buffer.Length)) > 0)
{
    // copy only the bytes that were actually read
    st.Write(buffer, 0, reads);
}
