Mime 7bit encoding and UnsupportedEncodingException

I have implemented an approach, but I am not sure whether it is the correct one or whether it could cause me problems in the future.
Given this piece of email:

Date: Mon, 17 Sep 2012 04:14:36 +0200   
Content-Type: text/plain;
    charset="utf-7"   
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000
To: user@address.com

Dear Sir/madam, ... etc

And this piece of code:

MimePart part; // the email
if (part.isMimeType("text/plain")) {
    String plainContent = part.getContent().toString();
}

The exception was:

java.io.UnsupportedEncodingException: utf-7

I have made this modification, so that the charset is always utf-8 and the encoding is quoted-printable:

part.setHeader("Content-Transfer-Encoding", "quoted-printable");
part.setHeader("Content-Type", "text/plain; charset=utf-8");

The exception is gone and plainContent is correct. But it seems too easy a solution... What problems could I run into in the future? Is there a better way to avoid the exception and get the email content without forcing a charset and encoding?

If somebody really sends UTF-7, you will cause the client to decode it incorrectly. But it's quite rare; most sites send UTF-8 if they use Unicode at all. The sample content you posted is pure ASCII, so it is valid as both UTF-7 and UTF-8. (UTF-7 assigns special semantics to + and -, so for a message which contains sequences of these characters, even ASCII is not safe. That is, UTF-7 incorrectly labeled as US-ASCII, or vice versa, will decode incorrectly.)

Assigning Quoted-Printable to stuff which really isn't is similarly haphazard; any equals sign in the message has special meaning in QP. I think you should just leave that header alone.

The proper solution is to really recode the message body, i.e. translate it from UTF-7 to UTF-8 (and possibly wrap it in quoted-printable), then assign the correct Content-Type header; or, convince whatever is sending these messages to stick to plain old US-ASCII or switch to UTF-8. (Or, find out how to teach Java to handle the UTF-7 encoding; but that's outside my competence.)
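Before deciding between these options, it can help to ask the JVM whether it knows the declared charset at all. The following is only a sketch: the class and method names are made up for illustration, and it assumes the charset name has already been extracted from the Content-Type header. It relies solely on the standard `java.nio.charset.Charset` API.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper: reports whether this JVM can decode a given charset name.
final class CharsetProbe {
    static boolean canDecode(String charsetName) {
        try {
            // True only if a charset provider for this name is installed.
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            // The name itself is malformed, so it certainly cannot be decoded.
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"utf-8", "us-ascii", "utf-7"}) {
            System.out.println(name + " supported: " + canDecode(name));
        }
    }
}
```

If the name is reported as unsupported, the body needs separate handling (recoding or a custom decoder) rather than a relabeled header.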

See also http://en.wikipedia.org/wiki/UTF-7


Basic RFC822 email was purely 7-bit. In order to enable rich content and different character sets, MIME was developed in the early 1990s. Central to your question are two MIME headers, Content-Type: and Content-Transfer-Encoding:. These are both used to describe a MIME part, but they are distinct concepts. The Content-Type: describes what the data is (text/html, audio/midi, application/octet-stream for untyped binary data, etc). The Content-Transfer-Encoding: indicates how it has been encoded for transmission over email (or another MIME conduit).

Content-Transfer-Encoding: basically defines two encodings and three unencoded types. CTE: 7bit indicates that the data, by itself, is suitable for transmission over a 7-bit channel (there is also a line length restriction); 8bit data is not, and will need to be re-encoded if the channel cannot accommodate 8-bit data. Similarly, binary is also 8-bit but in addition has no guarantee on line length (i.e. it may contain lines longer than approx 1,000 characters). So to transmit binary or 8bit data across a 7-bit channel, you need to recode the content as base64 or quoted-printable. Both of these encodings substitute 8-bit characters with 7-bit sequences; the recipient is expected to perform the reverse substitution in order to decode and extract the data.
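As a rough illustration of that round trip, here is a sketch using the JDK's `java.util.Base64` MIME encoder, which also wraps its output into short lines. The class and method names are invented for this example; quoted-printable is not shown because the standard library has no codec for it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Illustrative only: 8-bit data made safe for a 7-bit channel and recovered again.
final class TransferEncodingSketch {
    // True if no byte has its high bit set, i.e. the data fits a 7-bit channel.
    static boolean isSevenBit(byte[] data) {
        for (byte b : data) {
            if (b < 0) return false; // Java bytes are signed; negative means >= 0x80
        }
        return true;
    }

    // Sender side: recode arbitrary bytes as base64 with MIME line wrapping.
    static byte[] encodeForSevenBitChannel(byte[] raw) {
        return Base64.getMimeEncoder().encode(raw);
    }

    // Recipient side: reverse the substitution to recover the original bytes.
    static byte[] decodeFromSevenBitChannel(byte[] encoded) {
        return Base64.getMimeDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8); // contains 8-bit bytes
        byte[] encoded = encodeForSevenBitChannel(raw);
        System.out.println("raw is 7-bit:     " + isSevenBit(raw));
        System.out.println("encoded is 7-bit: " + isSevenBit(encoded));
        System.out.println("round trip ok:    "
                + Arrays.equals(raw, decodeFromSevenBitChannel(encoded)));
    }
}
```

Note that this step only concerns the transfer encoding; it says nothing about which character set the recovered bytes are in.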

Once the extraction happens, the data is basically ready for use at the recipient end. However, for text types, there is also the matter of character set encoding. Many character sets are simply 7-bit or 8-bit, and so a byte in the stream corresponds to a character. But multibyte character sets do not behave like this, and so they, too, need to be encoded somehow. This is distinct from the MIME 7bit/8bit thing described above. A character encoding tells you how the byte stream encodes multi-byte characters.

UTF-8 encodes a multibyte character as a sequence of 8-bit bytes (while, conveniently, 7-bit characters are identical to the US-ASCII 7-bit encoding). The encoding has some nice properties which you can read about in Wikipedia.
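A small sketch of that property, using only `String.getBytes` from the JDK; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: how many bytes UTF-8 needs, and when it matches US-ASCII.
final class Utf8BytesSketch {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if encoding as UTF-8 and as US-ASCII yields the same bytes.
    static boolean sameAsAscii(String s) {
        return Arrays.equals(
                s.getBytes(StandardCharsets.UTF_8),
                s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(sameAsAscii("Dear Sir/madam")); // pure ASCII text
        System.out.println(sameAsAscii("caf\u00e9"));      // contains e-acute
        System.out.println(utf8Length("\u00e9"));          // e-acute alone
        System.out.println(utf8Length("\u263A"));          // smiley alone
    }
}
```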

UTF-7 was never formally accepted as an official Unicode encoding, and is not in widespread use. It is not entirely compatible with US-ASCII, because the + and - characters are used to encode multibyte character sequences.

If you wish to decode UTF-7 and your language does not support the encoding, you will have to write your own decoder. The alternative is not to decode it, and to leave that to the downstream consumer. Take care to somehow relay the character encoding to the downstream in this case. However, because UTF-7 is not widely supported, I would recommend recoding to UTF-8, which is widely supported and understood (and also, as mentioned, transparently compatible with US-ASCII if no multibyte characters are present).
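For what it's worth, a hand-written decoder does not have to be large. What follows is a minimal sketch, not a complete or validated implementation: it assumes the body has already been read as ASCII text, follows the basic rules of RFC 2152 (a + opens a run of unpadded base64 carrying UTF-16 code units, an optional - closes it, and +- stands for a literal plus), and silently ignores malformed input. The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Minimal UTF-7 decoder sketch; no error reporting for malformed sequences.
final class Utf7Sketch {
    private static final String B64 =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static String decode(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i++);
            if (c != '+') {            // directly encoded ASCII character
                out.append(c);
                continue;
            }
            if (i < in.length() && in.charAt(i) == '-') {
                out.append('+');       // "+-" is the escape for a literal plus sign
                i++;
                continue;
            }
            int bits = 0;              // bit buffer holding pending base64 payload
            int nbits = 0;             // number of valid bits in the buffer
            while (i < in.length()) {
                int v = B64.indexOf(in.charAt(i));
                if (v < 0) break;      // first non-base64 character ends the run
                i++;
                bits = (bits << 6) | v;
                nbits += 6;
                if (nbits >= 16) {     // a full UTF-16 code unit is available
                    nbits -= 16;
                    out.append((char) ((bits >> nbits) & 0xFFFF));
                    bits &= (1 << nbits) - 1;
                }
            }
            if (i < in.length() && in.charAt(i) == '-') {
                i++;                   // an explicit '-' terminator is absorbed
            }
        }
        return out.toString();
    }

    // Recode a UTF-7 body into UTF-8 bytes, ready to be relabeled as charset=utf-8.
    static byte[] recodeToUtf8(String utf7Body) {
        return decode(utf7Body).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("Hi Mom -+Jjo--!"));
        System.out.println(decode("1 +- 1 = 2"));
        System.out.println(recodeToUtf8("A+ImIDkQ.").length + " bytes as UTF-8");
    }
}
```

Only after recoding the body like this would it be accurate to set the Content-Type header to charset=utf-8.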

So, just to summarize: if you change the headers, you also have to change the encoding. If you are lucky (and your example is representative), the text doesn't contain any actual encoded UTF-7 multibyte characters, in which case you can safely relabel it as US-ASCII. If it does contain + characters, they introduce UTF-7 sequences (optionally closed by -) which need to be decoded (though again, you could be lucky, and the sequences are just the UTF-7 escape, +-, which encodes a literal plus sign).
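That "lucky" case can be tested for mechanically. The sketch below is an assumption-laden shortcut rather than a full validator: it treats a body as safe to relabel as US-ASCII only if it contains no + at all (so no shifted sequence can start) and no characters outside the 7-bit range. The class and method names are hypothetical.

```java
// Illustrative check: can a body labeled utf-7 be relabeled as us-ascii unchanged?
final class Utf7RelabelCheck {
    static boolean safeToRelabelAsAscii(String body) {
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            // '+' would open a UTF-7 shifted sequence; anything above 0x7F is not ASCII.
            if (c == '+' || c > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(safeToRelabelAsAscii("Dear Sir/madam, ... etc"));
        System.out.println(safeToRelabelAsAscii("1 +- 1 = 2"));
    }
}
```

A body that fails this check is not necessarily broken; it just needs real UTF-7 decoding before the headers are changed.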
