Parsing Email Body with 7BIT Content-Transfer-Encoding - PHP

I've been implementing some PHP/IMAP-based email handling functionality lately, and have almost everything working well, except for message body decoding in some circumstances.

I think that, by now, I've half-memorized RFC 2822 (the 'Internet Message Format' document guidelines), read through the email-handling code of half a dozen open source CMSes, and read a bajillion forum posts, blog posts, etc. dealing with handling email in PHP.

I've also forked and completely rewritten a class for PHP, Imap, and the class handles email respectably well. I have some helpful methods in there to detect autoresponders (out of office, old addresses, etc.) and to decode base64 and 8bit messages.

However, the one thing I simply can't get to work reliably (or, sometimes, at all) is decoding a message that comes in with Content-Transfer-Encoding: 7bit.

It seems that different email clients/services interpret 7BIT to mean different things. I've gotten some emails that are supposedly 7BIT but are actually Base64-encoded. I've gotten some that are actually quoted-printable-encoded, and some that are not encoded in any way whatsoever. Some are HTML without being marked as HTML, and they're also listed as 7BIT.

Here are a few examples (snippets) of message bodies received with 7Bit encodings:

1:

A random message=20

Sent from my iPhone

2:

PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwi
IHhtbG5zOm89InVybjpzY2hlbWFzLW1pY3Jvc29mdC1jb206b2ZmaWNlOm9m

3:

tangerine apricot pepper.=0A=C2=A0=0ALet me know if you have any availabili=
ty over the next month or so. =0A=C2=A0=0AThank you,=0ANames Withheld=0A908=
-319-5916=0A=C2=A0=0A=C2=A0=0A=C2=A0=0A=0A=0A______________________________=
__=0AFrom: Names Witheld =0ATo: Names Withheld=

These are all sent with '7Bit' encodings (well, at least according to PHP/imap_*), but they obviously need more decoding before I can pass them along as plaintext. Is there any way to reliably convert all messages with supposedly-7Bit encodings to plaintext?

After spending a bit more time, I decided to just write up some heuristic detection, as Max suggested in the comments on my original question.

I've built a more robust decode7Bit() method in Imap.php, which goes through a bunch of common encoded characters (like =A0) and replaces them with their UTF-8 equivalents, and also decodes messages if they look like they are base64-encoded:

/**
 * Decodes 7-Bit text.
 *
 * PHP seems to think that most emails are 7BIT-encoded, therefore this
 * decoding method assumes that text passed through may actually be base64-
 * encoded, quoted-printable encoded, or just plain text. Instead of passing
 * the email directly through a particular decoding function, this method
 * runs through a bunch of common encoding schemes to try to decode everything
 * and simply end up with something *resembling* plain text.
 *
 * Results are not guaranteed, but it's pretty good at what it does.
 *
 * @param $text (string)
 *   7-Bit text to convert.
 *
 * @return (string)
 *   Decoded text.
 */
public function decode7Bit($text) {
  // If there are no spaces on the first line, assume that the body is
  // actually base64-encoded, and decode it.
  $lines = explode("\r\n", $text);
  $first_line_words = explode(' ', $lines[0]);
  if ($first_line_words[0] == $lines[0]) {
    $text = base64_decode($text);
  }

  // Manually convert common encoded characters into their UTF-8 equivalents.
  // Multi-byte sequences come first: the pairs are applied in array order,
  // so '=A0' would otherwise break up '=C2=A0' before it could match.
  $characters = array(
    "=\r\n" => '', // joined line.
    '=E2=80=99' => "'", // single quote.
    '=E2=80=A6' => '…', // ellipsis.
    '=E2=80=A2' => '•', // bullet.
    '=C2=A0' => ' ', // non-breaking space.
    '=A0' => ' ', // non-breaking space.
    '=20' => ' ', // space.
    '=0A' => "\r\n", // line break.
  );

  // Loop through the encoded characters and replace any that are found.
  foreach ($characters as $key => $value) {
    $text = str_replace($key, $value, $text);
  }

  return $text;
}

This was taken from version 1.0-beta2 of the Imap class for PHP that I have on GitHub.

If you have any ideas for making this more efficient, let me know. I originally tried running everything through quoted_printable_decode(), but sometimes PHP would throw exceptions that were vague and unhelpful, so I gave up on that approach.
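For comparison, here is a minimal sketch of the same heuristic idea built on PHP's own decoders instead of a replacement table. The function name decodeMislabelled7Bit() and the 20-character minimum for the base64 check are my own assumptions, not part of the Imap class, and a short plain-text body with no spaces could still be misread as base64.

```php
<?php
/**
 * Guess the real transfer encoding of a body labelled 7bit and decode it.
 * Hypothetical helper for illustration; not part of the Imap class.
 */
function decodeMislabelled7Bit(string $text): string
{
    // Base64 candidate: no spaces, only base64 characters once line breaks
    // are removed, a length that is a multiple of 4, and at least 20
    // characters (an arbitrary threshold to reduce false positives).
    $compact = preg_replace('/\s+/', '', $text);
    if (strlen($compact) >= 20
        && strlen($compact) % 4 === 0
        && strpos(trim($text), ' ') === false
        && preg_match('/^[A-Za-z0-9+\/]+={0,2}$/', $compact)
    ) {
        $decoded = base64_decode($compact, true);
        if ($decoded !== false) {
            return $decoded;
        }
    }

    // Quoted-printable candidate: contains =XX escapes or soft line breaks.
    if (preg_match('/=[0-9A-Fa-f]{2}|=\r?\n/', $text)) {
        return quoted_printable_decode($text);
    }

    // Nothing looked encoded, so return the text untouched.
    return $text;
}
```

The order matters here: base64 is checked first because a base64 body can end in "=" padding, which the looser quoted-printable pattern could otherwise pick up.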

I know this is an old question, but I am running into this issue now, and it seems that PHP has a solution.

The imap_fetchstructure() function returns an object whose encoding property gives you the transfer encoding as a number:

0   7BIT
1   8BIT
2   BINARY
3   BASE64
4   QUOTED-PRINTABLE
5   OTHER
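As a small aid for logging or debugging, the table above can be written as a lookup. This is only a sketch: the constant TRANSFER_ENCODINGS and the function transferEncodingName() are names I made up. The imap extension also defines its own constants for these values (ENC7BIT, ENC8BIT, ENCBINARY, ENCBASE64, ENCQUOTEDPRINTABLE, ENCOTHER), but the sketch avoids them so that it runs without the extension loaded.

```php
<?php
// Encoding numbers from the table above, keyed by the value of ->encoding.
const TRANSFER_ENCODINGS = [
    0 => '7BIT',
    1 => '8BIT',
    2 => 'BINARY',
    3 => 'BASE64',
    4 => 'QUOTED-PRINTABLE',
    5 => 'OTHER',
];

/**
 * Return a readable name for an encoding number (hypothetical helper).
 * Unknown numbers are reported as OTHER.
 */
function transferEncodingName(int $encoding): string
{
    return TRANSFER_ENCODINGS[$encoding] ?? 'OTHER';
}
```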

From there you should be able to create a function like this to decode the message:

function _encodeMessage($msg, $type) {
  if ($type == 0) {
    // 7BIT: no transfer decoding needed, only normalize the charset.
    return mb_convert_encoding($msg, "UTF-8", "auto");
  } elseif ($type == 3) {
    // BASE64.
    return imap_base64($msg);
  } elseif ($type == 4) {
    // QUOTED-PRINTABLE.
    return imap_qprint($msg);
    //return quoted_printable_decode($msg);
  } else {
    // 8BIT (1), BINARY (2), and OTHER (5) have no transfer encoding to undo.
    // imap_8bit() and imap_binary() encode rather than decode, so they are
    // not used here.
    return $msg;
  }
}

You can call this function like so:

$struct = imap_fetchstructure($conn, $messageNumber, 0);
$message = imap_fetchbody($conn, $messageNumber, 1);
$message = _encodeMessage($message, $struct->encoding);
echo $message;

I hope this helps someone :)

After $structure = imap_fetchstructure(), do not use $encoding = $structure->encoding. Use $encoding = $structure->parts[$p]->encoding instead.

I think I had the same problem, and now it's solved (7bit text didn't convert to UTF-8; I kept getting ASCII). I thought I had 7bit, but after changing the code to read the encoding from $structure->parts[$p], I got $encoding = 4, not $encoding = 0, which means that I have to call imap_qprint($body) and then mb_convert_encoding($body, 'UTF-8', $charset) to get what I wanted.

Anyway, check the encoding number! (In my case it should have been 4, not 0.)
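Roughly, the per-part approach could look like the sketch below. The function decodePart() is a name I made up, and $part stands in for one entry of $structure->parts as returned by imap_fetchstructure(). To keep the sketch runnable without the imap extension, it uses base64_decode() and quoted_printable_decode() in place of imap_base64() and imap_qprint(), and it assumes the charset is given in the part's parameters list.

```php
<?php
/**
 * Decode one body part using the encoding and charset found on that part,
 * not on the top-level structure. Hypothetical helper for illustration.
 */
function decodePart(string $body, object $part): string
{
    // Undo the transfer encoding declared on this part (3 = BASE64,
    // 4 = QUOTED-PRINTABLE); other values need no transfer decoding.
    switch ($part->encoding ?? 0) {
        case 3:
            $body = base64_decode($body);
            break;
        case 4:
            $body = quoted_printable_decode($body);
            break;
    }

    // Look for a charset parameter on the part; assume UTF-8 if absent.
    $charset = 'UTF-8';
    foreach ($part->parameters ?? [] as $param) {
        if (strtolower($param->attribute) === 'charset') {
            $charset = strtoupper($param->value);
        }
    }

    // Convert to UTF-8 only when needed and when mbstring is available.
    if ($charset !== 'UTF-8' && function_exists('mb_convert_encoding')) {
        $body = mb_convert_encoding($body, 'UTF-8', $charset);
    }

    return $body;
}
```

With a real connection, $part would be $structure->parts[$p] and $body the result of imap_fetchbody() for the matching section, which for a simple multipart message is usually $p + 1.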
