How do I extract e-mail attachments from raw e-mail without IMAP functions?

Question

The title pretty much says it all, but I'll try to flesh out the issue a bit.

A PHP application of mine needs to read e-mails from a socket (this was a requirement) and then use some of them (those carrying an API token) as articles in the application (it's a CMS).

I've got the reading part more or less working, but now I'm stuck on parsing the messages. Concretely, an e-mail I receive will look like this 99% of the time:

MIME-Version: 1.0\r\n
Received: by {ip_number} with {protocol}; {iso_date}\r\n
Date: {iso_date}\r\n
Delivered-To: {destination}\r\n
Message-ID: {sample_message_id}\r\n
Subject: {subject}\r\n
From: {sender}\r\n
To: {destination}\r\n
Content-Type: multipart/mixed; boundary={sample_boundary}\r\n
\r\n
--{sample_boundary}\r\n
Content-Type: multipart/alternative; boundary={sample_boundary_2}\r\n
\r\n
--{sample_boundary_2}\r\n
Content-Type: text/plain; charset={charset}\r\n
\r\n
{file_content}\r\n
--\r\n
{signature}\r\n
\r\n
--{sample_boundary_2}\r\n
Content-Type: text/html; charset={charset}\r\n
\r\n
{content_html}\r\n
{signature_html}\r\n
--{sample_boundary_2}--\r\n
--{sample_boundary}\r\n
Content-Type: image/jpeg; name="{file_name}"\r\n
Content-Disposition: attachment; filename="{file_name}"\r\n
Content-Transfer-Encoding: base64\r\n
X-Attachment-Id: {sample_attachment_id}\r\n
\r\n
{base64_file_contents}\r\n
--{sample_boundary}--\r\n

I've been trying to pull the parts out with regular expressions, but I simply haven't been able to. E-mail lines are supposed to end in \r\n, but some messages arrive with bare \n, and that combined with the nesting is too much for me to handle.
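One way to take the line-ending problem off the table is to normalise the raw message before doing anything else. The following is only a sketch: normalizeLineEndings() and splitHeadersAndBody() are hypothetical helper names, not part of any library, and the sample message is made up.

```php
<?php
// Hypothetical helper: convert every line ending (CRLF, lone CR, lone LF)
// to CRLF so later code only has to deal with one convention.
function normalizeLineEndings(string $raw): string
{
    return preg_replace('/\r\n|\r|\n/', "\r\n", $raw);
}

// Hypothetical helper: split a message (or a single MIME part) into its
// header block and its body at the first empty line.
function splitHeadersAndBody(string $raw): array
{
    $parts = explode("\r\n\r\n", normalizeLineEndings($raw), 2);
    return array($parts[0], isset($parts[1]) ? $parts[1] : '');
}

// Made-up input that uses bare LF line endings.
list($headers, $body) = splitHeadersAndBody("Subject: hi\nTo: a@example.com\n\nHello\n");
```

After this step every header line and every boundary line ends the same way, so the patterns that follow do not need to care which convention the sender used.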

There's a library on PHPClasses that splits e-mails into MIME parts (along with a bunch of other things), written by some Manuel Lemos guy who clearly knew what he was doing, since it's really efficient and returns nicely formatted, parsed output. But it doesn't cut it for me.

The library itself consists of 2500+ lines of code I can't make any sense of. It is written in three different camelCase styles with assorted indentation, and it mixes several forms of if (if(): and if() and if(){}) and of loops (for(;;), for(){} and for():), which does not make it any simpler.

Could anyone please give me a hand here?

Thank you very much!

-- Edited to add

Following Sjoern's advice I started building a solution to my own question (thanks!!). I'm still open to more suggestions, though; surely there are better ways of doing it.

class MimePartsParser{  
  protected function hasContentType($string){
    return strtolower(trim(substr($string,0,14))) == 'content-type';
  }
  protected function hasTransferEncoding($string){
     return strpos($string, 'Content-Transfer-Encoding')!==false;
  }
  protected function getBoundary($from){
    // The quotes around the boundary value are optional, so accept both forms.
    preg_match('/boundary="?(?P<boundary>[^";\r\n]+)"?/i', $from, $matches);
    if(isset($matches['boundary']) AND $matches['boundary'] !== ''){
      return $matches['boundary'];
    }
    return null;
  }
  protected function cleanMimePart($msg){
    $msg = trim($msg);
    return trim(substr(trim($msg),0,strlen(trim($msg))-3));
  }
  protected function parseMessage($msg){
    $parts = array(); 
    if($boundary = $this->getBoundary($msg)){
      $msgs = explode($boundary, $msg); 
      foreach($msgs as $msg){
        if($msg = $this->parseMessage($msg)){
          $parts []= $msg;
        }
      }
    }
    else{
      if($this->hasContentType($msg) AND $this->hasTransferEncoding($msg)){
        $parts []= $this->cleanMimePart($msg);
      }
    }
    return $parts;  
  }
  protected function flattenArray($array){
    $flat = array();
    foreach(new RecursiveIteratorIterator(new RecursiveArrayIterator($array)) as $key => $item){
      $flat []= $item;
    }
    return $flat;
  }
  public function parse($string){
    return $this->flattenArray($this->parseMessage($string));
  }
}
/*Usage example*/
$mimeParser = new MimePartsParser;
var_dump($mimeParser->parse(file_get_contents('sample.txt')));
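The parser above returns each part as a raw string, so the content still has to be decoded according to its Content-Transfer-Encoding header. A minimal sketch of that step follows; decodeMimePart() is a hypothetical name, and the code assumes the part's headers are not folded across several lines.

```php
<?php
// Hypothetical helper: turn one raw MIME part (headers, blank line, body)
// into its decoded content.
function decodeMimePart(string $part): string
{
    // Separate headers from body at the first blank line (LF or CRLF).
    $pieces = preg_split('/\r?\n\r?\n/', ltrim($part), 2);
    $headers = $pieces[0];
    $body = isset($pieces[1]) ? $pieces[1] : '';

    // Read the declared encoding; 7bit is the default when none is given.
    $encoding = '7bit';
    if (preg_match('/^Content-Transfer-Encoding:\s*(\S+)/mi', $headers, $m)) {
        $encoding = strtolower($m[1]);
    }

    switch ($encoding) {
        case 'base64':
            // Non-strict mode skips the line breaks inside the encoded data.
            return base64_decode($body);
        case 'quoted-printable':
            return quoted_printable_decode($body);
        default:
            return $body; // 7bit, 8bit and binary need no decoding
    }
}

$part = "Content-Type: text/plain\r\nContent-Transfer-Encoding: base64\r\n\r\naGVsbG8=\r\n";
echo decodeMimePart($part); // prints "hello"
```

For an attachment, the returned string is the file content and can be written to disk as-is.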

Answer 1

Write a function that parses a message and calls itself recursively.

First, parse the whole message. If you encounter this:

Content-Type: multipart/mixed; boundary={sample_boundary}

Split the message on {sample_boundary}. Then parse each submessage.

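function parseMessage($message) {
    // Look for a boundary parameter; the quotes around the value are optional.
    if (!preg_match('/boundary="?([^";\r\n]+)"?/i', $message, $m)) {
        // No boundary: this is a leaf part, so return it as-is.
        return $message;
    }
    $result = array();
    // Split on the boundary and parse each submessage recursively.
    foreach (explode($m[1], $message) as $part) {
        $result[] = parseMessage($part);
    }
    return $result;
}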
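The same recursive idea can be made a little stricter by splitting only on complete delimiter lines ("--boundary" and the closing "--boundary--") and by returning headers and body separately. This is a sketch under assumptions, not a full MIME parser: splitMimeParts() is a hypothetical name, headers are assumed to be unfolded, every multipart section is assumed to have its closing delimiter, and the sample message is made up.

```php
<?php
// Hypothetical helper: return a flat list of leaf parts, each as
// array('headers' => ..., 'body' => ...).
function splitMimeParts(string $message): array
{
    // Split headers from body at the first blank line (LF or CRLF).
    $pieces = preg_split('/\r?\n\r?\n/', ltrim($message), 2);
    $headers = $pieces[0];
    $body = isset($pieces[1]) ? $pieces[1] : '';

    // Leaf part: no boundary declared in this part's own headers.
    if (!preg_match('/boundary="?([^";\r\n]+)"?/i', $headers, $m)) {
        return array(array('headers' => $headers, 'body' => $body));
    }

    // A delimiter is a whole line: "--boundary", or "--boundary--" to close.
    $delimiter = '/^--' . preg_quote($m[1], '/') . '(?:--)?[ \t]*(?:\r?\n|\z)/m';
    $sections = preg_split($delimiter, $body);

    // Drop the preamble before the first delimiter and the epilogue after the last.
    $sections = array_slice($sections, 1, -1);

    $leaves = array();
    foreach ($sections as $section) {
        // Recurse so nested multipart sections are flattened into the result.
        $leaves = array_merge($leaves, splitMimeParts($section));
    }
    return $leaves;
}

// Made-up message with the same nesting as the sample in the question.
$raw = "Content-Type: multipart/mixed; boundary=outer\r\n\r\n"
     . "--outer\r\nContent-Type: multipart/alternative; boundary=inner\r\n\r\n"
     . "--inner\r\nContent-Type: text/plain\r\n\r\nhi\r\n"
     . "--inner\r\nContent-Type: text/html\r\n\r\n<p>hi</p>\r\n"
     . "--inner--\r\n"
     . "--outer\r\nContent-Type: image/jpeg\r\nContent-Transfer-Encoding: base64\r\n\r\nAAAA\r\n"
     . "--outer--\r\n";
$leaves = splitMimeParts($raw);
```

Matching whole delimiter lines avoids the leftover "--" fragments that a plain explode() on the boundary string leaves at the edges of each part.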

Answer 2

I know this question is old, but I just had to do this for attached PDFs without IMAP and without PEAR (darn cheap hosts).

This bit of code takes a raw email message (in $email) and goes through it looking for attachments. If it finds one, it extracts it, decodes it and saves it. I would add a check to make sure the attachment is the type you want, e.g. 'pdf'.

It works for base64 PDF attachments sent from Gmail; I haven't tested it with anything else. Edit: now tested and works with emails originating from Yahoo.

(Sorry, the lines are kind of long because I didn't move everything into variables.)

It uses the mailparse extension: http://php.net/manual/en/ref.mailparse.php

// Takes a raw message in $email, finds each attachment part, cuts it out, decodes it and saves it.
$mailparse = mailparse_msg_create();
mailparse_msg_parse($mailparse, $email);
$structure = mailparse_msg_get_structure($mailparse);

foreach ($structure as $structurepart) {
    // Fetch the part data once instead of looking it up again for every field.
    $partdata = mailparse_msg_get_part_data(mailparse_msg_get_part($mailparse, $structurepart));

    // To accept only PDFs, also require $partdata['content-type'] === 'application/pdf'.
    if (isset($partdata['content-disposition']) && $partdata['content-disposition'] === 'attachment') {
        $startingposition = $partdata['starting-pos-body'];
        $length = $partdata['ending-pos-body'] - $startingposition;
        // basename() keeps a crafted filename from pointing outside the target folder.
        $filenameasreceived = basename($partdata['disposition-filename']);

        // Cut the encoded body out of the raw message and decode it (base64 is assumed).
        $mime_pdf = substr($email, $startingposition, $length);
        $mime_pdf = base64_decode($mime_pdf);

        /* Saves the data into a file */
        $fdw = fopen('/home/[userfolder]/public_html/' . $filenameasreceived, "w+");
        fwrite($fdw, $mime_pdf);
        fclose($fdw);
        echo "<br>file saved.";
    }
}
mailparse_msg_free($mailparse);
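The type check suggested above can be kept out of the loop in a small helper. This is a sketch only: isWantedAttachment() is a hypothetical name, and $partdata is assumed to have the shape returned by mailparse_msg_get_part_data() (here it is just a plain array, so the helper can be tried without the extension).

```php
<?php
// Hypothetical helper: true when the part is an attachment whose content type
// is in the allow-list. Missing keys are treated as "not an attachment".
function isWantedAttachment(array $partdata, array $allowedTypes): bool
{
    $disposition = isset($partdata['content-disposition'])
        ? strtolower($partdata['content-disposition'])
        : '';
    $type = isset($partdata['content-type'])
        ? strtolower($partdata['content-type'])
        : '';

    return $disposition === 'attachment' && in_array($type, $allowedTypes, true);
}

// Made-up part data for illustration.
$pdfPart = array('content-disposition' => 'attachment', 'content-type' => 'application/pdf');
$jpgPart = array('content-disposition' => 'attachment', 'content-type' => 'image/jpeg');

var_dump(isWantedAttachment($pdfPart, array('application/pdf'))); // bool(true)
var_dump(isWantedAttachment($jpgPart, array('application/pdf'))); // bool(false)
```

An allow-list is used rather than a single hard-coded type so the same check can cover several attachment types later.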

Answer 3

I had to change the 14 to 13 in the following function to get it to work:

  protected function hasContentType($string){
    return strtolower(trim(substr($string,0,13))) == 'content-type';
  }
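The magic number most likely depends on the line endings of the input. After splitting on the boundary, each part starts with the line break that followed it: "\r\nContent-Type" is 14 characters long, while "\nContent-Type" is 13. A check that does not depend on either number could look like the sketch below; startsWithContentType() is a hypothetical name for such a replacement.

```php
<?php
// Hypothetical replacement for hasContentType(): skip any leading whitespace
// (CRLF or LF) and compare the header name case-insensitively.
function startsWithContentType(string $part): bool
{
    return stripos(ltrim($part), 'content-type:') === 0;
}

var_dump(startsWithContentType("\r\nContent-Type: text/plain"));           // bool(true)
var_dump(startsWithContentType("\ncontent-type: text/html"));              // bool(true)
var_dump(startsWithContentType("X-Other: 1\r\nContent-Type: text/plain")); // bool(false)
```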
