How can I reformat messages in an mbox file with bash or Perl?

I have a huge mbox file, with maybe 500 emails in it.

It looks like the following:

    From x@blah.com Fri Aug 12 09:34:09 2005
    Message-ID: <42FBEE81.9090701@blah.com>
    Date: Fri, 12 Aug 2005 09:34:09 +0900
    From: me <x@blah.com>
    User-Agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)
    X-Accept-Language: en-us, en
    MIME-Version: 1.0
    To: someone <someone@hotmail.com>
    Subject: Re: (no subject)
    References: <BAY101-F9353854000A4758A7E2CCA9BD0@phx.gbl>
    In-Reply-To: <BAY101-F9353854000A4758A7E2CCA9BD0@phx.gbl>
    Content-Type: text/plain; charset=ISO-8859-1; format=flowed
    Content-Transfer-Encoding: 8bit
    Status: RO
    X-Status:
    X-Keywords:
    X-UID: 371
    X-Evolution-Source: imap://x+blah.com@blah.com/
    X-Evolution: 00000002-0010

    Hey

    the actual content of the email

    someone wrote:

    > lines of quotedtext

I would like to know how I can remove all of the quoted text, strip most of the headers except the To, From and Date lines, and still have it somewhat continuous.

My goal is to print these emails in something like a book format. At the moment every program wants to print one email per page, or to include all of the headers and quoted text. Any suggestions for where to start on whipping up a small program using shell tools?

Mail::Box::Mbox will let you easily parse the file into separate messages. Mark Overmeer's slides from YAPC::Europe 2002 go into quite a bit of detail as to why parsing mail is much more difficult than it seems. The library also handles MH, IMAP and many other folder formats, not just mbox.

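    #!/usr/bin/perl
    use warnings;
    use strict;
    use Mail::Box::Manager;

    # Folder to read: first argument, else the user's default mailbox.
    my $file = shift || $ENV{MAIL};
    my $mgr  = Mail::Box::Manager->new;

    # Open read-only; the manager detects the folder type (mbox here).
    my $folder = $mgr->open( folder => $file, access => 'r' )
        or die "$file: Unable to open: $!\n";

    for my $msg ($folder->messages)
    {
        my $to   = join( ', ', map { $_->format } $msg->to );
        my $from = join( ', ', map { $_->format } $msg->from );
        my $date = localtime( $msg->timestamp );

        # Decode the transfer encoding and work on a plain string.
        my $body = $msg->decoded->string;

        # Strip all quoted text. No /s modifier here: with /s, ".*" would
        # run past the end of the line and delete the rest of the message.
        $body =~ s/^>.*\n?//mg;

        print "From: $from\n", "To: $to\n", "Date: $date\n\n", $body, "\n";
    }

    $folder->close;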

You may want to reconsider your request to strip the quoted text. What if you have email that is formatted with interleaved replies? Stripping the quoted text would make this sort of email very hard to understand:

    Foo wrote:
    > I like bar.

    Bar?  Who likes bar?

    > It is better than baz.

    Everyone knows that.

    -- 
    Quux

Additionally, what do you plan to do with attachments, non-text/plain MIME types, encoded text entities and other oddities?

As a start, I would probably use "formail" to extract the mails with just the headers you want. Either that, or use a small state machine in awk that tracks whether you are in the header or not: in the header, strip everything but the wanted headers; in the body, strip the quoted lines.
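To make the awk idea concrete, here is a minimal sketch rather than a finished tool. It assumes every message starts with a "From " line at the top of the file or after a blank line, that the wanted headers are From, To and Date, and that quoted lines start with ">". The function name `strip_mbox` is made up for this example.

```shell
#!/bin/sh
# Sketch: reduce an mbox to From/To/Date headers plus unquoted body text.
# strip_mbox is a hypothetical helper name; it reads the mbox from stdin
# or from the file names given as arguments.
strip_mbox() {
    awk '
        # A "From " line at the start of the input or after a blank line
        # opens a new message; we are in its header block from here on.
        /^From / && (NR == 1 || prev == "") {
            in_header = 1; keep = 0; prev = $0
            next
        }
        in_header {
            if ($0 == "") {               # blank line: header block is over
                in_header = 0
                print ""
            } else if ($0 ~ /^[ \t]/) {   # folded continuation of a header
                if (keep) print
            } else {                      # new header: keep only wanted ones
                keep = ($0 ~ /^(From|To|Date):/)
                if (keep) print
            }
            prev = $0
            next
        }
        # Body: remember the line for the blank-line check, then drop it
        # if it is quoted with a leading ">".
        { prev = $0 }
        /^>/ { next }
        { print }
    ' "$@"
}
```

Usage would be something like `strip_mbox huge.mbox > book.txt`. The kept headers come out in whatever order the message had them, and attribution lines such as "someone wrote:" are left in place, so some manual cleanup is still likely.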

Shell tools may not be the best fit for this, as there are libraries for handling mbox in many languages, be it Ruby, Perl or whatever. You will also have to consider that the quoting prefix is not always "> ", which can break your de-quoting step. Extracting the headers you want should not be difficult in any language. I know this is a general answer, maybe not specific enough...
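To illustrate the point about quoting prefixes, here is a rough sketch of a filter that accepts a few more styles than a bare ">". The pattern is only a guess at commonly seen conventions, not a standard, and it will misfire on some real text. `QUOTE_RE` and `strip_quotes` are names invented for this example.

```shell
#!/bin/sh
# Sketch: drop quoted lines whose prefix is not just a plain ">".
# The pattern is a heuristic (an assumption, not a standard) covering:
#   "> text", ">> text", "  > text"   plain, nested or indented quotes
#   "JD> text"                        up to four initials before the marker
#   "| text"                          vertical-bar quoting
QUOTE_RE='^[[:space:]]*([[:alpha:]]{0,4}>|\|)'

# strip_quotes is a hypothetical helper; it filters stdin or the given files
# and prints every line that does NOT look quoted.
strip_quotes() {
    grep -Ev "$QUOTE_RE" "$@"
}
```

It only makes sense to run this on message bodies, since header lines are not quoted text. Note that grep exits with a non-zero status when every input line was quoted, which matters if the calling script uses `set -e`.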
