How do I handle a file that contains multiple encodings?

Question

I have a small program that sorts email messages and writes them to a text file using $msg->decoded->string. The Perl program prints to stdout, and I redirect that to a .txt file. However, gedit is unable to open this text file because of a character set problem, and I would like to know how to restore or set a character set with Perl.

The program now looks like this:

#!/usr/bin/perl
use warnings;
use strict;
use Mail::Box::Manager;

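# Append to data.txt; stop early if the file cannot be opened.
open(MYFILE, '>>', 'data.txt')
    or die "data.txt: Unable to open: $!\n";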

my $file = shift || $ENV{MAIL};
my $mgr = Mail::Box::Manager->new(
    access          => 'r',
);

my $folder = $mgr->open( folder => $file )
or die "$file: Unable to open: $!\n";

for my $msg ( sort { $a->timestamp <=> $b->timestamp } $folder->messages)
{
    my $to          = join( ', ', map { $_->format } $msg->to );
    my $from        = join( ', ', map { $_->format } $msg->from );
    my $date        = localtime( $msg->timestamp );
    my $subject     = $msg->subject;
    my $body        = $msg->decoded->string;

    # Strip all quoted text (no /s flag, so .* stops at the end of each line)
    $body =~ s/^>.*$//mg;

    print MYFILE <<"";
From: $from
To: $to
Date: $date
$body

}

However, I still get the same problem: I am unable to open the file with gedit, even though it opens in vi and similar editors. If there are non-Unicode characters in the file, would that break it?

Answer 1

Different messages are probably in different encodings. gedit probably detects the file as UTF-8, but later finds that parts of it aren't UTF-8. Mixed files like this are a major PITA.

The best (perhaps only) solution is to check the content type of each message ($message->contentType) and convert everything to UTF-8.
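
As a rough illustration of the "convert everything to UTF-8" step, here is a minimal sketch that uses only the core Encode module. The helper name to_utf8, the ISO-8859-1 fallback, and the two sample bodies are assumptions made for the example; in the real program the bytes and the charset name would come from each message, and how to obtain them from Mail::Box is not shown here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw bytes in a known charset into UTF-8 bytes.
# Assumption: fall back to ISO-8859-1 when the charset is missing or
# not known to Encode.
sub to_utf8 {
    my ($bytes, $charset) = @_;
    $charset = 'ISO-8859-1'
        unless defined $charset && Encode::find_encoding($charset);
    my $text = decode($charset, $bytes);   # bytes -> Perl character string
    return encode('UTF-8', $text);         # character string -> UTF-8 bytes
}

# Two stand-in "message bodies" holding the same text in different charsets.
my @bodies = (
    { charset => 'ISO-8859-1', bytes => "caf\xE9" },
    { charset => 'UTF-8',      bytes => "caf\xC3\xA9" },
);

my @converted = map { to_utf8($_->{bytes}, $_->{charset}) } @bodies;

# After conversion both bodies are byte-for-byte identical UTF-8.
print $converted[0] eq $converted[1] ? "same bytes\n" : "different bytes\n";
```

Once every body has gone through a step like this, the output file holds a single encoding, which is what an editor such as gedit expects.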

Answer 2

If you are simply redirecting Perl's output, then Perl will have a difficult time producing a decent file.

You should try writing the file directly from Perl.

You should also check whether you really have an encoding problem, or whether characters that simply don't belong in your file still end up there. Use vi, a hex editor, or simply hexdump to do that.
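
If you would rather do that check from Perl than from a hex editor, something along these lines may help. It is only a sketch: the helper names is_valid_utf8 and bad_lines and the sample data are made up for the example, and it assumes the file is supposed to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps the
# input string untouched.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Return the 1-based numbers of the lines that are not valid UTF-8.
sub bad_lines {
    my ($data) = @_;
    my @lines = split /\n/, $data;
    return grep { !is_valid_utf8($lines[$_ - 1]) } 1 .. @lines;
}

# Sample data: line 3 contains a lone Latin-1 byte (0xE9).
my $sample = "plain ascii\ncaf\xC3\xA9 in UTF-8\ncaf\xE9 in Latin-1\n";
my @bad = bad_lines($sample);
print "Suspicious line: $_\n" for @bad;
```

For the sample above this flags line 3 only. To check a real file, read it in raw mode (for example with the :raw layer) and pass its contents to bad_lines.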

Answer 3

You can use the I/O layers facility. Open a file like this to specify the encoding:

open my $fh, '>:encoding(UTF-8)', $file;

Or you can use binmode() to alter an already opened filehandle:

binmode(STDOUT, ':encoding(UTF-8)');

Of course, you can set encodings other than UTF-8, and there are plenty of other options, too. Just look in the documentation for open and binmode. Maybe IO::File is worth a look, too:

perldoc -f open
perldoc -f binmode
perldoc IO::File
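
To see what an encoding layer actually does, here is a small sketch that writes the same string through two different layers. In-memory filehandles stand in for real files so the resulting bytes can be inspected; the variable names and the sample string are chosen just for the example.

```perl
use strict;
use warnings;

my $text = "caf\x{E9}";    # one Perl character string, 4 characters

# Buffers that receive the encoded bytes.
my ($utf8_bytes, $latin1_bytes) = ('', '');

# Same text, written through a UTF-8 layer ...
open my $utf8_fh, '>:encoding(UTF-8)', \$utf8_bytes
    or die "Cannot open in-memory file: $!";
print {$utf8_fh} $text;
close $utf8_fh;

# ... and through an ISO-8859-1 layer.
open my $latin1_fh, '>:encoding(ISO-8859-1)', \$latin1_bytes
    or die "Cannot open in-memory file: $!";
print {$latin1_fh} $text;
close $latin1_fh;

# The layer decides how many bytes each character becomes on output.
printf "UTF-8:   %d bytes\n", length $utf8_bytes;
printf "Latin-1: %d bytes\n", length $latin1_bytes;
```

The layer only controls how character strings are encoded on output. It cannot repair data that was already a mix of encodings before it was printed, so the strings still need to be decoded correctly first.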

Answer 1: score 3, accepted, 2008-12-15 15:42:08
Answer 2: score 1, 2008-12-15 14:50:40
Answer 3: score 1, 2008-12-15 15:27:17