
How do you change the encoding of a file, using Perl?

I'm writing a Perl script that creates an XML file, "settings.xml", using XML::Writer. I'd like the file to be encoded in UCS-2 big endian, but I'm unsure how.

I've tried things like open(my $output, "> :encoding(UCS-2BE)", "settings.xml"); but all that does is turn the file's output into a big mess (e.g. either http://i.imgur.com/p9cruCf.png or a series of Chinese characters) while keeping the encoding of the file as ANSI.

Any idea how to fix this, or alternatively, how to convert a file into UCS-2?

I'm a beginner at Perl, sorry if some of this doesn't make sense.

EDIT: For anyone else encountering this problem, please see the answers below; they provide a thorough explanation of how to fix it.

XML::Writer doesn't support anything but US-ASCII and UTF-8 (as mentioned in the documentation of its ENCODING constructor argument). Creating a UCS-2be XML document using XML::Writer is tricky, but not impossible.

use XML::Writer qw( );

my $qfn = 'settings.xml';   # Output file name, taken from the question.

# XML::Writer doesn't encode for you, so we need to use :encoding.
# The :raw avoids a problem with CRLF conversion on Windows.
open(my $fh, '>:raw:encoding(UCS-2be)', $qfn)
   or die("Can't create \"$qfn\": $!\n");

# This prints the BOM. It's optional, but it's useful when using an
# encoding that's not a superset of US-ASCII (such as UCS-2be).
print($fh "\x{FEFF}");

my $writer = XML::Writer->new(
   OUTPUT   => $fh,
   ENCODING => 'US-ASCII',   # Use entities for > U+007F
);
$writer->xmlDecl('UCS-2be');
$writer->startTag('root');
$writer->characters("\x{00041}");
$writer->characters("\x{000C9}");
$writer->characters("\x{10000}");
$writer->endTag();
$writer->end();

Downside: All characters above U+007F will be present as XML entities. In the above example,

  • U+00041 will be present as "A" (00 41). Good.
  • U+000C9 will be present as "&#xC9;" (00 26 00 23 00 78 00 43 00 39 00 3B). Suboptimal, but ok.
  • U+10000 will be present as "&#x10000;" (00 26 00 23 00 78 00 31 00 30 00 30 00 30 00 30 00 3B). Good, XML entities are needed to store U+10000 with UCS-2be.
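The byte sequences listed above can be checked independently of XML::Writer. The following is a minimal sketch that assumes only the core Encode module; hex_bytes is a hypothetical helper name introduced for illustration, not part of any library.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: format a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# A plain ASCII character takes two bytes in UCS-2be.
print hex_bytes(encode('UCS-2BE', 'A')), "\n";        # 00 41

# The entity form of U+00C9 is six ASCII characters, so twelve bytes.
print hex_bytes(encode('UCS-2BE', '&#xC9;')), "\n";   # 00 26 00 23 00 78 00 43 00 39 00 3B

# The literal character U+00C9 would take only two bytes.
print hex_bytes(encode('UCS-2BE', "\x{C9}")), "\n";   # 00 C9
```

The comparison between the last two lines is the cost of the entity-based approach: twelve bytes instead of two for every character above U+007F.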

You can avoid the downside mentioned above if and only if you can guarantee that no character above U+FFFF will be provided to the writer.

use XML::Writer qw( );

my $qfn = 'settings.xml';   # Output file name, taken from the question.

# XML::Writer doesn't encode for you, so we need to use :encoding.
# The :raw avoids a problem with CRLF conversion on Windows.
open(my $fh, '>:raw:encoding(UCS-2be)', $qfn)
   or die("Can't create \"$qfn\": $!\n");

# This prints the BOM. It's optional, but it's useful when using an
# encoding that's not a superset of US-ASCII (such as UCS-2be).
print($fh "\x{FEFF}");

my $writer = XML::Writer->new(
   OUTPUT   => $fh,
   ENCODING => 'UTF-8',   # Don't use entities.
);
$writer->xmlDecl('UCS-2be');
$writer->startTag('root');
$writer->characters("\x{00041}");
$writer->characters("\x{000C9}");
#$writer->characters("\x{10000}");  # This causes a fatal error
$writer->endTag();
$writer->end();

  • U+00041 will be present as "A" (00 41). Good.
  • U+000C9 will be present as "É" (00 C9). Good.
  • U+10000 causes a fatal error.
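If the guarantee cannot be made statically, one option is to test each string before handing it to the writer. This is only a sketch: fits_in_ucs2 is a hypothetical helper name, and it reuses the same character class as the substitution in the final approach of this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character is at or below U+FFFF,
# i.e. the string can be written through :encoding(UCS-2be) as-is.
sub fits_in_ucs2 {
    my ($text) = @_;
    return $text !~ /[^\x{0000}-\x{FFFF}]/;
}

for my $text ("\x{41}\x{C9}", "\x{10000}") {
    my $label = join ' ', map { sprintf 'U+%04X', ord } split //, $text;
    print $label, ': ', (fits_in_ucs2($text) ? 'ok' : 'needs an entity'), "\n";
}
```

A caller could die or fall back to the entity-based variant when the check fails, rather than hitting the fatal error halfway through writing the file.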

And here's how you can do it without any of the downsides:

use Encode      qw( decode encode );
use XML::Writer qw( );

my $qfn = 'settings.xml';   # Output file name, taken from the question.

my $xml;
{
   # XML::Writer doesn't encode for you, so we need to use :encoding.
   open(my $fh, '>:encoding(UTF-8)', \$xml);

   # This prints the BOM. It's optional, but it's useful when using an
   # encoding that's not a superset of US-ASCII (such as UCS-2be).
   print($fh "\x{FEFF}");

   my $writer = XML::Writer->new(
      OUTPUT   => $fh,
      ENCODING => 'UTF-8',   # Don't use entities.
   );
   $writer->xmlDecl('UCS-2be');
   $writer->startTag('root');
   $writer->characters("\x{00041}");
   $writer->characters("\x{000C9}");
   $writer->characters("\x{10000}");
   $writer->endTag();
   $writer->end();
   close($fh);
}

# Fix encoding.
$xml = decode('UTF-8', $xml);
$xml =~ s/([^\x{0000}-\x{FFFF}])/ sprintf('&#x%X;', ord($1)) /eg;
$xml = encode('UCS-2be', $xml);

open(my $fh, '>:raw', $qfn)
   or die("Can't create \"$qfn\": $!\n");

print($fh $xml);

  • U+00041 will be present as "A" (00 41). Good.
  • U+000C9 will be present as "É" (00 C9). Good.
  • U+10000 will be present as "&#x10000;" (00 26 00 23 00 78 00 31 00 30 00 30 00 30 00 30 00 3B). Good, XML entities are needed to store U+10000 with UCS-2be.
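To confirm that the bytes produced this way decode back to the intended text, something like the following could be used. It is a sketch under assumptions: read_ucs2be is a hypothetical helper name, and the sample bytes are built in memory rather than read from the file.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: decode UCS-2be bytes and drop a leading BOM, if any.
sub read_ucs2be {
    my ($bytes) = @_;
    my $text = decode('UCS-2BE', $bytes);
    $text =~ s/\A\x{FEFF}//;   # The BOM is optional, so only strip it if present.
    return $text;
}

# Sample input shaped like the output of the code above: BOM, then the document.
my $bytes = encode('UCS-2BE', "\x{FEFF}<root>A\x{C9}&#x10000;</root>");
my $text  = read_ucs2be($bytes);

print length($text), " characters after decoding\n";
```

For a real file, the bytes would come from a handle opened with '<:raw' so that no layer alters them before decoding.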

You don't describe what goes wrong, but you may be running into a bug that some perl versions had on Windows, caused by a bad interaction between the encoding and crlf layers. If so, this should work:

open(my $output, "> :raw:perlio:encoding(UCS-2BE):crlf:utf8", "settings.xml");

(See http://www.perlmonks.org/?node_id=608532 for an explanation.)

If not, please provide more information than "all that does is make the file output a big mess". A short script demonstrating the problem would be helpful.
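One piece of information that may help is the list of PerlIO layers actually applied to the handle. The sketch below uses the core PerlIO::get_layers function; layers_of is a hypothetical helper name, and the in-memory handle is only there to keep the example self-contained. For a real diagnosis the handle should be the one opened on "settings.xml", since the layers reported depend on the platform and on how the file was opened.

```perl
use strict;
use warnings;

# Hypothetical helper: names of the output layers on a file handle.
sub layers_of {
    my ($fh) = @_;
    return PerlIO::get_layers($fh, output => 1);
}

# In-memory handle used purely for illustration.
open(my $output, '>:raw:encoding(UCS-2BE)', \my $buffer)
    or die("Can't open in-memory file: $!\n");

print join(' ', layers_of($output)), "\n";
close($output);
```

If a crlf layer shows up below the encoding layer in that list, that would be consistent with the bug described above.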
