简体   繁体   English

无法使用Text :: CSV_XS Perl模块写入UTF-16LE编码的CSV文件

[英]Cannot write UTF-16LE encoded CSV file with Text::CSV_XS Perl module

I want to write a CSV file encoded in UTF-16LE. 我想编写一个以UTF-16LE编码的CSV文件。 However, the output in the file gets messed up. 但是,文件中的输出混乱了。 There are strange chinese looking letters: ਍挀攀氀氀㄀⸀㄀㬀挀攀氀氀㄀⸀㈀㬀ഀ. 有一些奇怪的中文字母:਍挀攀氀氀㄀⸀㄀㬀挀攀氀氀㄀⸀㈀㬀ഀ。

This looks like off-by-one-byte problem mentioned here: Creating UTF-16 newline characters in Python for Windows Notepad 这看起来像是这里提到的逐字节问题: 在Windows记事本中的Python中创建UTF-16换行符

Other threads about Perl and Text::CSV_XS didn't help. 有关Perl和Text :: CSV_XS的其他线程没有帮助。

This is how I try it: 这是我尝试的方法:

#!perl

use strict;
use warnings;
use utf8;
use Text::CSV_XS;

binmode STDOUT, ":utf8";

my $csv = Text::CSV_XS->new({
    binary => 1,
    sep_char => ";",
    quote_char => undef,
    eol => $/,
});

open my $in, '<:encoding(UTF-16LE)', 'in.csv' or die "in.csv: $!";
open my $out, '>:encoding(UTF-16LE)', 'out.csv' or die "out.csv: $!";

while (my $row = $csv->getline($in)) {
    $_ =~ s/ä/æ/ for @$row; # something will be done to the data...
    $csv->print($out, $row);
}


close $in;
close $out;

in.csv contains some test data and it is encoded in UTF-16LE: in.csv包含一些测试数据,并以UTF-16LE进行编码:

header1;header2;
cell1.1;cell1.2;
äöü2.1;ab"c2.2;

The results looks like this: 结果看起来像这样:

header1;header2;਍挀攀氀氀㄀⸀㄀㬀挀攀氀氀㄀⸀㈀㬀ഀ
æöü2.1;abc2.2;਍

It is not an option to switch to UTF-8 as output format (which works fine btw). 不能选择将UTF-8作为输出格式(顺便说一句)。

So, how do I write valid UTF-16LE encoded CSV files using Text::CSV_XS? 那么,如何使用Text :: CSV_XS编写有效的UTF-16LE编码的CSV文件?

Perl adds :crlf by default on Windows. 在Windows上,Pe​​rl默认添加:crlf It's added first, before your :encoding is added. 首先添加它,然后再添加:encoding

That means LF⇔CRLF conversion will be performed before decoding on reads, and after encoding on writes. 这意味着将在读取解码之前和写入编码之后执行LF⇔CRLF转换。 This is backwards. 这是倒退。

It ends up working with UTF-8 despite being done backwards because all of the following conditions are met: 尽管已经完成了向后的操作,但最终还是使用UTF-8,因为满足以下所有条件:

  • The UTF-8 encoding of LF is the same as its Code Point (0A). LF的UTF-8编码与其代码点(0A)相同。
  • The UTF-8 encoding of CR is the same as its Code Point (0D). CR的UTF-8编码与其代码点(0D)相同。
  • 0A always refers to LF no matter where they are in the file. 不管它们在文件中的什么位置,0A始终引用LF。
  • 0D always refers to CR no matter where they are in the file. 0D始终引用CR,无论它们在文件中的何处。

None of those conditions holds true for UTF-16le. 这些条件都不适用于UTF-16le。

Fix: 固定:

open(my $fh_in,  '<:raw:encoding(UTF-16LE):crlf', $qfn_in)
open(my $fh_out, '>:raw:encoding(UTF-16LE):crlf', $qfn_out)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM