
Setting default encoding to utf-8 in perl conditionally on command-line option

In order to process text in utf-8 in Perl, I have been using binmode(<file-handle>, ":encoding(UTF-8)"); on each stream I use. I just discovered that

use open ( ":encoding(UTF-8)", ":std" );

can be used to do the same thing globally. This is great, since it means a lot less repetitive code.
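For illustration, here is a minimal sketch contrasting the two approaches on a scratch file. The sub names and the temporary file are made up for this example, and it assumes that neither PERL_UNICODE nor PERLIO is set in the environment.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the three bytes "\xC3\xA0\n" (UTF-8 for "à" plus a newline) to a scratch file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "\xC3\xA0\n";
close($tmp);

# Per-handle: the layer is attached explicitly after opening.
sub length_with_binmode {
    open(my $fh, '<', $path) or die "open: $!";
    binmode($fh, ':encoding(UTF-8)');
    my $line = <$fh>;
    close($fh);
    return length($line);
}

# Pragma: every open() compiled inside this block gets the layer by default.
sub length_with_pragma {
    use open ':encoding(UTF-8)';
    open(my $fh, '<', $path) or die "open: $!";
    my $line = <$fh>;
    close($fh);
    return length($line);
}

# No layer: compiled outside the pragma's scope, so the line is read as bytes.
sub length_raw {
    open(my $fh, '<', $path) or die "open: $!";
    my $line = <$fh>;
    close($fh);
    return length($line);
}

printf "binmode: %d, pragma: %d, raw: %d\n",
    length_with_binmode(), length_with_pragma(), length_raw();
```

The first two subs should report 2 characters and the last one 3 bytes, which also shows that the pragma does not leak out of the block it was compiled in.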

But now I have a problem: I would like to have a command line option to my script, -utf8, which turns everything utf-8 only when supplied. Since use open is a pragma, it takes effect at compile time and is lexically scoped, so putting it inside an if statement does not make it conditional; but without an if statement it cannot depend on command line options.

Here is a minimal example illustrating the problem, call it problem.pl:

#!/usr/bin/env perl

# hard-coded in my minimal example, normally set by command line option -utf8
my $use_utf8 = 1;

# use only applies within its lexical scope - this does not work
if ($use_utf8) {
   use open ( ":encoding(UTF-8)", ":std" );
}

# if I put it at the right lexical scope, it's not conditional on $use_utf8
# use open ( ":encoding(UTF-8)", ":std" );

while (<>) {
   print length($_);
}

When I run this code on a file, call it input, containing one line with a 2-byte UTF-8 character, say à, it outputs 3:

$ ./problem.pl input
3

If I move the use open statement to the global scope, I get the expected result of a length of 2 (one character plus one newline):

$ ./problem.pl input
2

So how can I set the encoding to utf-8 globally, but conditionally on a command-line option, so that I would get 2 with -utf8 but 3 without?

Also, in my real use case, I use the diamond operator ( while (<>) ) to provide high flexibility in the command line syntax to process multiple files, but in this case I can't call binmode since the file handles are managed automatically by Perl. use open would be a much nicer option, if I could make it conditional.

PS: Yes, I really do still have non-utf8 data that I want to continue to be able to handle. Thank God most of our data is now in utf-8, but unfortunately not all of it yet.

First: you can use the if pragma to conditionally apply a lexical pragma. Just make sure the condition is available at compile time (you may need to set it in a BEGIN block beforehand).

my $use_utf8;
BEGIN { $use_utf8 = 1; }
use if $use_utf8, 'open', ':std', ':encoding(UTF-8)';
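With real option parsing in place of the hard-coded flag, the same idea might look like the sketch below. It assumes Getopt::Long for the -utf8 option (the question does not say how the option is parsed), and it prints a newline after each length for readability.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;

# Parse the command line at compile time, so the flag is already set
# when the "use if" line below is compiled.
my $use_utf8;
BEGIN {
    GetOptions('utf8' => \$use_utf8)
        or die "usage: $0 [-utf8] [file ...]\n";
}

# Applies the open pragma to the rest of this file only when -utf8 was given.
use if $use_utf8, 'open', ':std', ':encoding(UTF-8)';

# GetOptions has removed -utf8 from @ARGV, so <> only sees the file names.
while (<>) {
    print length($_), "\n";
}
```

Run as ./problem.pl -utf8 input this should print 2 for the sample file from the question, and 3 when the option is left out.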

The -C option works similarly to the open pragma for utf8 layers. -CSD will set it on the standard handles (S) and any handles opened (D). Unfortunately it uses the less safe :utf8 layer instead of :encoding(UTF-8), so you may end up with broken strings if you use it for input that is not actually UTF-8. Also, -CD sets a default for any handles opened in the whole program, not just the lexical scope of your script, which can break modules that don't expect it. (-CS is always global, as is the ':std' effect of the open pragma, since the standard handles are global.)

perl -CSD problem.pl input
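To see what "broken strings" means here, the following sketch reads the same Latin-1 bytes through both layers from an in-memory file and asks utf8::valid whether the result is well-formed. The helper read_with_layer and the sample text are made up for this example, and both reads are expected to print a warning on STDERR.

```perl
use strict;
use warnings;

# Latin-1 bytes for "café au lait": the lone \xE9 is not valid UTF-8.
my $bytes = "caf\xE9 au lait\n";

# Read the first line of an in-memory file through the given layer.
sub read_with_layer {
    my ($layer) = @_;
    my $copy = $bytes;
    open(my $fh, "<$layer", \$copy) or die "open: $!";
    my $line = <$fh>;
    close($fh);
    return $line;
}

my $lax    = read_with_layer(':utf8');             # the layer -C applies
my $strict = read_with_layer(':encoding(UTF-8)');  # the layer used with the open pragma

printf "lax valid: %s, strict valid: %s\n",
    (utf8::valid($lax)    ? 'yes' : 'no'),
    (utf8::valid($strict) ? 'yes' : 'no');
```

With :utf8 the bytes are only flagged as UTF-8, so the string stays malformed, whereas :encoding(UTF-8) substitutes the offending byte and returns a well-formed string.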
