Setting default encoding to utf-8 in perl conditionally on command-line option
In order to process text in utf-8 in Perl, I have been using
binmode(<file-handle>, ":encoding(UTF-8)");
on each stream I use. I just discovered that
use open ( ":encoding(UTF-8)", ":std" );
can be used to do the same thing globally. This is great, since it means a lot less repetitive code.
But now I have a problem: I would like to have a command line option to my script, -utf8, which turns everything utf-8 only when supplied. Since use open is a pragma, it is lexically scoped and I cannot put it in an if statement, but without an if statement it cannot depend on command line options.
Here is a minimal example illustrating the problem, call it problem.pl:
#!/usr/bin/env perl
# hard-coded in my minimal example, normally set by command line option -utf8
my $use_utf8 = 1;
# use only applies within its lexical scope - this does not work
if ($use_utf8) {
    use open ( ":encoding(UTF-8)", ":std" );
}
# if I put it at the right lexical scope, it's not conditional on $use_utf8
# use open ( ":encoding(UTF-8)", ":std" );
while (<>) {
    print length($_);
}
When I run this code on a file, call it input, containing one line with a 2-byte UTF-8 character, say à, it outputs 3:
$ ./problem.pl input
3
If I move the use open statement to the global scope, I get the expected result of a length of 2 (one character plus one newline):
$ ./problem.pl input
2
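The difference between 3 and 2 is the difference between counting bytes and counting characters. A minimal sketch of that, assuming the input line holds à encoded as the two bytes C3 A0 (the variable names are only for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as read without any decoding layer: the UTF-8 encoding of
# U+00E0 (à) is the two bytes C3 A0, followed by a newline byte.
my $bytes    = "\xC3\xA0\n";
my $byte_len = length $bytes;

# Decoding turns the byte string into a character string.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

print "$byte_len\n";   # 3: two bytes for à plus the newline
print "$char_len\n";   # 2: one character plus the newline
```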
So how can I set the encoding to utf-8 globally, but conditionally on a command-line option, so that I would get 2 with -utf8 but 3 without?
Also, in my real use case, I use the diamond operator ( while (<>) ) to provide high flexibility in the command line syntax to process multiple files, but in this case I can't call binmode since the file handles are managed automatically by Perl. use open would be a much nicer option, if I could make it conditional.
PS: Yes, I really do still have non-utf8 data that I want to continue to be able to handle. Thank God most of our data is now in utf-8, but unfortunately not all of it yet.
First: you can use if (the core if module) to conditionally apply a lexical pragma. Just make sure the condition is available at compile time (you may need to set it in a BEGIN block beforehand).
my $use_utf8;
BEGIN { $use_utf8 = 1; }
use if $use_utf8, 'open', ':std', ':encoding(UTF-8)';
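To tie this to the -utf8 option from the question, the option has to be parsed inside the BEGIN block. The following is a minimal sketch, assuming Getopt::Long is used for parsing; the helper line_lengths and the choice to read only when files are named are assumptions made for illustration, not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $use_utf8;
BEGIN {
    # BEGIN runs at compile time, so @ARGV is parsed before the
    # `use if` line below is compiled. GetOptions also removes -utf8
    # from @ARGV, leaving only file names for <>.
    require Getopt::Long;
    Getopt::Long::GetOptions('utf8' => \$use_utf8)
        or die "usage: $0 [-utf8] file ...\n";
}

# Applied to the rest of this file only when -utf8 was given.
use if $use_utf8, 'open', ':std', ':encoding(UTF-8)';

# Hypothetical helper: collect the length of every line read through <>.
sub line_lengths {
    my @lengths;
    while (my $line = <>) {
        push @lengths, length $line;
    }
    return @lengths;
}

# Only read when files were named, so the sketch never waits on STDIN.
if (@ARGV) {
    print "$_\n" for line_lengths();
}
```

Saved as problem.pl, this should print 2 for ./problem.pl -utf8 input and 3 for ./problem.pl input, given the input file described in the question. Dropping the if (@ARGV) guard would restore reading from standard input when no files are named.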
The -C option works similarly to the open pragma for utf8 layers. -CSD will set it on the standard handles (S) and any handles opened (D). Unfortunately it uses the less safe :utf8 layer instead of :encoding(UTF-8), so you may end up with broken strings if you use it for input that is not actually UTF-8. Also, -CD sets a default for any handles opened in the whole program, not just the lexical scope of your script, which can break modules that don't expect it. ( -CS is always global, as is the ':std' effect of the open pragma, since the standard handles are global.)
perl -CSD problem.pl input
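For contrast with -CD, the lexical scoping of the open pragma can be observed directly. This is a small sketch, assuming a scratch file created with File::Temp; the helper has_encoding_layer is a hypothetical name, and it relies on the built-in PerlIO::get_layers to list the layers on a handle.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file to open twice; its contents do not matter here.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "close: $!";

# Hypothetical helper: true if the handle carries an :encoding(...) layer.
sub has_encoding_layer {
    my ($fh) = @_;
    return scalar grep { /^encoding\(/ } PerlIO::get_layers($fh);
}

my ($inside, $outside);
{
    # The pragma's default layers apply only to opens compiled in this block.
    use open ':encoding(UTF-8)';
    open my $fh, '<', $path or die "open: $!";
    $inside = has_encoding_layer($fh);
    close $fh;
}

# Same call outside the block: no default layers from the pragma.
open my $fh, '<', $path or die "open: $!";
$outside = has_encoding_layer($fh);
close $fh;

print "inside block:  ", ($inside  ? "encoding layer" : "no encoding layer"), "\n";
print "outside block: ", ($outside ? "encoding layer" : "no encoding layer"), "\n";
```

Run without -C, the handle opened inside the block should report an encoding layer and the one opened outside should not, which is the per-scope behaviour that -CD does not give you.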