
How can I use unicode characters in a perl regex substitution command?

This doesn't work when using unicode characters (in Ubuntu bash):

$ perl -pC -e's/[à]/a/gu' <<< 'à'
à
$ perl -pC -e's/[b]/a/gu' <<< 'b'
a

Even though it seems to be supported by PCRE (at least according to regex101).

What am I doing wrong? Am I missing some flag in the perl command?

This "just works" in javascript, so I would be using node if I could come up with a simple one-liner for this in command line... but I still want to know why the perl command is not working.


For context:

I'm trying to use substitutions like /[àâáãä]/a/g, /[òôóõö]/o/g, etc. to asciify a dictionary file (i.e. remove accents and the like from a word list), so I can use it to make spell-checking accent-insensitive (e.g. in IntelliJ IDEA).

Basically these are the steps to make an "asciified" extra dictionary:

  1. Download the .dic file for the language (list of all words)
  2. Use grep to filter words containing non-ASCII / replaceable characters
  3. Use regex substitutions in succession to make words accent-insensitive
  4. Import the asciified .dic file in the IDE (in addition to the standard language dictionary)

You need to add -Mutf8 to tell Perl the program is encoded using UTF-8 rather than ASCII.

$ perl -pC -Mutf8 -e's/[à]/a/gu' <<< 'à'
a

perl -p -e's/à/a/gu' <<< 'à' worked for me.

One practical approach -- use Text::Unidecode

perl -C -MText::Unidecode -pe'unidecode($_)'  <<< 'à'

Prints a.

Another approach: decompose characters ("normalize") using Unicode::Normalize, so that the character and its diacritical marks (combining accents) are separated into their own code points while still forming a valid grapheme, then remove the diacriticals (\p{NonspacingMark} or \p{Mn}) with a simple regex.

Both approaches have exceptions and edge cases, but either may do just what you need.

Here's how I implemented steps 2 and 3.
This can be used, e.g., in these dictionaries (though I didn't test it on every language).

asciify-dic

#!/usr/bin/env bash
#License: "Zero-Clause BSD" <https://opensource.org/licenses/0BSD>
if [[ "$1" == "--help" ]]; then
  echo "Usage: $(basename "$0") INPUT_FILE > OUTPUT_FILE"
  echo "Asciify a .dic file (list of dictionary words)."
  echo ""
  echo "Generates a file with ASCII-only versions of the words that have non-ASCII chars."
  echo "These additional words can be used to make spell-checking accent-insensitive."
  echo "Comment lines beginning with % are left unchanged."
  exit
fi
# Filter words containing non-ascii characters, except in comments
grep -P '^\%|[^\x00-\x7F]' "$1" |
# Make words accent-insensitive, except in comments
perl -C -MText::Unidecode -pe'next if /^\s*%/;unidecode($_)' |
# Remove duplicate lines, except in comments
awk '/^\s*%/||!seen[$0]++'

Example usage:

asciify-dic "$DIC_NAME.dic" > "$DIC_NAME-asciified.dic"

The short answer is to add -Mutf8 to your command line.

If you're not sure how Perl is interpreting what you wrote on the command line, you can make it spit it back to you with the core B::perlstring() function or deparse the whole script with B::Deparse. That would illustrate your problem real fast. (Enclosing the 'à' character in brackets doesn't do anything here.)

$ perl -MO=Deparse -pC -e 's/à/a/gu' <<< 'à'

LINE: while (defined($_ = <ARGV>)) {
    s/\303\240/a/gu;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

See how your substitution strangely has 2 characters in it? Those are the two bytes of the UTF-8 encoding of à, each taken as a character of its own.

You can then see immediately how use utf8 fixes your problem.

$ perl -MO=Deparse -Mutf8 -pC -e 's/à/a/gu' <<< 'à'
use utf8;
LINE: while (defined($_ = <ARGV>)) {
    s/\340/a/gu;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

You can use perlstring() to make sure Perl is receiving the input you think.

$ perl -p -MB -E 'say B::perlstring($_)' <<< 'à'
"\303\240\n"
à
$ perl -pC -MB -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à

You can see that without -C Perl receives the two bytes of the UTF-8 encoding of à (\303\240) as two separate characters, whereas with -C it receives the single decoded character \x{e0}.

Depending on the circumstances, Perl dumps characters as either an octal code ( \340 ) or a hexadecimal code ( \xE0 ). Note well here that you can always replace raw unicode characters in your command line with the escape code version. This is a great way to make explicit what otherwise would be ambiguous.

$ perl -pC -e 's/[\xE0]/a/gu' <<< 'à'
a

If you don't want to have to remember UTF8 mode, you can shove those options in the PERL5OPT environment variable or create a shell alias. Beware of making this global: every Perl program started from that environment will inherit the options!

$ export PERL5OPT='-C -Mutf8'
$ perl -MO=Deparse -p -e 's/à/a/gu' <<< 'à'
use utf8;
LINE: while (defined($_ = <ARGV>)) {
    s/\340/a/gu;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

$ perl -MB -p -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à

Or as a shell alias.

alias uperl='perl -C -Mutf8'

See perlrun for more information on how to Swiss Army Chainsaw the command line.

See also B::Deparse.
