How to use unicode characters in a perl regex substitution command?

Question

This doesn't work when using unicode characters (in Ubuntu bash):

$ perl -pC -e's/[à]/a/gu' <<< 'à'
à
$ perl -pC -e's/[b]/a/gu' <<< 'b'
a

Even though it seems to be supported by PCRE (at least according to regex101).

What am I doing wrong? Am I missing some flag in the perl command?

This "just works" in JavaScript, so I would be using node if I could come up with a simple one-liner for this on the command line... but I still want to know why the perl command is not working.

For context:

I'm trying to use substitutions like /[àâáãä]/a/g, /[òôóõö]/o/g, etc. to asciify a dictionary file (i.e. remove accents and the like from a word list), so I can use it to make spell-checking accent-insensitive (e.g. in IntelliJ IDEA).

Basically these are the steps to make an "asciified" extra dictionary:

1. Download the .dic file for the language (list of all words)
2. Use grep to filter words containing non-ASCII / replaceable characters
3. Use regex substitutions in succession to make words accent-insensitive
4. Import the asciified .dic file in the IDE (in addition to the standard language dictionary)

Answer 1

You need to add -Mutf8 to tell Perl the program is encoded using UTF-8 rather than ASCII.

$ perl -pC -Mutf8 -e's/[à]/a/gu' <<< 'à'
a

Answer 2

perl -p -e's/à/a/gu' <<< 'à' worked for me.

Answer 3

One practical approach -- use Text::Unidecode

perl -C -MText::Unidecode -pe'unidecode($_)'  <<< 'à'

Prints a.

Another approach: decompose characters ("normalize") using Unicode::Normalize, so that the character and its diacritical marks (combining accents) are separated into their own code points, while they still form a valid grapheme, then remove the diacriticals (\p{NonspacingMark} or \p{Mn}) with a simple regex.

Both of these ways will have exceptions and edge cases, but I think they may just do what you need.

Answer 4

Here's how I implemented steps 2 and 3.
This can be used, e.g., in these dictionaries (though I didn't test it on every language).

asciify-dic

#!/usr/bin/env bash
#License: "Zero-Clause BSD" <https://opensource.org/licenses/0BSD>
if [[ "$1" == "--help" ]]; then
  echo "Usage: $(basename "$0") INPUT_FILE > OUTPUT_FILE"
  echo "Asciify a .dic file (list of dictionary words)."
  echo ""
  echo "Generates a file with ASCII-only versions of the words that have non-ASCII chars."
  echo "These additional words can be used to make spell-checking accent-insensitive."
  echo "Comment lines beginning with % are left unchanged."
  exit
fi
# Filter words containing non-ascii characters, except in comments
grep -P '^\%|[^\x00-\x7F]' "$1" |
# Make words accent-insensitive, except in comments
perl -C -MText::Unidecode -pe'next if /^\s*%/;unidecode($_)' |
# Remove duplicate lines, except in comments
awk '/^\s*%/||!seen[$0]++'

Example usage:

asciify-dic $DIC_NAME.dic > $DIC_NAME-asciified.dic

Answer 5

The short answer is to add -Mutf8 to your command line.

If you're not sure how Perl is interpreting what you wrote on the command line, you can make it spit it back to you with the core B::perlstring() function or deparse the whole script with B::Deparse. That would illustrate your problem real fast. (Enclosing the 'à' character in brackets doesn't do anything here.)

$ perl -MO=Deparse -pC -e 's/à/a/gu' <<< 'à'


LINE: while (defined($_ = <ARGV>)) {
    s/\303\240/a/gu;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

See how your substitution strangely has 2 characters in it? Those are the two bytes of the UTF-8 encoding of à, which Perl treats as two separate characters.

You can then see immediately how use utf8 fixes your problem.
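
As a quick cross-check outside Perl, dumping the bytes of à in octal with od should show the same two values that Deparse printed. This is a sketch that assumes a UTF-8 script; the variable name is illustrative.

```shell
# Octal dump of the UTF-8 encoding of à; spaces and newlines are stripped for comparison.
bytes=$(printf '%s' 'à' | od -An -to1 | tr -d ' \n')
echo "$bytes"   # prints: 303240 (i.e. \303 \240)
```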

$ perl -MO=Deparse -Mutf8 -pC -e 's/à/a/gu' <<< 'à'

use utf8;
LINE: while (defined($_ = <ARGV>)) {
    s/\340/a/gu;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

You can use perlstring() to make sure Perl is receiving the input you think it is.

$ perl -p -MB -E 'say B::perlstring($_)' <<< 'à'
"\303\240\n"
à

$ perl -pC -MB -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à

You can see that without -C Perl is receiving 2 separate characters, one per byte of the UTF-8 encoding, instead of the single character à.

Depending on the circumstances, Perl dumps characters as either an octal code (\340) or a hexadecimal code (\xE0). Note well here that you can always replace raw unicode characters in your command line with the escape code version. This is a great way to make explicit what would otherwise be ambiguous.

$ perl -pC -e 's/[\xE0]/a/gu' <<< 'à'
a

If you don't want to have to remember UTF8 mode, you can shove those options in the PERL5OPT environment variable or create a shell alias. Beware of making this global!

$ export PERL5OPT='-C -Mutf8'
$ perl -MO=Deparse -p -e 's/à/a/gu' <<< 'à'

use utf8;
LINE: while (defined($_ = <ARGV>)) {
    s/\340/a/gu;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

$ perl -MB -p -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à

Or as a shell alias.

alias uperl='perl -C -Mutf8'

See perlrun for more information on how to Swiss Army Chainsaw the command line.

See also B::Deparse.
