How can I use unicode characters in perl regex substitution command?
This doesn't work when using unicode characters (in Ubuntu bash):
$ perl -pC -e's/[à]/a/gu' <<< 'à'
à
$ perl -pC -e's/[b]/a/gu' <<< 'b'
a
Even though it seems to be supported by PCRE (at least according to regex101).
What am I doing wrong? Am I missing some flag in the perl command?
This "just works" in JavaScript, so I would be using Node if I could come up with a simple one-liner for this on the command line... but I still want to know why the perl command is not working.
For context:
I'm trying to use substitutions like /[àâáãä]/a/g, /[òôóõö]/o/g, etc. to asciify a dictionary file (i.e. remove accents, etc. from a word list), so I can use it to make spell-checking accent-insensitive (e.g. in IntelliJ IDEA).
Basically these are the steps to make an "asciified" extra dictionary:
You need to add -Mutf8 to tell Perl the program is encoded using UTF-8 rather than ASCII.
$ perl -pC -Mutf8 -e's/[à]/a/gu' <<< 'à'
a
perl -p -e's/à/a/gu' <<< 'à' worked for me.
One practical approach -- use Text::Unidecode
perl -C -MText::Unidecode -pe'unidecode($_)' <<< 'à'
Prints a.
Another approach: decompose characters ("normalize") using Unicode::Normalize, so that the character and its diacritical marks (combining accents) are separated into their own code points, while they still form a valid grapheme, then remove the diacriticals ( \p{NonspacingMark} or \p{Mn} ) with a simple regex.
Both of these ways will have exceptions and edge cases, but I think they may do just what you need.
Here's how I implemented steps 2 and 3.
This can be used, e.g., in these dictionaries (though I didn't test it on every language).
asciify-dic
#!/usr/bin/env bash
#License: "Zero-Clause BSD" <https://opensource.org/licenses/0BSD>
if [[ "$1" == "--help" ]]; then
    echo "Usage: $(basename "$0") INPUT_FILE > OUTPUT_FILE"
    echo "Asciify a .dic file (list of dictionary words)."
    echo ""
    echo "Generates a file with ASCII-only versions of the words that have non-ASCII chars."
    echo "These additional words can be used to make spell-checking accent-insensitive."
    echo "Comment lines beginning with % are left unchanged."
    exit
fi
# Filter words containing non-ASCII characters, except in comments
# ("$1" is quoted so that file names with spaces work)
grep -P '^\%|[^\x00-\x7F]' "$1" |
# Make words accent-insensitive, except in comments
perl -C -MText::Unidecode -pe'next if /^\s*%/;unidecode($_)' |
# Remove duplicate lines, except in comments
awk '/^\s*%/||!seen[$0]++'
Example usage:
asciify-dic "$DIC_NAME.dic" > "$DIC_NAME-asciified.dic"
The short answer is to add -Mutf8 to your command line.
If you're not sure how Perl is interpreting what you wrote on the command line, you can make it spit it back to you with the core B::perlstring() function or deparse the whole script with B::Deparse. That would illustrate your problem real fast. (Enclosing the 'à' character in brackets doesn't do anything here.)
$ perl -MO=Deparse -pC -e 's/à/a/gu' <<< 'à'
LINE: while (defined($_ = <ARGV>)) {
s/\303\240/a/gu;
}
continue {
die "-p destination: $!\n" unless print $_;
}
-e syntax OK
See how your substitution strangely has 2 characters in it (the two UTF-8 bytes of 'à')?
You can then see immediately how use utf8 fixes your problem.
$ perl -MO=Deparse -Mutf8 -pC -e 's/à/a/gu' <<< 'à'
use utf8;
LINE: while (defined($_ = <ARGV>)) {
s/\340/a/gu;
}
continue {
die "-p destination: $!\n" unless print $_;
}
-e syntax OK
You can use perlstring() to make sure Perl is receiving the input you think it is.
$ perl -p -MB -E 'say B::perlstring($_)' <<< 'à'
"\303\240\n"
à
$ perl -pC -MB -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à
You can see that without -C Perl is receiving 2 separate bytes (the UTF-8 encoding of 'à') instead of one decoded character.
Depending on the circumstances, Perl dumps characters as either an octal code ( \340 ) or a hexadecimal code ( \xE0 ). Note well here that you can always replace raw unicode characters in your command line with the escape code version. This is a great way to make explicit what otherwise would be ambiguous.
$ perl -pC -e 's/[\xE0]/a/gu' <<< 'à'
a
If you don't want to have to remember UTF8 mode, you can shove those options in the PERL5OPT environment variable or create a shell alias. Beware of making this global!
$ export PERL5OPT='-C -Mutf8'
$ perl -MO=Deparse -p -e 's/à/a/gu' <<< 'à'
use utf8;
LINE: while (defined($_ = <ARGV>)) {
s/\340/a/gu;
}
continue {
die "-p destination: $!\n" unless print $_;
}
-e syntax OK
$ perl -MB -p -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à
Or as a shell alias.
alias uperl='perl -C -Mutf8'
See perlrun for more information on how to Swiss Army Chainsaw the command line.
See also B::Deparse.