Perl regex replace with UTF-8 characters
I am struggling with a function that I am trying to write in Perl. The function filters a string for specific characters. I allow A-Z, a-z and 0-9, and I also want to allow some German umlauts. But every time I add them to my regular expression, the replacement fails.
My encoding is UTF-8 (server, Perl, scripts).
This is my function:
sub cleanXSS{
my $string = shift;
$string =~ s/[^A-Za-z0-9öäü]//g;
return $string;
}
My script looks like this:
my $scalar = "áéíóúÁÉÍüÓÚâêÄîôßû()ÂÊÎÔÛabcäüöÄÜÖý#µzdjheäöü";
print cleanXSS($scalar)."\n";
So it should remove all characters except A-Z, a-z, 0-9 and the lowercase umlauts. The German umlauts in my test string are handled correctly, but all other Latin characters seem to be only partially removed.
The console output looks like this:
▒▒▒▒▒▒▒▒▒ü▒▒▒▒▒▒▒▒▒▒▒▒▒▒abcäüö▒▒▒▒zdjheäöü
I've tried many approaches, such as "use locale", other encodings, explicit encoding via "use Encode" and so on.
It seems that in a character like á only 1 of the 2 bytes is replaced. If I change my replacement to this:
$string =~ s/[^A-Za-z0-9öäü]/_/g;
I get the following output:
▒_▒_▒_▒_▒_ö▒_▒_▒_ü▒_▒_▒_▒_▒_▒_▒_▒_▒___▒_▒_▒_▒_▒_abcäüö▒_▒_▒_▒____zdjheäöü
How can I make the replacement work on whole characters?
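You wrote: "It seems that in a character like á only 1 of the 2 bytes is replaced."
Decode inputs.
You didn't tell Perl that your script is encoded using UTF-8. Without that, Perl reads each byte of the source as a separate character, so the character class holds the individual bytes of ö, ä and ü rather than three characters. Add: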
use utf8;
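To see why only half of á disappears, it may help to look at the bytes. The following is a minimal sketch, not part of the original answer: utf8_hex is a hypothetical helper name, and the byte-level character class is written out by hand to imitate what Perl sees when "use utf8" is missing.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of a code point as hex pairs.
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

# á, ö, ä, ü: all four start with the same lead byte C3.
printf "U+%04X -> %s\n", $_, utf8_hex($_) for 0xE1, 0xF6, 0xE4, 0xFC;

# Without "use utf8", the class [^A-Za-z0-9öäü] is effectively a class of
# bytes: C3 B6 (ö), C3 A4 (ä), C3 BC (ü). Imitate that on the bytes of á.
my $input = encode('UTF-8', chr(0xE1));                      # bytes C3 A1
(my $output = $input) =~ s/[^A-Za-z0-9\xC3\xB6\xA4\xBC]//g;  # A1 is removed, C3 is kept

printf "kept %d of %d bytes: %s\n",
    length($output), length($input), uc unpack('H*', $output);
```

The lead byte C3 survives because it is also the lead byte of the allowed umlauts, and a lone C3 is not valid UTF-8, which is consistent with the ▒ placeholders in the console output.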
Encode output.
You'll also need the following to encode the output:
use open ':std', ':encoding(UTF-8)';
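Putting both pragmas together, the script from the question could look like the sketch below. It assumes the file is saved as UTF-8 and that the terminal expects UTF-8; "use strict" and "use warnings" are additions that were not in the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # string and regex literals in this file are UTF-8
use open ':std', ':encoding(UTF-8)';   # encode STDIN, STDOUT and STDERR as UTF-8

# Remove every character except A-Z, a-z, 0-9 and the lowercase umlauts.
sub cleanXSS {
    my $string = shift;
    $string =~ s/[^A-Za-z0-9öäü]//g;
    return $string;
}

my $scalar = "áéíóúÁÉÍüÓÚâêÄîôßû()ÂÊÎÔÛabcäüöÄÜÖý#µzdjheäöü";
print cleanXSS($scalar), "\n";   # prints: üabcäüözdjheäöü
```

With "use utf8" the regex works on characters instead of bytes, so á is matched and removed as a whole. Data that arrives from outside the script (a file, a socket, a form submission) is not covered by "use utf8" and still has to be decoded, for example through an I/O layer or Encode's decode function.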