Perl regex replace with UTF-8 characters

I'm stuck on a function I'm trying to write in Perl. It is supposed to filter a string so that only specific characters remain. I allow A-Z, a-z and 0-9, and I also want to allow some German umlauts. But every time I add them to my regular expression, the replacement fails.

My encoding is UTF-8 (server, Perl, scripts).

This is my function:

sub cleanXSS{

    my $string = shift;

    $string =~ s/[^A-Za-z0-9öäü]//g;

    return $string;
}

My script looks like this:

my $scalar = "áéíóúÁÉÍüÓÚâêÄîôßû()ÂÊÎÔÛabcäüöÄÜÖý#µzdjheäöü";
print cleanXSS($scalar)."\n";

So it should remove every character except A-Z, a-z, 0-9 and the lowercase umlauts. The German umlauts in my test string are handled fine, but all the other Latin characters seem to be only partially replaced.

The console output looks like this:

▒▒▒▒▒▒▒▒▒ü▒▒▒▒▒▒▒▒▒▒▒▒▒▒abcäüö▒▒▒▒zdjheäöü

I've tried many approaches, such as "use locale", other encodings, explicit encoding via "use Encode", and so on.

It seems that in a character like á only 1 of the 2 bytes is replaced. If I change my replacement to this:

$string =~ s/[^A-Za-z0-9öäü]/_/g;

I get the following output:

▒_▒_▒_▒_▒_ö▒_▒_▒_ü▒_▒_▒_▒_▒_▒_▒_▒_▒___▒_▒_▒_▒_▒_abcäüö▒_▒_▒_▒____zdjheäöü

How can I achieve this?

It seems that in a character like "á" only 1 of the 2 bytes is replaced.
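That observation can be reproduced in isolation. The following is a minimal sketch, not part of the original thread: it writes the bytes and code points as escapes so it does not depend on how the file itself is saved, and the variable names are made up for illustration. Without decoding, the class `[^A-Za-z0-9öäü]` is effectively a list of single bytes (C3 B6, C3 A4, C3 BC), so the lead byte C3 of "á" is allowed while its second byte A1 is not.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes of "á" (U+00E1), as Perl sees them without decoding.
my $bytes = "\xC3\xA1";
# The same text decoded into a single character.
my $chars = decode('UTF-8', $bytes);

# Byte-level class: the bytes of ö, ä, ü (C3 B6, C3 A4, C3 BC) are
# separate one-byte entries, so C3 is allowed but A1 is not.
(my $byte_result = $bytes) =~ s/[^A-Za-z0-9\xC3\xB6\xC3\xA4\xC3\xBC]/_/g;

# Character-level class: ö, ä, ü are three code points; á is none of them.
(my $char_result = $chars) =~ s/[^A-Za-z0-9\x{F6}\x{E4}\x{FC}]/_/g;

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
printf "byte-level result keeps 0xC3: %s\n", $byte_result eq "\xC3_" ? "yes" : "no";
printf "character-level result is '_': %s\n", $char_result eq "_" ? "yes" : "no";
```

The byte-level result is a stray C3 byte followed by an underscore, which is consistent with the "▒_" pairs in the console output above.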

  1. Decode inputs.

    You didn't tell Perl that your script is encoded in UTF-8. Add

     use utf8;

  2. Encode output.

    You'll also need the following to encode the output:

     use open ':std', ':encoding(UTF-8)';
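Put together, the two steps might look like the following minimal sketch. It reuses `cleanXSS` and the test string from the question; the only additions are the two pragmas, plus `strict` and `warnings`, which are not in the original script. It assumes the file is saved as UTF-8 and that the terminal displays UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # string and regex literals in this file are UTF-8
use open ':std', ':encoding(UTF-8)';   # encode STDOUT/STDERR, decode STDIN

sub cleanXSS {
    my ($string) = @_;

    # With "use utf8" in effect, ö, ä and ü are single characters in the
    # class, so the substitution works on characters rather than on bytes.
    $string =~ s/[^A-Za-z0-9öäü]//g;

    return $string;
}

my $scalar = "áéíóúÁÉÍüÓÚâêÄîôßû()ÂÊÎÔÛabcäüöÄÜÖý#µzdjheäöü";
print cleanXSS($scalar), "\n";         # prints: üabcäüözdjheäöü
```

Only the lowercase umlauts and the ASCII letters survive, which is the behaviour the question asks for.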

Put this line at the beginning of the script:

binmode STDOUT, ":encoding(UTF-8)";

See the doc
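As a hedged note on this variant: `binmode` only adds an encoding layer to the one handle it is given, so it covers the output side but does not decode the literals in the script. The sketch below therefore keeps `use utf8`, which this answer does not mention; the variable `$text` is made up for illustration, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                             # still needed: decodes the literals in this file

binmode STDOUT, ':encoding(UTF-8)';   # encodes characters to UTF-8 bytes on STDOUT only

my $text = "äöü";
print length($text), "\n";            # 3, because the literal holds three characters
print "$text\n";
```

Compared with `use open ':std', ':encoding(UTF-8)'`, this leaves STDIN, STDERR and any files opened later without an encoding layer, so each of those would need its own `binmode` call or an explicit layer in `open`.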
