Perl regex replacement with UTF-8 characters

Question

I am despairing over a function that I am trying to write in Perl. The function filters a string for specific characters. I allow characters such as A-Z, a-z, and 0-9, and I also want to allow some German umlauts. But every time I add them to my regular expression, the replacement fails.

My encoding is UTF-8 (server, Perl, scripts).

This is my function:

sub cleanXSS{

    my $string = shift;

    $string =~ s/[^A-Za-z0-9öäü]//g;

    return $string;
}

My script looks like this:

my $scalar = "áéíóúÁÉÍüÓÚâêÄîôßû()ÂÊÎÔÛabcäüöÄÜÖý#µzdjheäöü";
print cleanXSS($scalar)."\n";

So it should remove all characters except A-Z, a-z, 0-9, and the lowercase umlauts. The German umlauts in my test string are handled correctly, but all the other Latin characters seem to be only partially replaced.

The console output looks like this:

▒▒▒▒▒▒▒▒▒ü▒▒▒▒▒▒▒▒▒▒▒▒▒▒abcäüö▒▒▒▒zdjheäöü

I've tried many approaches, such as "use locale", other encodings, explicit encoding via "use Encode", and so on.

It seems that in a character like á only 1 of the 2 bytes is replaced. If I change my replacement to this:

$string =~ s/[^A-Za-z0-9öäü]/_/g;

I get the following output:

▒_▒_▒_▒_▒_ö▒_▒_▒_ü▒_▒_▒_▒_▒_▒_▒_▒_▒___▒_▒_▒_▒_▒_abcäüö▒_▒_▒_▒____zdjheäöü

How can I achieve this?

Answer 1

It seems that in a character like "á" only 1 of the 2 bytes is replaced.
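A minimal sketch of why that happens, assuming the script is run without `use utf8`: Perl then treats the source as bytes, so "á" is the two bytes C3 A1 and the "ö" inside the character class is the two bytes C3 B6. The variable names here are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA1";               # the two UTF-8 bytes of "á"
my $chars = decode('UTF-8', $bytes);  # one character, U+00E1

print length($bytes), "\n";           # 2 (bytes)
print length($chars), "\n";           # 1 (character)

# Without `use utf8`, a class containing "ö" is really the bytes C3 and B6.
# The negated class therefore keeps C3 and removes only A1.
(my $left = $bytes) =~ s/[^\xC3\xB6]//g;
print length($left), "\n";            # 1: half of "á" survives
```

The surviving lone C3 byte is not valid UTF-8 on its own, which would explain the ▒ placeholders in the console output.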

Decode inputs.
You didn't tell Perl that your script is encoded in UTF-8. Add:
```
 use utf8; 
```
Encode output.
You'll also need the following to encode the output:
```
 use open ':std', ':encoding(UTF-8)'; 
```
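Putting both pragmas together with the function from the question gives something like the following sketch. It assumes the script file itself is saved as UTF-8; the function body and test string are copied unchanged from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # string and regex literals in this file are UTF-8 text
use open ':std', ':encoding(UTF-8)';   # decode STDIN, encode STDOUT and STDERR

sub cleanXSS {
    my ($string) = @_;

    # The class now holds three characters (ö, ä, ü), not six bytes.
    $string =~ s/[^A-Za-z0-9öäü]//g;

    return $string;
}

my $scalar = "áéíóúÁÉÍüÓÚâêÄîôßû()ÂÊÎÔÛabcäüöÄÜÖý#µzdjheäöü";
print cleanXSS($scalar), "\n";         # üabcäüözdjheäöü
```

With decoded characters on both sides of the match, each accented letter is either kept or removed as a whole, so no stray half-characters reach the terminal.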

Answer 2

Put this line at the beginning of the script:

binmode STDOUT, ":encoding(UTF-8)";

See the documentation.
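Note that `binmode` only sets the encoding layer on the output handle. A short sketch of how it might be combined with the rest of the script, assuming the source is also declared as UTF-8 with `use utf8` (that pragma is an addition here, not part of this answer):

```perl
use strict;
use warnings;
use utf8;                             # assumption: literals below are UTF-8 text

binmode STDOUT, ':encoding(UTF-8)';   # encode characters to UTF-8 bytes on output

my $string = "áäöü#";
$string =~ s/[^A-Za-z0-9öäü]//g;      # operates on characters, not bytes
print "$string\n";                    # äöü
```

Unlike `use open ':std', ...`, this affects STDOUT only, so STDIN and STDERR would need their own `binmode` calls if they carry non-ASCII text.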

Answer 1: score 7, accepted, 2014-01-13 13:38:28
Answer 2: score 0, 2014-01-13 13:33:04
