PHP Regex for Accented Characters

I am trying to filter a variable so that only alphanumeric characters, spaces, accented characters, and single quotes are kept, and everything else is replaced by a space. So a string like:

substitué à une otage % ? vendredi 23 mars lors de l'attaque

should output:

substitué à une otage vendredi 23 mars lors de l'attaque

but the result I get is:

substitué à une otage vendredi 23 mars lors de l

Could you please help? This is my code:

$whitelist = "/[^a-zA-Z0-9а-àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý',. ]/";

$descreption =  preg_replace($whitelist, ' ', $ds);
}else{
    $errors = self::DESCREPTION_ERROR;
    return false;
}

Your regex is faulty. The part а-à gives the error "Character range is out of order". I guess the - was added there by mistake.

Then a small hint: ’ (the typographic apostrophe) is not ' (the plain ASCII apostrophe).

[^a-zA-Z0-9àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý'’,. ] 

should work fine.
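As a rough sketch, the corrected character class above could be used like this. The `/` delimiters and the `u` modifier are my additions (the class is shown bare in the answer), and the variable names are only illustrative. It assumes the script file and the input are both UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8 and $ds holds valid UTF-8 text.
// The u modifier (added here) makes PCRE treat each accented letter and the
// typographic apostrophe as one character instead of a sequence of bytes.
$whitelist = "/[^a-zA-Z0-9àâáçéèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý'’,. ]/u";

$ds = "substitué à une otage % ? vendredi 23 mars lors de l’attaque";

// Every character that is NOT in the whitelist is replaced by one space.
$description = preg_replace($whitelist, ' ', $ds);

echo $description, PHP_EOL;
```

Only `%` and `?` are replaced here, each by one space, so the spaces that already surrounded them remain and a run of spaces is left between "otage" and "vendredi".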

Also, if you're working with regular expressions, tools like RegExr or regex101 are really helpful.

One way to deal with the range of accented characters is to use the POSIX [:alnum:] class, which in PHP, in conjunction with the u modifier, matches all of them. It can then be put into a negated character class together with the other characters you want to keep, so that everything else is replaced:

$string = 'substitué à une otage % ? vendredi 23 mars lors de l’attaque';
echo preg_replace("/[^[:alnum:]'’,.]/u", ' ', $string);

Output (the % and ? and the spaces around them are each replaced by a space, so a run of spaces remains):

substitué à une otage     vendredi 23 mars lors de l’attaque

As has been pointed out in the comments, ’ is not the same as ', so it also needs to be added to the set of characters you want to keep.

Demo on 3v4l.org
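Because each disallowed character is replaced individually, a sequence such as " % ? " leaves several consecutive spaces behind. One possible variation, which is not part of the answer above, is to add a + quantifier so that a whole run of disallowed characters collapses into a single space. The variable name $clean and the trim() call are only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8.
$string = 'substitué à une otage % ? vendredi 23 mars lors de l’attaque';

// The space is not listed in the class, so it counts as "disallowed" too.
// With the + quantifier, a run such as " % ? " is matched as one unit
// and replaced by a single space.
$clean = preg_replace("/[^[:alnum:]'’,.]+/u", ' ', $string);

// trim() drops a leading or trailing space left by a replaced run.
echo trim($clean), PHP_EOL;
```

This prints the sentence with single spaces between the words, which matches the output asked for in the question.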

You may want to have a look at Unicode character properties.

Summary of my changes:

  • use \p{L} to match all letters
  • escape the hyphen ( \- )
  • support typewriter ( ' ) and typographic ( ’ ) apostrophes

Here is the result:

$whitelist = '/[^\p{L}0-9\-\'’,. ]/u';

There is probably room for further improvement. Finally, don't forget to add the u modifier!
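A minimal sketch of how that pattern might be wrapped for reuse. The function name sanitizeDescription is hypothetical and not from the original code. The null check reflects that preg_replace() returns null when the subject is not valid UTF-8 and the u modifier is set.

```php
<?php
// Hypothetical helper, not part of the original code.
// Assumption: this file is saved as UTF-8.
function sanitizeDescription(string $text): ?string
{
    // \p{L} matches any Unicode letter, 0-9 the ASCII digits, \- a literal
    // hyphen; both apostrophes, the comma, the period and the space are kept.
    $whitelist = '/[^\p{L}0-9\-\'’,. ]/u';

    // Returns null on failure, for example when $text is not valid UTF-8.
    return preg_replace($whitelist, ' ', $text);
}

$result = sanitizeDescription("Győr & Straße: l’été");
if ($result === null) {
    echo 'Input could not be processed (invalid UTF-8?)', PHP_EOL;
} else {
    echo $result, PHP_EOL;
}
```

Unlike an explicit list of accented characters, \p{L} also keeps letters such as ő and ß that nobody thought to enumerate. The & and : in the sample are each replaced by a space.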
