PHP Regex for Accented Characters
I am trying to filter a variable so that only alphanumeric characters, spaces, accented characters and single quotes are kept, and everything else is replaced by a space. So a string like:
substitué à une otage % ? vendredi 23 mars lors de l'attaque
should output:
substitué à une otage vendredi 23 mars lors de l'attaque
but the result I actually get is:
substitué à une otage vendredi 23 mars lors de l
Could you please help? This is my code:
$whitelist = "/[^a-zA-Z0-9а-àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý',. ]/";
$descreption = preg_replace($whitelist, ' ', $ds);
}else{
$errors = self::DESCREPTION_ERROR;
return false;
}
Your regex is faulty. The part а-à gives the error "Character range is out of order"; I guess the - was added there by mistake...
Then a small hint: ’ is not ' (the typographic apostrophe in the text is a different character from the straight ASCII one in the pattern).
[^a-zA-Z0-9àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý'’,. ]
should work fine.
Also, if you're working with regex, tools like RegExr or regex101 are really helpful.
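As a quick check, the corrected character class can be run against the sample string from the question. This is only a sketch: the variable names are made up for illustration, and the trailing u modifier is an addition that is not part of the pattern above. It is included on the assumption that the input is UTF-8, so that accented letters are matched as whole characters rather than as individual bytes.

```php
<?php
// Sample input from the question, here with a typographic apostrophe (assumed UTF-8).
$input = 'substitué à une otage % ? vendredi 23 mars lors de l’attaque';

// Corrected whitelist: no stray range, both apostrophe forms allowed.
// The trailing "u" (UTF-8 mode) is an assumption added for this sketch.
$whitelist = "/[^a-zA-Z0-9àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý'’,. ]/u";

// Every character outside the whitelist becomes a single space.
$cleaned = preg_replace($whitelist, ' ', $input);

echo $cleaned, PHP_EOL;
```

Since each removed character is replaced by its own space, runs of several spaces can remain where "% ?" used to be; collapsing them would be a separate step.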
One way to deal with the range of accented characters is to use the POSIX [:alnum:] class, which in PHP, in conjunction with the u modifier, will match all of them. That can then be put into a negated character class together with the other characters you want to keep, so that everything else gets replaced:
$string = 'substitué à une otage % ? vendredi 23 mars lors de l’attaque';
echo preg_replace("/[^[:alnum:]'’,.]/u", ' ', $string);
Output:
substitué à une otage vendredi 23 mars lors de l’attaque
As has been pointed out in the comments, ’ is not the same as ' and so it also needs to be added to the set of characters you want to keep.
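To see why the two apostrophes have to be listed separately, it can help to look at their raw bytes. The snippet below is a small illustration, assuming the source file is saved as UTF-8; the variable names are chosen for this example only.

```php
<?php
// The straight (ASCII) apostrophe and the typographic one are different characters.
$straight    = "'"; // U+0027
$typographic = '’'; // U+2019, assuming this file is saved as UTF-8

// bin2hex() shows the raw bytes behind each string.
echo bin2hex($straight), PHP_EOL;    // 27
echo bin2hex($typographic), PHP_EOL; // e28099

// They never compare equal, so a pattern that lists only one misses the other.
var_dump($straight === $typographic); // bool(false)
```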
You may have a look at Unicode character properties.
Summary of my changes:
- \p{L} to match all letters
- the escaped hyphen ( \- )
- plain ( ' ) and typographic ( ’ ) apostrophes
Here is the result:
$whitelist = '/[^\p{L}0-9\-\'’,. ]/u';
There is probably room for even further improvement. Finally, don't forget to add the u modifier!
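A minimal sketch of how this whitelist might be wrapped and used is shown below. The function name sanitizeDescription is hypothetical and not from the question's code. It also illustrates one consequence of the u modifier: the subject must be valid UTF-8, otherwise preg_replace() returns null.

```php
<?php
/**
 * Hypothetical helper (name chosen for this sketch): replaces every character
 * that is not a letter, digit, hyphen, apostrophe, comma, period or space
 * with a single space.
 */
function sanitizeDescription(string $text): ?string
{
    $whitelist = '/[^\p{L}0-9\-\'’,. ]/u';

    // With the u modifier, preg_replace() returns null for invalid UTF-8 input.
    return preg_replace($whitelist, ' ', $text);
}

// Sample string from the question: "%" and "?" are replaced by spaces.
echo sanitizeDescription('substitué à une otage % ? vendredi 23 mars lors de l’attaque'), PHP_EOL;

// An invalid UTF-8 byte sequence makes the call fail instead of matching.
var_dump(sanitizeDescription("\xC3\x28")); // NULL
```

Checking the return value for null before using it would be a sensible precaution if the input encoding is not guaranteed.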