Regex for validating and sanitizing all English and non-English Unicode alphabet characters in PHP

While there have been many questions about regular expressions for non-English characters, I have not been able to find a working answer. Moreover, there does not seem to be any simple PHP library that would help me filter non-English input.

Could you please suggest a regular expression that would allow

  1. all English alphabet characters (abc...)
  2. all non-English alphabet characters (šýüčá...)
  3. spaces
  4. case-insensitive matching

in validation as well as sanitization. Essentially, I want either preg_match to return false when the input contains anything other than the four items above, or preg_replace to remove everything except these four categories.

I was able to create '/^((\\p{L}\\p{M}*)|(\\p{Cc})|(\\p{Z}))+$/ui' from http://www.regular-expressions.info/unicode.html. This regular expression works well when validating input but not when sanitizing it.

EDIT:

The user enters 'český [jazyk]' as input. Using '/^[\\p{L}\\p{Zs}]+$/u' in preg_match, the script determines that the string contains disallowed characters (in this case '[' and ']'). Next, I would like to use preg_replace to delete those unwanted characters. What regular expression should I pass to preg_replace to match all characters that are not allowed by the regular expression above?

I think all you need is a character class like:

^[\p{L}\p{Zs}]+$

It means: the whole string (or each line, with the (?m) option) can only contain Unicode letters or spaces.

Have a look at the demo.

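// The m flag makes ^ and $ match at each line; u enables Unicode (UTF-8) mode.
$re = "/^[\\p{L}\\p{Zs}]+$/um";
$str = "all english alphabet characters (abc...)\nall non-english alphabet characters (šýüčá...)\nspace s\nšýüčá šýüčá šýüčá ddd\nšýüčá eee 4\ncase insensitive";
preg_match_all($re, $str, $matches);
// $matches[0] holds the lines made up only of letters and spaces.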
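
For a single-string check, which is what the question asks for, a minimal sketch might wrap preg_match in a helper. The function name isLettersAndSpaces is hypothetical and not part of the original answer. Note that preg_match returns 1 or 0 for match or no match, and false only on error (for example, malformed UTF-8 with the u flag), so the sketch compares against 1 explicitly.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the original answer).
// Returns true only if $input consists solely of Unicode letters and space separators.
function isLettersAndSpaces(string $input): bool
{
    // \z anchors at the very end of the string; a plain $ would also accept
    // a single trailing newline.
    // preg_match yields 1 on match, 0 on no match, false on error.
    return preg_match('/^[\p{L}\p{Zs}]+\z/u', $input) === 1;
}

var_dump(isLettersAndSpaces('český jazyk'));   // bool(true)
var_dump(isLettersAndSpaces('český [jazyk]')); // bool(false)
```

No i flag is needed here, because \p{L} already covers both uppercase and lowercase letters.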

To remove all characters that are not Unicode letters or spaces, use this code:

// Match runs of characters that are neither letters nor space separators, then delete them.
$re = "/[^\\p{L}\\p{Zs}]+/u";
$str = "český [jazyk]";
echo preg_replace($re, "", $str);

The output of the sample program:

český jazyk
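
One caveat: the question's first pattern also allowed \p{M} (combining marks), while the negated class above does not, so accents stored as separate combining characters (decomposed input) would be stripped. Below is a sketch of a sanitizer that keeps them, assuming that behavior is wanted. The function name sanitizeLetters is hypothetical.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the original answer).
// Removes every character that is not a Unicode letter, a combining mark,
// or a space separator.
function sanitizeLetters(string $input): string
{
    $clean = preg_replace('/[^\p{L}\p{M}\p{Zs}]+/u', '', $input);
    // preg_replace returns null on error (e.g. malformed UTF-8); fall back to ''.
    return $clean ?? '';
}

echo sanitizeLetters('český [jazyk]'), "\n"; // český jazyk
// "e" followed by U+0301 (combining acute accent) is kept; "!" is removed.
echo sanitizeLetters("cafe\u{0301}!"), "\n";
```

This sketch keeps combining marks wherever they appear, including ones not attached to a letter. Normalizing the input to a composed form first would be an alternative design.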
