While there have been many questions about matching non-English characters with regular expressions, I have not been able to find a working answer. Moreover, there does not seem to be any simple PHP library that would help me filter non-English input.
Could you please suggest a regular expression which would allow all English alphabet characters (abc...), all non-English alphabet characters (šýüčá...), spaces, and case-insensitive matching, in validation as well as sanitization? Essentially, I want either preg_match to reject input that contains anything other than the 4 points above, or preg_replace to get rid of everything except these 4 categories.
I was able to create '/^((\\p{L}\\p{M}*)|(\\p{Cc})|(\\p{Z}))+$/ui'
from http://www.regular-expressions.info/unicode.html. This regular expression works well when validating input but not when sanitizing it.
EDIT:
User enters 'český [jazyk]' as an input. Using '/^[\\p{L}\\p{Zs}]+$/u'
in preg_match, the script determines that the string contains disallowed characters (in this case '[' and ']'). Next, I would like to use preg_replace to delete those unwanted characters. What regular expression should I pass to preg_replace to match all characters that are not allowed by the regular expression stated above?
I think all you need is a character class like:
^[\p{L}\p{Zs}]+$
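It means: the whole string (or each line, with the (?m) option) must consist of one or more Unicode letters (\p{L}) or space separators (\p{Zs}), and nothing else. Note that \p{Zs} covers space characters only, not tabs or line breaks.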
Have a look at the demo:
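// u = treat pattern and subject as UTF-8, m = let ^ and $ match at each line
$re = "/^[\\p{L}\\p{Zs}]+$/um";
$str = "all english alphabet characters (abc...)\nall non-english alphabet characters (šýüčá...)\nspace s\nšýüčá šýüčá šýüčá ddd\nšýüčá eee 4\ncase insensitive";
// $matches[0] receives the lines that contain only letters and spaces
preg_match_all($re, $str, $matches);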
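For validating a single input value rather than a multi-line sample, the same character class can be wrapped in a small helper. This is only a sketch: the function name isLettersAndSpaces is hypothetical, and it assumes the input is meant to be UTF-8. It differs from the pattern above in one deliberate way: it anchors with \z instead of $, because $ (without the m or D modifier) also matches just before a trailing newline. preg_match returns 1 on a match, 0 on no match, and false on error (for example, malformed UTF-8 with the u modifier), hence the strict comparison.

```php
<?php
// Hypothetical helper (not part of the original answer): returns true only if
// $input consists solely of Unicode letters and space separators.
function isLettersAndSpaces(string $input): bool
{
    // \z anchors at the very end of the string, so "abc\n" is rejected.
    // preg_match yields 1 (match), 0 (no match) or false (error).
    return preg_match('/^[\p{L}\p{Zs}]+\z/u', $input) === 1;
}

var_dump(isLettersAndSpaces('český jazyk'));   // bool(true)
var_dump(isLettersAndSpaces('český [jazyk]')); // bool(false)
```

The i modifier from the question is not needed here, since \p{L} already matches both uppercase and lowercase letters.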
To remove all symbols that are not Unicode letters or spaces, use this code:
$re = "/[^\\p{L}\\p{Zs}]+/u";
$str = "český [jazyk]";
echo preg_replace($re, "", $str);
The output of the sample program:
český jazyk
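One caveat, assuming the input may arrive in decomposed (NFD) form: \p{L} does not match combining marks, so an accent stored as a separate code point would be stripped by the pattern above. The pattern in the question accounted for this with \p{M}. A minimal sketch that keeps marks as well; the function name sanitizeLettersAndSpaces is hypothetical, and returning an empty string on error is just one possible choice:

```php
<?php
// Hypothetical helper (not part of the original answer): removes everything
// except Unicode letters, combining marks and space separators.
function sanitizeLettersAndSpaces(string $input): string
{
    // preg_replace returns null on error (e.g. malformed UTF-8 with /u);
    // this sketch falls back to an empty string in that case.
    return preg_replace('/[^\p{L}\p{M}\p{Zs}]+/u', '', $input) ?? '';
}

echo sanitizeLettersAndSpaces('český [jazyk]'), "\n"; // český jazyk

// "c" followed by U+030C COMBINING CARON: the mark is kept, the brackets are removed.
echo sanitizeLettersAndSpaces("c\u{030C}esky [jazyk]"), "\n";
```

If the input is always in precomposed (NFC) form, the simpler class without \p{M} behaves the same for accented letters.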