Regex for validating and sanitizing all English and non-English Unicode alphabet characters in PHP

Question

While there have been many questions about regexes for non-English characters, I have not been able to find a working answer. Moreover, there does not seem to be any simple PHP library that would help me filter non-English input.

Could you please suggest a regular expression that would allow

all English alphabet characters (abc...)
all non-English alphabet characters (šýüčá...)
spaces
letters in either case (case-insensitive)

in validation as well as sanitization. Essentially, I want either preg_match to report no match when the input contains anything other than the four categories above, or preg_replace to get rid of everything except these four categories.

Based on http://www.regular-expressions.info/unicode.html, I was able to create '/^((\\p{L}\\p{M}*)|(\\p{Cc})|(\\p{Z}))+$/ui'. This regular expression works well when validating input but not when sanitizing it.

EDIT:

The user enters 'český [jazyk]' as input. Using '/^[\\p{L}\\p{Zs}]+$/u' in preg_match, the script determines that the string contains disallowed characters (in this case '[' and ']'). Next, I would like to use preg_replace to delete those unwanted characters. What regular expression should I pass to preg_replace to match all characters that are not allowed by the regular expression above?

Answer 1

I think all you need is a character class like:

^[\p{L}\p{Zs}]+$

This means the whole string (or each line, with the (?m) option) must consist only of Unicode letters (\p{L}) and space separators (\p{Zs}).

The following demo applies the pattern line by line, using the m flag:

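// u: treat pattern and subject as UTF-8; m: ^ and $ match at every line
$re = "/^[\\p{L}\\p{Zs}]+$/um";
$str = "all english alphabet characters (abc...)\nall non-english alphabet characters (šýüčá...)\nspace s\nšýüčá šýüčá šýüčá ddd\nšýüčá eee 4\ncase insensitive";
preg_match_all($re, $str, $matches);
print_r($matches); // the lines made up only of letters and spaces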
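
For validating a single input string rather than a multi-line sample, a minimal sketch could look like the following. The helper name isLettersAndSpaces is hypothetical and not part of the original answer. It assumes the source file and the input are UTF-8, and it swaps $ for \z so that a trailing newline is rejected as well.

```php
<?php
// Hypothetical helper (not from the original answer): true only when $input
// consists entirely of Unicode letters and space separators.
function isLettersAndSpaces(string $input): bool
{
    // \z anchors at the very end of the string; unlike $, it does not
    // tolerate a trailing newline. The u flag enables UTF-8 mode.
    // preg_match returns 1 on a match, 0 on no match and false on error
    // (for example, invalid UTF-8 in the subject), so compare with 1.
    return preg_match('/^[\p{L}\p{Zs}]+\z/u', $input) === 1;
}

var_dump(isLettersAndSpaces('český jazyk'));   // bool(true)
var_dump(isLettersAndSpaces('český [jazyk]')); // bool(false)
var_dump(isLettersAndSpaces("jazyk\n"));       // bool(false)
```

Comparing against 1 matters here because preg_match never returns false for a plain non-match; it returns 0, and false is reserved for errors.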

To remove all characters that are not Unicode letters or spaces, negate the character class and drop the anchors:

// [^...] matches every run of characters that are neither letters nor spaces
$re = "/[^\\p{L}\\p{Zs}]+/u";
$str = "český [jazyk]";
echo preg_replace($re, "", $str);

Output:

český jazyk
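
The pattern in the question also allowed combining marks (\p{M}) after a letter. If the input may contain decomposed characters, such as "c" followed by a combining caron, the sanitizing class probably needs \p{M} too, otherwise the accents would be stripped. The sketch below is one way to do that; the helper name keepLettersAndSpaces is hypothetical, and it assumes UTF-8 input.

```php
<?php
// Hypothetical helper (not from the original answer): strips everything
// except letters, combining marks and space separators.
function keepLettersAndSpaces(string $input): string
{
    // \p{M} keeps combining marks so decomposed letters keep their accents.
    $clean = preg_replace('/[^\p{L}\p{M}\p{Zs}]+/u', '', $input);
    // preg_replace returns null on error (for example, invalid UTF-8),
    // in which case this sketch falls back to an empty string.
    return $clean ?? '';
}

echo keepLettersAndSpaces('český [jazyk]'), "\n"; // český jazyk
// "c" followed by U+030C (combining caron) is the decomposed form of č.
echo keepLettersAndSpaces("c\u{030C}esky 123!"), "\n";
```

Note that this class keeps a combining mark wherever it appears, not only directly after a letter, so it is slightly more permissive than the \p{L}\p{M}* sequence from the question.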

solution1
3 ACCEPTED 2015-04-23 08:41:17
