I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words less than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
so, the German character ö is getting removed from the string and German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.
I assume, your string is not UTF-8. With Umlauts/NON-ASCII characters and regex I think, its easier first to encode to UTF-8 and then - after applying the regex with u-modifier (unicode) - if you need the original encoding, decode again (according to local). So you would start with:
$kw = utf8_encode(strtolower($kw));
Now you can use the regex-unicode functionalities. \\p{L} is for letters and \\p{N} for numbers. If you consider all letters and numbers as word-characters (up to you) your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
You want the word, if there is a start ^ or boundary before. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Replace max 2 "word-characters" followed by a boundary or the end:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode to your local, if you like:
echo utf8_decode($kw);
Good luck! Robert
Your \\b
word boundaries trip over the ö
, because it's not an alphanumeric character. Per default PCRE works on ASCII letters.
The input string is in UTF-8/Latin-1. To treat other non-English letter symbols as such, use the /u
Unicode modifer :
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
I would use preg_replace_callback
or /e
btw, and instead search for [AZ]
for replacing. And strtr
for the dashes or just [-+]+
for collapsing consecutive ones.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.