
PHP: remove short words from a string without breaking on German characters in the words

I am trying to create slugs for urls.

I have the following test string:

$kw='Test-Tes-Te-T-Schönheit-Test';

I want to remove words shorter than three characters from this string.

So the output should be

$kw='test-tes-schönheit-test';

I have tried this code:

$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-",  $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;

But the result is:

test-tes-sch-nheit-test

So the German character ö is being removed from the string, and the German word Schönheit is treated as two words.

Please suggest how to solve this.

Thank you very much.

I assume your string is not UTF-8. With umlauts and other non-ASCII characters in a regex, I think it is easier to encode to UTF-8 first, apply the regex with the u (Unicode) modifier, and then, if you need the original encoding, decode again (according to your locale). So you would start with:

$kw = utf8_encode(strtolower($kw));
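
As a rough sketch of this conversion step: utf8_encode() and utf8_decode() convert between ISO-8859-1 and UTF-8 and are deprecated as of PHP 8.2, so the example below uses iconv() instead. It assumes the input really is ISO-8859-1 (Latin-1); the sample bytes and variable names are made up for illustration.

```php
<?php
// Assumption: the original bytes are ISO-8859-1 (Latin-1), where "\xF6" is "ö".
$latin1 = "Sch\xF6nheit";

// Convert to UTF-8 so that the /u modifier can be used on the string.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// With /u, \p{L} recognises every character of the word as a letter.
$allLetters = preg_match('/^\p{L}+$/u', $utf8);

// Convert back only if the rest of the application expects Latin-1.
$roundTrip = iconv('UTF-8', 'ISO-8859-1', $utf8);

var_dump($utf8 === "Sch\xC3\xB6nheit", $allLetters === 1, $roundTrip === $latin1);
```

If the string is already UTF-8, this step should be skipped, since encoding it a second time would corrupt the umlauts.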

Now you can use the regex Unicode features: \p{L} matches letters and \p{N} matches numbers. If you treat all letters and numbers as word characters (that is up to you), a boundary is the opposite:

[^\p{L}\p{N}]

The word characters themselves are:

[\p{L}\p{N}]
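
To see what these classes do in practice, here is a small sketch that pulls the words out of the test string. It assumes the string is already UTF-8 and already lowercased; the variable names are only illustrative.

```php
<?php
// Assumption: this source file and the string are UTF-8 encoded.
$kw = 'test-tes-te-t-schönheit-test';

// Collect every run of letters/digits; /u makes PCRE read the subject as UTF-8.
preg_match_all('/[\p{L}\p{N}]+/u', $kw, $matches);
$words = $matches[0];

print_r($words);
```

Here "schönheit" comes back as a single word, because ö is matched by \p{L}.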

You want the word only if it is preceded by the start of the string (^) or by a boundary. A positive lookbehind expresses that:

(?<=[^\p{L}\p{N}]|^)

Match at most two word characters followed by a boundary or the end of the string:

[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)

So your regex could look like this:

'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
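
Put together, the steps above could be sketched like this. The function name remove_short_words is hypothetical, the empty replacement string is my assumption (the short word is dropped together with the separator that follows it), and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper; assumes $kw is valid UTF-8.
function remove_short_words(string $kw): string
{
    // mb_strtolower() needs the mbstring extension; fall back to strtolower()
    // (only reliable for ASCII letters) when it is missing.
    $kw = function_exists('mb_strtolower') ? mb_strtolower($kw, 'UTF-8') : strtolower($kw);

    // Drop each one- or two-character word together with the separator after it.
    $pattern = '/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u';
    $result = preg_replace($pattern, '', $kw);
    if ($result === null) {
        // preg_replace() returns null on error, e.g. when $kw is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // A short word at the very end leaves its preceding separator behind.
    return trim($result, '-');
}

echo remove_short_words('Test-Tes-Te-T-Schönheit-Test'), PHP_EOL;
```

For the test string from the question this prints test-tes-schönheit-test.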

And decode back to your local encoding, if you like:

echo utf8_decode($kw);

Good luck! Robert

Your \b word boundaries trip over the ö, because it is not treated as an alphanumeric character. By default, PCRE only treats ASCII letters as word characters.

The input string is in UTF-8 or Latin-1. To have non-English letters treated as letters, use the /u Unicode modifier (which requires the subject to be valid UTF-8):

$kw = preg_replace("/\b[^-]{1,2}\b/u", "-",  $kw);
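
As a minimal sketch, the code from the question with only the /u modifier added might look like this. The function name strip_short_words is hypothetical, and the input is assumed to be valid UTF-8 and already lowercased.

```php
<?php
// Hypothetical helper; assumes $kw is valid UTF-8 and already lowercased.
function strip_short_words(string $kw): string
{
    // With /u, \b is Unicode-aware, so "ö" no longer counts as a word boundary.
    $kw = preg_replace('/\b[^-]{1,2}\b/u', '-', $kw);

    // Collapse the runs of hyphens left behind, then trim the ends.
    $kw = preg_replace('/-+/', '-', $kw);
    return trim($kw, '-');
}

echo strip_short_words('test-tes-te-t-schönheit-test'), PHP_EOL;
```

For the lowercased test string this prints test-tes-schönheit-test. If the subject were Latin-1 rather than UTF-8, preg_replace() with /u would fail and return null, so the encoding has to be settled first.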

By the way, I would use preg_replace_callback or /e (note that /e was removed in PHP 7), and instead search for [A-Z] for replacing. And use strtr for the dashes, or just [-+]+ for collapsing consecutive ones.
