PHP: remove small words from string ignoring German characters in the words

Question

I am trying to create slugs for URLs.

I have the following test string:

$kw='Test-Tes-Te-T-Schönheit-Test';

I want to remove short words (fewer than three characters) from this string.

So I want the output to be:

$kw='test-tes-schönheit-test';

I have tried this code:

$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-",  $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;

But the result is:

test-tes-sch-nheit-test

So the German character ö is being removed from the string, and the German word Schönheit is being treated as two words.

Please suggest how to solve this.

Thank you very much.

Answer 1

I assume your string is not UTF-8. With umlauts and other non-ASCII characters in a regex, I think it is easier to encode to UTF-8 first, apply the regex with the u modifier (Unicode), and then, if you need the original encoding, decode again (according to your locale). So you would start with:

$kw = utf8_encode(strtolower($kw));

Now you can use the Unicode features of the regex engine. \p{L} matches letters and \p{N} matches numbers. If you consider all letters and numbers to be word characters (that is up to you), a boundary is the opposite:

[^\p{L}\p{N}]

You want all word-characters:

[\p{L}\p{N}]

You want the word only if it is preceded by the start of the string (^) or by a boundary. You can use a positive lookbehind for that:

(?<=[^\p{L}\p{N}]|^)

Replace at most two word characters followed by a boundary or the end of the string:

[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)

So your regex could look like this:

'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'

And decode back to your local encoding, if you like:

echo utf8_decode($kw);
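
Putting the steps above together, a minimal sketch of how the pattern might be applied. It assumes $kw is already valid UTF-8, so the encode and decode steps are left out. The function name removeShortWords is hypothetical, and the empty replacement string is an assumption: the pattern consumes the separator after each short word, so nothing needs to be put back.

```php
<?php
// Hypothetical helper: drops words of one or two letters/digits from a slug.
// Assumes $kw is valid UTF-8, which the /u modifier requires.
function removeShortWords(string $kw): string
{
    // Lowercase with mbstring when it is available; plain strtolower()
    // only reliably handles ASCII letters.
    $kw = function_exists('mb_strtolower')
        ? mb_strtolower($kw, 'UTF-8')
        : strtolower($kw);

    // A short word preceded by a boundary (or the start) and followed by
    // a boundary (or the end). The trailing boundary is part of the match,
    // so replacing with '' removes the word together with its separator.
    $pattern = '/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u';
    $kw = (string) preg_replace($pattern, '', $kw);

    // A short word at the very end leaves a dangling dash behind.
    return trim($kw, '-');
}

echo removeShortWords('Test-Tes-Te-T-Schönheit-Test'), "\n";
// prints: test-tes-schönheit-test
```

The final trim() is needed because a short word in last position has no separator after it to consume, only the one before it.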

Good luck! Robert

Answer 2

Your \b word boundaries trip over the ö because it is not treated as an alphanumeric character: by default, PCRE only recognizes ASCII letters.
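
A small check that illustrates the difference, assuming the script file itself is saved as UTF-8 and the default "C" locale is in effect (the variable names are only for illustration):

```php
<?php
// Without /u the pattern works on bytes. The two bytes of a UTF-8 "ö" are
// not ASCII word characters, so the word does not match \w+ as a whole.
$ascii = preg_match('/^\w+$/', 'schönheit');

// With /u the subject is read as UTF-8 and \w follows Unicode properties,
// so "ö" counts as a letter.
$unicode = preg_match('/^\w+$/u', 'schönheit');

var_dump($ascii, $unicode);
```

The same byte-versus-character distinction applies to \b, which is defined in terms of \w.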

Assuming the input string is UTF-8 (a Latin-1 string would have to be converted first, because /u requires valid UTF-8), use the /u Unicode modifier so that non-English letters are treated as letters:

$kw = preg_replace("/\b[^-]{1,2}\b/u", "-",  $kw);
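
As a hedged sketch, the question's code with this one change could be wrapped as follows. The function name slugWithoutShortWords is hypothetical, and returning null for input that is not valid UTF-8 is just one possible way to handle that case.

```php
<?php
// Hypothetical helper: the question's steps plus the /u modifier.
function slugWithoutShortWords(string $kw): ?string
{
    // /u needs valid UTF-8. An empty /u pattern matches any valid UTF-8
    // string, while preg_match() returns false for invalid input.
    if (preg_match('//u', $kw) !== 1) {
        return null; // the caller has to convert the encoding first
    }

    // Lowercase with mbstring when it is available; plain strtolower()
    // only reliably handles ASCII letters.
    $kw = function_exists('mb_strtolower')
        ? mb_strtolower($kw, 'UTF-8')
        : strtolower($kw);

    $kw = preg_replace('/\b[^-]{1,2}\b/u', '-', $kw); // short words become a dash
    $kw = preg_replace('/-+/', '-', $kw);             // collapse runs of dashes

    return trim($kw, '-');                            // drop outer dashes
}

echo slugWithoutShortWords('Test-Tes-Te-T-Schönheit-Test'), "\n";
// prints: test-tes-schönheit-test
```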

By the way, I would use preg_replace_callback (or the /e modifier, which was deprecated in PHP 5.5 and removed in PHP 7) and search for [A-Z] instead when replacing. For the dashes, use strtr, or just [-+]+ to collapse consecutive ones.
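
One possible reading of the preg_replace_callback suggestion, as a sketch only: visit every dash-separated word, drop it if it is shorter than three characters, and lowercase it otherwise. The function name slugViaCallback is hypothetical, the input is assumed to be valid UTF-8, and the answer does not spell out this exact approach.

```php
<?php
// Hypothetical helper; assumes $kw is valid UTF-8.
function slugViaCallback(string $kw): string
{
    $kw = preg_replace_callback(
        '/[^-]+/u', // each run of non-dash characters is one word
        function (array $m): string {
            $word = $m[0];
            // With /u, "." matches one code point, so this tests for
            // words of one or two characters.
            if (preg_match('/^.{1,2}$/us', $word) === 1) {
                return '';
            }
            // Lowercase with mbstring when it is available; plain
            // strtolower() only reliably handles ASCII letters.
            return function_exists('mb_strtolower')
                ? mb_strtolower($word, 'UTF-8')
                : strtolower($word);
        },
        $kw
    );

    // Removed words leave consecutive dashes behind; collapse and trim them.
    $kw = preg_replace('/-+/', '-', (string) $kw);

    return trim($kw, '-');
}

echo slugViaCallback('Test-Tes-Te-T-Schönheit-Test'), "\n";
// prints: test-tes-schönheit-test
```

Compared with the \b-based pattern, this version does not depend on what the regex engine considers a word character, only on the dash as separator.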

2 answers

solution1
2 ACCEPTED 2012-11-30 10:58:02

solution2
1 2012-11-30 05:45:05
