I have a website module that collects tweets from Twitter and splits them into words to store in a database. However, because the tweets often contain Turkish characters [ıöüğşçİÖÜĞŞÇ], my module cannot split the words correctly.
For example, the phrase Aynı labda çalıştığım is split into Ayn, labda and alıştığım, but it should be split into Aynı, labda and çalıştığım.
Here's the code that does the splitting:
preg_match_all('/(\A|\b)[A-Z\Ç\Ö\Ş\İ\Ğ\Ü]?[a-z\ç\ö\ş\ı\ğ\ü]+(\Z|\b)/u', $text,$a);
What do you think is wrong here?
Important note: I know I could simply split the text on the space character, but that is not what I need; I need to match exactly these letters. I don't want any digits or special characters such as [,.!@#$^&*123456780].
I need a regular expression that will split this kısa isimleri ile "Vic" ve "Wick" vardı.
into this:
kısa
isimleri
ile
Vic
ve
Wick
vardı
More examples:
We're @test would be
We
re
test
Föö bär, we're @test to0 ÅÄÖ - 123 ok? kthxbai? is currently split into this:
b
r
we
re
test
ok
kthxbai
but I want it to be:
Föö
bär
we
re
test
ÅÄÖ
ok
kthxbai
I would take a look at mb_split().
$str = 'We\'re @test Aynı labda çalıştığım';
var_dump(\mb_split('\s', $str));
Gives me:
array
0 => string 'We're' (length=5)
1 => string '@test' (length=5)
2 => string 'Aynı' (length=5)
3 => string 'labda' (length=5)
4 => string 'çalıştığım' (length=16)
This expression would give you the desired result (according to your examples):
/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u
\pL matches any Unicode letter. The lookarounds make sure a match is neither preceded nor followed by a letter or a number, which completely excludes words that contain any numbers.
Example:
$str = "Aynı, labda - çalıştığım? \"quote\". Föö bär, we're @test to0 ÅÄÖ - 123 ok? kthxbai?";
preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $str, $m);
print_r($m);
Output:
Array
(
[0] => Array
(
[0] => Aynı
[1] => labda
[2] => çalıştığım
[3] => quote
[4] => Föö
[5] => bär
[6] => we
[7] => re
[8] => test
[9] => ÅÄÖ
[10] => ok
[11] => kthxbai
)
)
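To see what the lookarounds contribute, here is a minimal sketch that runs the same input through a plain \pL+ pattern and through the pattern above. The helper extract_words() and the sample string are only illustrative and are not part of the original answer.

```php
<?php
// Hypothetical helper for this sketch: returns every full match of $pattern in $text.
function extract_words(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$text = "we're @test to0 ÅÄÖ - 123 ok?";

// Plain \pL+ : any run of letters, so the "to" of the mixed token "to0" is picked up.
$plain = extract_words('/\pL+/u', $text);

// With lookarounds: a run of letters only counts when no letter or number touches it,
// so "to0" is skipped entirely.
$strict = extract_words('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $text);

print_r($plain);
print_r($strict);
```

The first call should return we, re, test, to, ÅÄÖ, ok, while the second drops to because it is directly followed by a digit. The /u modifier is needed in both cases so that the pattern and the subject are treated as UTF-8.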
Just match any run of non-space characters between word boundaries.
preg_match_all('/\b(\S+)\b/', $text, $a);
This way it doesn't matter which characters are inside: as long as a character is not whitespace, it will be matched.