How to split text into Unicode words with Regular Expression in PHP

I have a website module that collects tweets from Twitter and splits them into words to store in a database. However, because the tweets often contain Turkish characters [ıöüğşçİÖÜĞŞÇ], my module does not split the words correctly.

For example, the phrase Aynı labda çalıştığım is split into Ayn, labda and alıştığım, but it should have been split into Aynı, labda and çalıştığım.

Here's my code which does the job:

preg_match_all('/(\A|\b)[A-Z\Ç\Ö\Ş\İ\Ğ\Ü]?[a-z\ç\ö\ş\ı\ğ\ü]+(\Z|\b)/u', $text,$a);

What do you think is wrong here?

Important note: I am deliberately not splitting the text on the space character; I need to match exactly these characters. I don't want any digits or special characters such as [,.!@#$^&*123456780].

I need a regular expression that will split this: kısa isimleri ile "Vic" ve "Wick" vardı.

into this:

kısa
isimleri
ile
Vic
ve
Wick
vardı

More examples:

We're @test would be

We
re
test

Föö bär, we're @test to0 ÅÄÖ - 123 ok? kthxbai? is currently split into this:

b
r
we
re
test
ok
kthxbai

but I want it to be:

Föö
bär
we
re
test
ÅÄÖ
ok
kthxbai

I would take a look at mb_split().

$str = 'We\'re @test Aynı labda çalıştığım';
var_dump(\mb_split('\s', $str));

Gives me:

array
  0 => string 'We're' (length=5)
  1 => string '@test' (length=5)
  2 => string 'Aynı' (length=5)
  3 => string 'labda' (length=5)
  4 => string 'çalıştığım' (length=16)

This expression would give you the desired result (according to your examples):

/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u

\pL matches any Unicode letter and \pN matches any Unicode number. The lookarounds make sure the match is neither preceded nor followed by a letter or a number, which completely excludes words containing any digits (such as to0).
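To see what the lookarounds contribute, here is a minimal sketch comparing a plain \pL+ with the guarded pattern. The sample string is borrowed from the question, the variable names are only illustrative, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Sample input borrowed from the question (file assumed to be UTF-8 encoded).
$str = "we're @test to0 ÅÄÖ - 123 ok?";

// Plain \pL+ : every run of letters, including the "to" inside "to0".
preg_match_all('/\pL+/u', $str, $plain);

// Guarded pattern: a run of letters that is not adjacent to any letter or number,
// so tokens that mix letters and digits are dropped entirely.
preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $str, $guarded);

echo implode(', ', $plain[0]), "\n";   // we, re, test, to, ÅÄÖ, ok
echo implode(', ', $guarded[0]), "\n"; // we, re, test, ÅÄÖ, ok
```

The only difference between the two results is the fragment "to", which the plain pattern pulls out of "to0".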

Example:

$str = "Aynı, labda - çalıştığım? \"quote\". Föö bär, we're @test to0 ÅÄÖ - 123 ok? kthxbai?";
preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $str, $m);
print_r($m);

Output:

Array
(
    [0] => Array
        (
            [0] => Aynı
            [1] => labda
            [2] => çalıştığım
            [3] => quote
            [4] => Föö
            [5] => bär
            [6] => we
            [7] => re
            [8] => test
            [9] => ÅÄÖ
            [10] => ok
            [11] => kthxbai
        )

)
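For the database use case in the question, the same pattern could be wrapped in a small helper. This is only a sketch: the function name split_unicode_words is hypothetical, and returning an empty array on failure is an assumption about how errors should be handled, not something the answer specifies.

```php
<?php
/**
 * Hypothetical helper (not part of the original answer): returns the
 * letter-only words of a UTF-8 string as a flat array.
 */
function split_unicode_words(string $text): array
{
    $ok = preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $text, $m);

    // preg_match_all() returns false on failure, for example when the
    // subject is not valid UTF-8 while the /u modifier is in use.
    return $ok === false ? [] : $m[0];
}

// Sentence taken from the question.
$words = split_unicode_words('kısa isimleri ile "Vic" ve "Wick" vardı.');
echo implode("\n", $words), "\n";
```

With the sentence from the question, this prints kısa, isimleri, ile, Vic, ve, Wick and vardı, one per line.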

Just match any run of non-space characters placed between word boundaries.

preg_match_all('/\b(\S+)\b/', $text, $a);

This way, it doesn't matter what characters are inside; as long as a character is not a space, it will be matched.
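As a quick check of what this pattern returns, here is a small sketch on an ASCII-only sample (a shortened version of the question's example, chosen so the result does not depend on Unicode handling). Note that tokens with digits and apostrophes are kept, which differs from the output the question asks for.

```php
<?php
// ASCII-only sample, shortened from the question's example.
$text = "We're @test to0 - 123 ok?";

preg_match_all('/\b(\S+)\b/', $text, $a);

// $a[1] holds the captured group for each match.
echo implode(', ', $a[1]), "\n"; // We're, test, to0, 123, ok
```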
