PHP: How to extract from a (multibyte) string all predefined substrings?

I'd like to split a string (representing a word) into letters and predefined multi-letter sequences. In other words, I'd like to extract predefined substrings from a string, matching in a "greedy" way and in the order in which they occur.

For example, if my array of substrings contained all Latin letters plus the Polish digraphs ['ch', 'cz', 'dz', 'dź', 'dż', 'rz', 'sz'], then szczebrzeszyn would be parsed into ['sz', 'cz', 'e', 'b', 'rz', 'e', 'sz', 'y', 'n'].

Of course I could write some nested loops comparing character by character, but maybe there is a more creative and efficient way to obtain this result using the built-in string functions? How can I do this in PHP in an efficient and multi-byte safe way?

preg_match_all('/sz|cz|\X/u', 'wszczęcie', $matches);
print_r($matches);

returns:

Array
(
    [0] => Array
        (
            [0] => w
            [1] => sz
            [2] => cz
            [3] => ę
            [4] => c
            [5] => i
            [6] => e
        )
)

So the above code seems to do the job. The important points are: the alternatives are tried in the order they are provided, so longer ones should go first ('cz' should be matched before 'c', etc.). The trailing \X matches a single extended grapheme cluster, so it picks up every letter that is not part of a listed sequence. And the u flag is important to make it multi-byte safe.
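If the sequences live in an array, as in the question, the pattern can be built from it rather than typed by hand. This is only a rough sketch: splitIntoUnits is a made-up helper name, it assumes PHP 7.4+ (arrow functions) and that every sequence is a non-empty UTF-8 string. It sorts the sequences longest-first, escapes them with preg_quote, and appends \X as the fallback.

```php
<?php
// Hypothetical helper (not part of the original answer): splits $word into the
// given multi-letter sequences, falling back to single letters via \X.
function splitIntoUnits(string $word, array $sequences): array
{
    // Longest first. Byte length is enough here: a sequence that extends
    // another one always has more bytes than the shorter one.
    usort($sequences, fn(string $a, string $b): int => strlen($b) <=> strlen($a));

    // Escape regex metacharacters so each sequence is matched literally.
    $alternatives = array_map(fn(string $s): string => preg_quote($s, '/'), $sequences);

    // Fallback: any single extended grapheme cluster.
    $alternatives[] = '\X';

    $pattern = '/' . implode('|', $alternatives) . '/u';
    preg_match_all($pattern, $word, $matches);

    return $matches[0];
}

$digraphs = ['ch', 'cz', 'dz', 'dź', 'dż', 'rz', 'sz'];
print_r(splitIntoUnits('szczebrzeszyn', $digraphs));
```

For the word from the question this should give sz, cz, e, b, rz, e, sz, y, n. Single letters do not need to be listed in the array, since \X covers them.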
