
Split string into array based on a unicode character range in PHP

Sorry for the ambiguous subject. What I'm looking for is a way to split a string with Cyrillic characters, such as

«Добрый день!» - сказал он, потянувшись…

into an array that goes like

[0] => «
[1] => Добрый␠
[2] => день!»␠-␠
[3] => сказал␠
[4] => он,␠
[5] => потянувшись…

So essentially I want a break to occur at the border between any character and a Cyrillic character (the [а-я] range), but only on the transition from a non-Cyrillic character to a Cyrillic one, not vice versa. I've seen examples that solve a similar problem for punctuation characters and the Latin alphabet with

preg_split('/([^.:!?]+[.:!?]+)/', 'hello:there.everyone!so.how?are:you', NULL, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

but my attempts to repurpose it into something different have so far failed:

preg_split ('/(?<=[^а-я])/ius', $text, NULL, PREG_SPLIT_NO_EMPTY);

This almost works, but it also splits after ordinary characters such as spaces and punctuation marks, which is not what I want. Clearly there's something wrong with my regex. How should I modify it to get the result shown in the example above?

Use the following regex solution:

$s = "«Добрый день!» - сказал он, потянувшись…";
$res = preg_split('/\b(\p{Cyrillic}+\W*)/u', $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY); // -1 = no limit; passing NULL here is deprecated since PHP 8.1
print_r($res);
// Array(
//   [0] => «
//   [1] => Добрый 
//   [2] => день!» - 
//   [3] => сказал 
//   [4] => он, 
//   [5] => потянувшись…
//)

See the PHP demo

Details:

  • \b(\p{Cyrillic}+\W*) - matches and captures a whole Cyrillic word together with zero or more non-word characters after it
  • The pattern is wrapped in capturing parentheses, so PREG_SPLIT_DELIM_CAPTURE pushes the captured values into the resulting array
  • PREG_SPLIT_NO_EMPTY discards empty values from the array
  • The /u modifier makes \b (word boundary) and \W Unicode-aware and allows the regex to process UTF-8 strings.
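The points above can be tried on other input with a small wrapper. This is only a sketch: the function name splitAtCyrillicWords is made up for illustration, and the mixed Latin/Cyrillic string is an assumed test case, not something from the question.

```php
<?php
// Hypothetical helper (not from the answer): wraps the pattern shown above.
function splitAtCyrillicWords(string $text): array
{
    $parts = preg_split(
        '/\b(\p{Cyrillic}+\W*)/u',
        $text,
        -1, // no limit
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $parts === false ? [] : $parts;
}

// The sample from the question.
$sample = splitAtCyrillicWords('«Добрый день!» - сказал он, потянувшись…');

// Assumed mixed input: the Latin word after "привет " becomes its own element,
// because it is leftover text between delimiters, not part of a match.
$mixed = splitAtCyrillicWords('Hello мир, привет world');

print_r($sample);
print_r($mixed);
```

Note that with mixed text this approach also breaks after a Cyrillic word that is followed by a non-Cyrillic one, which may or may not matter for the original requirement.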

How about splitting at an initial \b word boundary with the u modifier?

$res = preg_split('/\b(?=\w)(?!^)/u', $str);

The lookahead ensures \b is followed by a word character. (?!^) prevents an empty match at the start of the string.

See this demo at eval.in
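A quick sketch of how this pattern behaves, assuming the sample string from the question plus two made-up inputs. Since \w also covers Latin letters and digits, the split is not limited to Cyrillic words.

```php
<?php
// Pattern from the answer above: split before every word, except at offset 0.
$pattern = '/\b(?=\w)(?!^)/u';

$sample = preg_split($pattern, '«Добрый день!» - сказал он, потянувшись…');

// Assumed extra inputs: Latin words and digits are word characters too,
// so splits also occur before them.
$mixed  = preg_split($pattern, 'Hello мир, привет world');
$digits = preg_split($pattern, 'код 42 готов');

print_r($sample);
print_r($mixed);
print_r($digits);
```

For purely Cyrillic text with punctuation this gives the array asked for in the question; for mixed text a script-specific pattern may be the better fit.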

You also have to check, with a lookahead, that the next character is a Cyrillic one. This code will do the job:

$t = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY); // -1 = no limit; passing NULL here is deprecated since PHP 8.1

It gives this output:

Array
(
    [0] => «
    [1] => Добрый 
    [2] => день!» - 
    [3] => сказал 
    [4] => он, 
    [5] => потянувшись…
)

Here you can try it.
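One caveat worth checking: the range [а-я] spans U+0430 to U+044F and does not include ё (U+0451), so a word containing ё is treated as if it had a non-Cyrillic character in the middle. The sketch below compares the pattern above with an assumed variant that uses \p{Cyrillic} in the same lookarounds; the test string is made up for illustration.

```php
<?php
// "зелёный лес", written with an escape so the ё is certainly the precomposed U+0451.
$text = "зел\u{0451}ный лес";

// Pattern from the answer above; [а-я] excludes ё, so a split lands right after it.
$rangeBased = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY);

// Assumed variant: the same lookarounds, but based on the Cyrillic script property.
$scriptBased = preg_split('/(?<=\P{Cyrillic})(?=\p{Cyrillic})/u', $text, -1, PREG_SPLIT_NO_EMPTY);

// The variant applied to the sample from the question.
$sample = preg_split(
    '/(?<=\P{Cyrillic})(?=\p{Cyrillic})/u',
    '«Добрый день!» - сказал он, потянувшись…',
    -1,
    PREG_SPLIT_NO_EMPTY
);

print_r($rangeBased);
print_r($scriptBased);
print_r($sample);
```

With the range-based pattern the word is cut after ё, while the script-based variant keeps it whole.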

Try this regex: [\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]* (in PHP it needs the /u modifier, because the code points are above 0xFF). All Unicode characters from U+0400 to U+04FF are considered Cyrillic. The idea is to match each chunk rather than split, so it should match exactly what you want. You can also replace \x{0400}-\x{04FF} with \p{Cyrillic}, as suggested in another answer.

This is all the characters in that range:
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ
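A possible way to apply the range-based regex is preg_match_all rather than preg_split. Because both halves of the pattern are optional, it can also match an empty string (for instance at the end of the input), so the sketch below uses an assumed tightened variant in which every alternative needs at least one character.

```php
<?php
$text = '«Добрый день!» - сказал он, потянувшись…';

// Assumed tightening of the regex above: either a Cyrillic run followed by
// any non-Cyrillic tail, or a non-Cyrillic run on its own. Neither can be empty.
$pattern = '/[\x{0400}-\x{04FF}]+[^\x{0400}-\x{04FF}]*|[^\x{0400}-\x{04FF}]+/u';

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full matches in order of appearance.
$parts = $matches[0];
print_r($parts);
```

For the sample string this yields the same six chunks as the split-based answers.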
