PHP: split text into sentences on half-width and full-width punctuation

I need to split my text into an array at every period, exclamation and question mark.

Example with a full-width period and exclamation mark:

$string = "日本語を勉強しているみんなを応援したいです。一緒に頑張りましょう!";

I am looking for the following output:

Array
(
    [0] => 日本語を勉強しているみんなを応援したいです。
    [1] => 一緒に頑張りましょう!
)

I need the same code to work with half-width punctuation.

Example with a mix of full-width and half-width:

$string = "Hi. I am Bob! Nice to meet you. 日本語を勉強しています。Do you understand me?";

Output:

Array
(
    [0] => Hi.
    [1] => I am Bob!
    [2] => Nice to meet you.
    [3] => 日本語を勉強しています。
    [4] => Do you understand me?
)

I suck at regular expressions and can't figure out a solution nor find one.

I tried:

$string = preg_split('(.*?[。?!])', $string);

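First of all, you left out the usual delimiters (most commonly a slash). PHP accepts the outer parentheses as a delimiter pair, so your pattern does run, but it matches the sentences themselves, and preg_split() throws away whatever the pattern matches.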

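PCRE supports a number of Unicode character properties; the one you want is \pP (any Unicode punctuation character). Remember the u modifier, which makes the pattern and the subject be treated as UTF-8, and split right after each punctuation character: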

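<?php

$str = 'Hi. I am Bob! Nice to meet you. 日本語を勉強しています。Do you understand me?';

// Split right after any Unicode punctuation character, swallowing trailing whitespace.
// -1 means "no limit" (passing null here is deprecated as of PHP 8.1).
$array = preg_split('/(?<=\pP)\s*/u', $str, -1, PREG_SPLIT_NO_EMPTY);

print_r($array);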

The PREG_SPLIT_NO_EMPTY is there to make sure that we don't include an empty match if your last character is punctuation.

Output:

Array
(
    [0] => Hi.
    [1] => I am Bob!
    [2] => Nice to meet you.
    [3] => 日本語を勉強しています。
    [4] => Do you understand me?
)
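To see what PREG_SPLIT_NO_EMPTY changes, here is a small sketch that runs the same pattern with and without the flag. The short sample string and the variable names are only for illustration.

```php
<?php

$str = 'Hi. I am Bob!';
$pattern = '/(?<=\pP)\s*/u';

// Without the flag, the zero-width match after the final "!" leaves
// an empty string as the last piece.
$withEmpty = preg_split($pattern, $str);

// With the flag, empty pieces are dropped from the result.
$withoutEmpty = preg_split($pattern, $str, -1, PREG_SPLIT_NO_EMPTY);

var_dump(end($withEmpty) === ''); // is the last piece an empty string?
print_r($withoutEmpty);
```

If your input never ends in punctuation the flag makes no difference, so it is harmless to leave it in.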

Regex autopsy:

  • / - the start delimiter; it must appear again at the end, before the modifiers
  • (?<=\pP) - a positive lookbehind for \pP (any Unicode punctuation character). We could split on \pP alone, but then the punctuation would be consumed as the delimiter and dropped from the result; a lookbehind only asserts that it is there, so it stays in the output
  • \s* - zero or more whitespace characters, so that the whitespace after the punctuation is removed instead of ending up at the start of the next sentence
  • /u - the end delimiter ( / ) and our modifier ( u meaning "unicode")
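The difference between consuming the punctuation and only looking behind for it can be checked with a short sketch. The sample string and variable names are made up for the comparison.

```php
<?php

$str = 'Hi. I am Bob!';

// Plain \pP: the punctuation is part of the delimiter, so it is lost.
$consumed = preg_split('/\pP\s*/u', $str, -1, PREG_SPLIT_NO_EMPTY);

// Lookbehind: only the whitespace after the punctuation is the delimiter,
// so each piece keeps its closing punctuation.
$kept = preg_split('/(?<=\pP)\s*/u', $str, -1, PREG_SPLIT_NO_EMPTY);

print_r($consumed); // pieces: "Hi", "I am Bob"
print_r($kept);     // pieces: "Hi.", "I am Bob!"
```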


Your first example string would result in the following array:

Array
(
    [0] => 日本語を勉強しているみんなを応援したいです。
    [1] => 一緒に頑張りましょう!
)

Please note that \pP matches all punctuation, including commas, so a string such as "This is my sentence, and it is very nice." gets split at the comma as well:

Array
(
    [0] => This is my sentence,
    [1] => and it is very nice.
)

This can be fixed by using a negative lookbehind in front of our positive lookbehind:

/(?<![,、;;"”\'’``])(?<=\pP)\s*/u
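As a rough sketch of how that pattern could be used, the function below wraps it in a helper. The name split_sentences is hypothetical, and the excluded characters are adapted from the pattern above (half-width and full-width comma and semicolon, the ideographic comma, quotes, and the backtick), so adjust the list to your own text.

```php
<?php

// Hypothetical helper; not part of PHP itself.
function split_sentences(string $text): array
{
    // Negative lookbehind: do not split after commas, semicolons or quotes.
    // Positive lookbehind: only split after a punctuation character.
    $pattern = '/(?<![,,、;;"”\'’`])(?<=\pP)\s*/u';

    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(split_sentences('This is my sentence, and it is very nice. Right?'));
// pieces: "This is my sentence, and it is very nice." and "Right?"
```

Other punctuation such as colons, hyphens, or brackets would still trigger a split with this pattern. If that becomes a problem, a possible alternative is to list only the sentence terminators in the lookbehind, for example (?<=[.!?。!?]).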
