I need to split my text into an array at every period, exclamation and question mark.
Example with a full-width period and exclamation mark:
$string = "日本語を勉強しているみんなを応援したいです。一緒に頑張りましょう!";
I am looking for the following output:
Array
(
[0] => 日本語を勉強しているみんなを応援したいです。
[1] => 一緒に頑張りましょう!
)
I need the same code to work with half-width.
Example with a mix of full-width and half-width:
$string = "Hi. I am Bob! Nice to meet you. 日本語を勉強しています。Do you understand me?";
Output:
Array
(
[0] => Hi.
[1] => I am Bob!
[2] => Nice to meet you.
[3] => 日本語を勉強しています。
[4] => Do you understand me?
)
I suck at regular expressions and can't figure out a solution or find one.
I tried:
$string = preg_split('(.*?[。?!])', $string);
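First of all, your pattern has no conventional delimiters (most commonly a slash), so PHP ends up treating the outer parentheses as the delimiters. It also lacks the u modifier, and because preg_split removes whatever the pattern matches, matching the whole sentence throws the sentence away.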
You can see the rest of the special Unicode character properties here.
You can split on \pP (any Unicode punctuation character; remember the u modifier, meaning Unicode):
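<?php
$str = 'Hi. I am Bob! Nice to meet you. 日本語を勉強しています。Do you understand me?';
// Split at every position preceded by punctuation, swallowing any whitespace that follows.
// -1 means "no limit" (passing null for the limit is deprecated as of PHP 8.1).
$array = preg_split('/(?<=\pP)\s*/u', $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($array);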
The PREG_SPLIT_NO_EMPTY flag is there to make sure that we don't include an empty element if your last character is punctuation.
Output:
Array
(
[0] => Hi.
[1] => I am Bob!
[2] => Nice to meet you.
[3] => 日本語を勉強しています。
[4] => Do you understand me?
)
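To make the role of PREG_SPLIT_NO_EMPTY concrete, here is a minimal sketch that runs the same pattern with and without the flag. The sample string and the variable names $withoutFlag and $withFlag are made up for illustration.

```php
<?php
$str = 'Hi. Bye!'; // illustrative input that ends in punctuation

// Without the flag, the empty match at the very end leaves a trailing ''.
$withoutFlag = preg_split('/(?<=\pP)\s*/u', $str);

// With the flag, empty pieces are dropped from the result.
$withFlag = preg_split('/(?<=\pP)\s*/u', $str, -1, PREG_SPLIT_NO_EMPTY);

var_dump(end($withoutFlag) === ''); // bool(true)
print_r($withFlag);                 // 'Hi.' and 'Bye!'
```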
Regex autopsy:
/ - the start delimiter; this must also come at the end, before our modifiers
(?<=\pP) - a positive lookbehind matching \pP (a Unicode punctuation character). We could just use \pP, but then the punctuation would not be included in our final strings; a positive lookbehind keeps it
\s* - a whitespace character matched 0 to infinity times; this makes sure that we don't include the whitespace after the punctuation
/u - the end delimiter (/) and our modifier (u, meaning "unicode")
Your first example string would result in the following array:
Array
(
[0] => 日本語を勉強しているみんなを応援したいです。
[1] => 一緒に頑張りましょう!
)
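The point the autopsy makes about the lookbehind can be checked directly. The sketch below, with an illustrative input and hypothetical variable names, compares splitting on the punctuation itself with splitting on the position just after it.

```php
<?php
$str = 'Hi. I am Bob!'; // illustrative input

// Matching \pP directly makes the punctuation part of the separator,
// so it is removed from the pieces.
$consumed = preg_split('/\pP\s*/u', $str, -1, PREG_SPLIT_NO_EMPTY);

// The lookbehind only asserts that punctuation precedes the split point,
// so the punctuation stays attached to its sentence.
$kept = preg_split('/(?<=\pP)\s*/u', $str, -1, PREG_SPLIT_NO_EMPTY);

print_r($consumed); // 'Hi' and 'I am Bob'
print_r($kept);     // 'Hi.' and 'I am Bob!'
```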
Please note that this splits on all punctuation, including commas, so "This is my sentence, and it is very nice." becomes:
Array
(
[0] => This is my sentence,
[1] => and it is very nice.
)
This can be fixed by using a negative lookbehind in front of our positive lookbehind:
/(?<![,、;;"”\'’``])(?<=\pP)\s*/u
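Since the question only asks for periods, exclamation marks and question marks, another option is to list those marks explicitly instead of excluding the unwanted ones. This is a swapped-in variant, not the pattern from the answer above; split_sentences is a hypothetical helper name and the sample string is made up. The sketch runs both patterns on the same input so the difference is visible.

```php
<?php
// Hypothetical helper (not from the original answer): split only after
// sentence-ending marks, half-width . ! ? and full-width 。 ! ?
function split_sentences(string $text): array
{
    return preg_split('/(?<=[.!?。!?])\s*/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$str = 'Note: commas, colons (and brackets) are punctuation too. 了解です。';

// The exclusion-based pattern still splits after ':', '(' and ')'.
$excluded = preg_split('/(?<![,、;;"”\'’`])(?<=\pP)\s*/u', $str, -1, PREG_SPLIT_NO_EMPTY);

// The explicit list splits only at the two sentence ends.
$listed = split_sentences($str);

print_r($excluded);
print_r($listed);
```

One caveat with either pattern: because \s* may match zero characters, a period inside a token such as "3.14" or "e.g." is also treated as a split point, so real-world text may need extra handling.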