Split a text by space and capital letter (PHP)
I am trying to break a text into sentences. There are no dots in this text, but it contains capital letters. I use:
<?php preg_match_all('/[A-Z][^A-Z]*?/Usu',$text,$sentences);
But it splits the text only on capital letters, so I get "sentences" such as "S", "M", "S". That is wrong: I do not want to break up words such as SMS. Help please.
Some clarification:
You really shouldn't be using regex to parse something as complex as natural language. I'd recommend something like IntlBreakIterator instead.
$text = "Sentence 1. Sentence 2! Sentence 3? Sentence; number 4...Sentence, 5.";
$it = IntlBreakIterator::createSentenceInstance("en_US");
$it->setText($text);
$parts = $it->getPartsIterator();
foreach ($parts as $point => $sentence) {
    echo "$point => $sentence\n\n\n";
}
Output
0 => Sentence 1.
1 => Sentence 2!
2 => Sentence 3?
3 => Sentence; number 4...
4 => Sentence, 5.
The rules for parsing words and sentences can be complex and daunting to implement in a regular expression. This solution is saner for a syntactically correct corpus. However, if the text has no punctuation, as you say, then there is no sane way to distinguish one sentence from another. Simply splitting on capital letters can yield a lot of false positives, because words such as proper nouns and some abbreviations are capitalized mid-sentence.
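To make the false-positive problem concrete, here is a minimal sketch. The sample text and the naive rule (split on whitespace that is followed by an uppercase letter) are assumptions chosen for illustration, not something taken from the question.

```php
<?php
// Hypothetical sample text with no punctuation.
// A reader sees two sentences; "John" and "Paris" are proper nouns.
$text = "We met John in Paris He was late";

// Naive rule (assumed for illustration): split on whitespace that is
// immediately followed by an uppercase letter.
$parts = preg_split('/\s+(?=\p{Lu})/u', $text);

print_r($parts);
```

This produces four pieces ("We met", "John in", "Paris", "He was late") rather than two sentences, because the rule cannot tell a proper noun from the start of a sentence.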
I assume you wish to break a string into pieces, where the break points are zero-width positions that immediately precede a capital letter and do not follow a capital letter. If so, you could use the following regular expression.
(?=(?<![A-Z]|^)[A-Z])
This can be executed as follows:
<?php
$result = preg_split("/(?=(?<![A-Z]|^)[A-Z])/", "now is THE time to BE brave");
print_r($result);
This returns
Array
(
[0] => now is
[1] => THE time to
[2] => BE brave
)
If the first word of the string were capitalized ("Now"), the first element of the array would be "Now is" (i.e., not an empty string).
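As a quick check of the leading-capital case, the same pattern can be run on a string that starts with "Now". The variable names here are only illustrative.

```php
<?php
// Same pattern as above; the "|^" alternative inside the negative
// lookbehind prevents a split at the very start of the string.
$pattern = '/(?=(?<![A-Z]|^)[A-Z])/';

$result = preg_split($pattern, "Now is THE time to BE brave");

// The first element is "Now is " (with a trailing space), not "".
var_dump($result[0]);
print_r($result);
```

Note that each piece keeps the space that preceded the split point, since the split position itself is zero-width.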
PHP's regex engine performs the following operations.
(?=          # begin a positive lookahead
  (?<!       # begin a negative lookbehind
    [A-Z]    # match a capital letter
    |        # or
    ^        # match the beginning of the string
  )          # end the negative lookbehind
  [A-Z]      # match a capital letter
)            # end positive lookahead
This attempts to match a capital letter ([A-Z]) in a positive lookahead, but the match fails if that capital letter is immediately preceded by another capital letter or is at the beginning of the string, which are the two cases the negative lookbehind rules out.
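Applied to the kind of text in the question, a sketch might look like the following. The sample string is made up, and trimming the pieces with array_map is an optional extra step, not part of the regular expression above.

```php
<?php
// Hypothetical input: no punctuation, contains the acronyms "SMS" and "USA".
$text = "send me an SMS today Call me USA time";

// Split before a capital letter that is not preceded by another capital
// letter and is not at the start of the string.
$pieces = preg_split('/(?=(?<![A-Z]|^)[A-Z])/', $text);

// Optional: drop the trailing space each piece inherits from the split.
$pieces = array_map('trim', $pieces);

print_r($pieces);
```

The acronyms stay intact ("SMS today", "USA time"), but a split still happens before each of them, so the pieces are runs that start with a capitalized word rather than guaranteed sentences.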