Split a text by space and capital letter (PHP)

I am trying to break a text into sentences. There are no periods in this text, but it contains capital letters. I use:

 <?php preg_match_all('/[A-Z][^A-Z]*?/Usu',$text,$sentences);

But it splits the text at every capital letter, so I get "sentences" such as "S", "M", "S". That is wrong: I do not want to break up words such as SMS. Please help.

Some clarification:

  • I am trying to break the string before each run of one or more capital letters.
  • But my real task is more complex: I am trying to format text for readability.
  • Example: a piece of a job vacancy without HTML tags and line breaks: "Desirable: AWS experience Experience with Docker/Kubernetes". I am trying to get: "Desirable:", "AWS experience" and "Experience with Docker/Kubernetes". (I think I will be able to join very short strings back together after splitting by space and capital letter. Of course, this may be a very bad approach.)

You really shouldn't be using regex to parse something as complex as natural language. I'd recommend something like IntlBreakIterator instead.

$text = "Sentence 1. Sentence 2! Sentence 3? Sentence; number 4...Sentence, 5.";

$it = IntlBreakIterator::createSentenceInstance("en_US");
$it->setText($text);
$parts = $it->getPartsIterator();

foreach ($parts as $point => $sentence) {
    echo "$point => $sentence\n\n\n";
}

Output

0 => Sentence 1. 


1 => Sentence 2! 


2 => Sentence 3? 


3 => Sentence; number 4...


4 => Sentence, 5.

The rules for parsing words and sentences can be complex and daunting to implement in a regular expression. This solution is saner for a syntactically correct corpus. However, if the text has no punctuation, as you say, then there is no sane way to distinguish one sentence from another. Simply splitting on capital letters can yield many false positives, because words such as proper nouns and some abbreviations are capitalized mid-sentence.
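To illustrate the false positives, here is a minimal sketch that applies a naive rule to the vacancy text from the question. The rule (break at whitespace that is followed by an uppercase letter) and the pattern are assumptions chosen for illustration, not a recommended solution.

```php
<?php
// The vacancy text from the question: no punctuation between "sentences".
$text = 'Desirable: AWS experience Experience with Docker/Kubernetes';

// Assumed naive rule: break at whitespace followed by an uppercase letter.
// \p{Lu} matches any uppercase letter; the u modifier treats the input as UTF-8.
$parts = preg_split('/\s+(?=\p{Lu})/u', $text);

print_r($parts);
```

This yields four pieces: "Desirable:", "AWS experience", "Experience with" and "Docker/Kubernetes". The unwanted break before "Docker" is the kind of mid-sentence capital that a capital-letter rule cannot tell apart from a real sentence start.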

I assume you wish to break a string into pieces, where the break points are zero-width positions that immediately precede a capital letter and do not follow a capital letter. If so, you could use the following regular expression.

(?=(?<![A-Z]|^)[A-Z])

Regex demo

This can be executed as follows:

<?php
$result = preg_split("/(?=(?<![A-Z]|^)[A-Z])/", "now is THE time to BE brave"); 
print_r($result); 

PHP demo

As shown at the link, this returns

Array
(
    [0] => now is 
    [1] => THE time to 
    [2] => BE brave
)

If the first word of the string were capitalized ("Now"), the first element of the array would be "Now is " (i.e., not an empty string).
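A quick check of that claim, reusing the same pattern with only the first word of the input changed:

```php
<?php
// Same pattern as above; the input now starts with a capital letter.
// The ^ alternative inside the negative lookbehind prevents a split at offset 0,
// so no empty leading element is produced.
$result = preg_split('/(?=(?<![A-Z]|^)[A-Z])/', 'Now is THE time to BE brave');

print_r($result);
```

The resulting array has three elements: "Now is ", "THE time to " and "BE brave".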

PHP's regex engine performs the following operations.

(?=           # begin a positive lookahead
  (?<!        # begin a negative lookbehind
    [A-Z]     # match a capital letter
    |         # or
    ^         # match the beginning of the string
  )           # end the negative lookbehind
  [A-Z]       # match a capital letter
)             # end positive lookahead

This attempts to match a capital letter in a positive lookahead ([A-Z]), but that match fails if the negative lookbehind finds a capital letter immediately preceding it or if the capital letter is at the beginning of the string.
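The character class [A-Z] only recognizes ASCII capitals. If the text may contain non-ASCII letters (the pattern in the question uses the u modifier, which suggests it might), a Unicode-aware variant could look like the sketch below. The sample string and the trimming step are assumptions added for illustration; the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample text with accented capitals.
$text = 'déjà vu ÉTÉ chaud À la plage';

// Same idea as the pattern above, with \p{Lu} (any uppercase letter)
// in place of [A-Z] and the u modifier so the input is read as UTF-8.
$pieces = preg_split('/(?=(?<!\p{Lu}|^)\p{Lu})/u', $text);

// The split keeps the trailing space of each piece; trim it off.
$pieces = array_map('trim', $pieces);

print_r($pieces);
```

This produces "déjà vu", "ÉTÉ chaud" and "À la plage": the run of capitals in "ÉTÉ" stays together, just as "THE" does in the ASCII example.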
