Split text into words but consider comma as word

In PHP, I would like to get each word of this text, but I need a comma to be treated as a separate word:

My input text:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit.

The array I want:

array[0] => "Lorem"
array[1] => "ipsum"
array[2] => "dolor"
array[3] => "sit"
array[4] => "amet"
array[5] => ","
array[6] => "consectetuer"
array[7] => "adipiscing"
array[8] => "elit"
array[9] => "."

What I get with explode(" ", $text) is:

array[0] => "Lorem"
array[1] => "ipsum"
array[2] => "dolor"
array[3] => "sit"
array[4] => "amet,"
array[5] => "consectetuer"
array[6] => "adipiscing"
array[7] => "elit."

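You could pad each comma with spaces (',' -> ' , '), and do the same for each period so that the final "." also becomes its own element, then split on whitespace:

$newSentence = str_replace(array(",", "."), array(" , ", " . "), $theSentence);
// PREG_SPLIT_NO_EMPTY drops the empty string left behind by the trailing space
$arr = preg_split('/\s+/', $newSentence, -1, PREG_SPLIT_NO_EMPTY);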

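Try the following. It captures each word together with an optional trailing comma or period, then flattens the captures in their original order:

preg_match_all('/(\w+)([,.])?/', "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.", $match, PREG_SET_ORDER);

$arr = array();
foreach ($match as $m) {
    // skip $m[0] (the full match); keep the word and any non-empty punctuation capture
    $arr = array_merge($arr, array_filter(array_slice($m, 1), 'strlen'));
}
print_r($arr);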

You should use preg_match_all() without any capture groups or lookarounds for best efficiency.

Code:

$string='Lorem ipsum dolor sit amet, consectetuer adipiscing elit.';
var_export(preg_match_all('/[a-z]+|\S/i',$string,$out)?$out[0]:'fail');

Output:

array (
  0 => 'Lorem',
  1 => 'ipsum',
  2 => 'dolor',
  3 => 'sit',
  4 => 'amet',
  5 => ',',
  6 => 'consectetuer',
  7 => 'adipiscing',
  8 => 'elit',
  9 => '.',
)
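If the same split is needed in several places, the one-liner above could be wrapped in a small helper. This is only a sketch: the function name split_words_and_punctuation is hypothetical, and returning an empty array when nothing matches is an assumption about what the caller wants.

```php
<?php
// Hypothetical helper (not from the original answer): returns words and
// individual punctuation characters as separate array elements.
function split_words_and_punctuation(string $text): array
{
    // First alternative: a run of letters. Second: any single non-whitespace character.
    $count = preg_match_all('/[a-z]+|\S/i', $text, $matches);

    // preg_match_all() returns 0 for no matches and false on error;
    // both cases fall back to an empty array here (an assumption).
    return $count ? $matches[0] : [];
}

print_r(split_words_and_punctuation('Lorem ipsum dolor sit amet, consectetuer adipiscing elit.'));
```

Called on the sample sentence, this gives the same ten elements as the output above.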

\w can be used to match a-z, A-Z, 0-9, and _, but in your sample only letters exist.
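To make the difference concrete, here is a small comparison on a made-up string (the sample text is an assumption, chosen because it contains digits and an underscore):

```php
<?php
// Hypothetical sample text with digits and an underscore.
$sample = 'user_42 paid 3 times, ok.';

// Letters only: digits and underscores fall through to \S, one character at a time.
preg_match_all('/[a-z]+|\S/i', $sample, $lettersOnly);

// \w: letters, digits and underscores stay together in one token.
preg_match_all('/\w+|\S/', $sample, $wordChars);

var_export($lettersOnly[0]);
// 'user', '_', '4', '2', 'paid', '3', 'times', ',', 'ok', '.'
echo "\n";
var_export($wordChars[0]);
// 'user_42', 'paid', '3', 'times', ',', 'ok', '.'
```

Which of the two is right depends on whether identifiers and numbers should count as words in your data.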

If you are including apostrophes, you can use $pattern='/[a-z\']+|\S/i' but future adjustments are decisions for you to make.
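A quick check of the apostrophe variant, using a made-up sentence (the sample text is an assumption):

```php
<?php
// Apostrophes are added to the character class so contractions stay in one piece.
$pattern = '/[a-z\']+|\S/i';

// Hypothetical sample text containing contractions.
$sample = "Don't panic, it's fine.";

preg_match_all($pattern, $sample, $out);
var_export($out[0]);
// "Don't", 'panic', ',', "it's", 'fine', '.'
```

Note that with this pattern a quote character used as a quotation mark would also be kept attached to the neighbouring word.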

The \S in the second alternative matches any single non-whitespace character; it collects, one at a time, all of the punctuation characters that the first alternative lets through.
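The "one at a time" behaviour is easiest to see with consecutive punctuation. The sample text below is made up, and the second pattern is only a possible variation (an assumption, not part of the answer) for anyone who would rather keep runs of punctuation together:

```php
<?php
// Hypothetical sample text with runs of punctuation.
$sample = 'Wait... really?!';

// As in the answer: each punctuation character becomes its own element.
preg_match_all('/[a-z]+|\S/i', $sample, $single);
var_export($single[0]);
// 'Wait', '.', '.', '.', 'really', '?', '!'
echo "\n";

// Assumed variation: group consecutive non-letter, non-whitespace characters.
preg_match_all('/[a-z]+|[^a-z\s]+/i', $sample, $grouped);
var_export($grouped[0]);
// 'Wait', '...', 'really', '?!'
```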

The i flag on the pattern dictates that [a-z] will act like [A-Za-z].
