
How to use PHP regular expressions to search a string for word sequences containing repeated words?

I am using PHP to count the number of occurrences of a word sequence in a string. In the following example cases, I am not getting the result I would like to see.

$subject1 = " [word1 [word1 [word1 [word1 [word3 ";
$pattern1 = preg_quote("[word1 [word1", '/');
echo "count of '[word1 [word1'=". preg_match_all("/(\s|^|\W)" . $pattern1 . "(?=\s|$|\W)/", $subject1, $dummy) . "<br/>"; 

$subject2 = " [word1 [word2 [word1 [word2 [word1 [helloagain ";
$pattern2 = preg_quote("[word1 [word2 [word1", '/');
echo "count of '[word1 [word2 [word1'=". preg_match_all("/(\s|^|\W)" . $pattern2 . "(?=\s|$|\W)/", $subject2, $dummy) . "<br/>";

The above returns:

count of '[word1 [word1'=2
count of '[word1 [word2 [word1'=1

I would like the result to be:

count of '[word1 [word1'=3 // there are 3 instances of '[word1 [word1' in $subject1
count of '[word1 [word2 [word1'=2 // there are 2 instances of '[word1 [word2 [word1' in $subject2

One way to achieve this: each time the pattern is found in the subject, the next search should start from the second word of the matched substring. Can such a regular expression be constructed? Thank you.

Use mb_substr_count.

substr_count does not count overlapping occurrences but, for reasons I don't know, mb_substr_count does:

$subject1 = " [word1 [word1 [word1 [word1 [word3 ";
echo mb_substr_count($subject1, "[word1 [word1"); // 3
echo mb_substr_count($subject1, "[word1 [word1 [word1"); // 2

EDIT:

For future reference: apparently mb_substr_count behaves differently on PHP 5.2 than on PHP 5.3. I suppose the correct behavior of this function is the same as that of substr_count, only with multibyte support. Since substr_count does not count overlapping occurrences, neither should mb_substr_count.

So although this answer works on PHP 5.2.6, do not use it, or you may run into problems when you update your PHP version.
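
If the phrase is a fixed string and no word-boundary check is needed, overlapping occurrences can probably be counted without depending on version-specific behavior, by calling strpos in a loop and moving the offset forward by one character after each hit. What follows is only a minimal sketch: the function name count_overlapping is hypothetical (it is not a PHP built-in), and it works on bytes, so a multibyte-aware variant would presumably use mb_strpos instead.

```php
<?php
// Hypothetical helper (not part of PHP): counts occurrences of $needle
// in $haystack, including occurrences that overlap each other.
function count_overlapping(string $haystack, string $needle): int
{
    if ($needle === '') {
        return 0; // an empty needle has no meaningful count
    }
    $count = 0;
    $offset = 0;
    // strpos() returns false when nothing is found at or after $offset.
    while (($pos = strpos($haystack, $needle, $offset)) !== false) {
        $count++;
        // Restart one byte after the START of this hit, not after its end,
        // so that overlapping occurrences are found as well.
        $offset = $pos + 1;
    }
    return $count;
}

$subject1 = " [word1 [word1 [word1 [word1 [word3 ";
echo count_overlapping($subject1, "[word1 [word1"), "\n";        // 3
echo count_overlapping($subject1, "[word1 [word1 [word1"), "\n"; // 2
```

Unlike the regular expressions in the question, this does not check what surrounds the phrase, so it would also count a hit inside a longer token.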

Instead of preg_match_all, I'd use a while loop on preg_match with an offset:

$subject1 = " [word1 [word1 [word1 [word1 [word3 ";
$pattern1 = preg_quote("[word1 [word1", '/');
$offset=0;
$total=0;
while($count = preg_match("/(?:\s|^|\W)$pattern1(?=\s|$|\W)/", $subject1, $matches, PREG_OFFSET_CAPTURE, $offset)) {
    // add this match to the running total
    $total  += $count;
    // move the offset one character past the start of this match,
    // so the next preg_match can find an occurrence that overlaps it
    $offset  = $matches[0][1]+1;
}
echo "total=$total\n";

Output:

total=3

The result for the second example is: total=2
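
The same idea can probably be folded into a single preg_match_all call, which is close to what the question asks for. The sketch below assumes the same delimiters as the patterns above and puts the quoted phrase inside a lookahead, so each match consumes only the one delimiter character in front of the phrase and the scan resumes right after that character. The helper name count_phrase is made up for illustration.

```php
<?php
// Hypothetical helper: counts occurrences of $phrase in $subject,
// including overlapping ones, using the delimiters from the question.
function count_phrase(string $subject, string $phrase): int
{
    $quoted = preg_quote($phrase, '/');
    // Only the leading delimiter (or nothing, at the very start of the
    // string) is consumed. The phrase itself sits inside a lookahead,
    // so the next match may begin inside the previous occurrence.
    $regex = '/(?:\s|^|\W)(?=' . $quoted . '(?=\s|$|\W))/';
    $count = preg_match_all($regex, $subject);
    return $count === false ? 0 : $count; // false signals a regex error
}

$subject1 = " [word1 [word1 [word1 [word1 [word3 ";
$subject2 = " [word1 [word2 [word1 [word2 [word1 [helloagain ";
echo count_phrase($subject1, "[word1 [word1"), "\n";        // 3
echo count_phrase($subject2, "[word1 [word2 [word1"), "\n"; // 2
```

Calling preg_match_all without a matches array requires PHP 5.4 or later; on older versions a dummy third argument would be needed, as in the question.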
