Finding repeated words in PHP without specifying the word itself

I've been thinking about something for a project I want to do. I'm not an advanced user; I'm just learning. I don't know if this is possible:

Suppose we have 100 HTML documents containing many tables and text.

Question one: is it possible to analyze all this text, find the words that are repeated, and count them?

Yes, it's possible with some functions, but here's the problem: what if we don't know in advance which words we are going to find? That is, we would have to tell the code what counts as a word.

Suppose, for example, that a word is a run of seven characters. The idea would be to find other similar patterns and report them. What would be the best way to do this?

Thank you very much in advance.

Example:

Search: five-character patterns in the following phrases:

Text one:

"It takes an ocean not to break"

Text two:

"An ocean is a body of saline water"

Result

Takes 1 
Break 1
water 1
Ocean 2

Thanks in advance for your help.

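function get_word_counts($phrases) {
    $counts = array();
    foreach ($phrases as $phrase) {
        $words = explode(' ', $phrase);
        foreach ($words as $word) {
            // Strip everything except letters and hyphens (basic punctuation handling)
            $word = preg_replace("#[^a-zA-Z\-]#", "", $word);
            if ($word === '') {
                continue; // skip tokens that were pure punctuation
            }
            // Initialise the counter on first sight to avoid an undefined-key warning
            if (!isset($counts[$word])) {
                $counts[$word] = 0;
            }
            $counts[$word] += 1;
        }
    }
    return $counts;
}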

$phrases = array("It takes an ocean of water not to break!", "An ocean is a body of saline water, or so I am told.");

$counts = get_word_counts($phrases);
arsort($counts);
print_r($counts);

OUTPUT

Array
(
    [of] => 2
    [ocean] => 2
    [water] => 2
    [or] => 1
    [saline] => 1
    [body] => 1
    [so] => 1
    [I] => 1
    [told] => 1
    [a] => 1
    [am] => 1
    [An] => 1
    [an] => 1
    [takes] => 1
    [not] => 1
    [to] => 1
    [It] => 1
    [break] => 1
    [is] => 1
)
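
Note that the output above lists "An" and "an" separately, because the array keys are case-sensitive. If that is not wanted, one option is to lowercase each word before counting. The following is only a sketch under stated assumptions: the name get_word_counts_ci is made up for illustration, the text is assumed to be plain ASCII, and the `??` operator requires PHP 7 or later.

```php
<?php
// Hypothetical variant of get_word_counts() that folds case before counting.
function get_word_counts_ci(array $phrases): array
{
    $counts = array();
    foreach ($phrases as $phrase) {
        foreach (explode(' ', $phrase) as $word) {
            // Keep letters and hyphens only, then lowercase (ASCII assumption)
            $word = strtolower(preg_replace('#[^a-zA-Z\-]#', '', $word));
            if ($word === '') {
                continue; // token was pure punctuation
            }
            // Start from 0 the first time a word is seen
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    return $counts;
}

$phrases = array(
    "It takes an ocean of water not to break!",
    "An ocean is a body of saline water, or so I am told."
);
$counts = get_word_counts_ci($phrases);
arsort($counts);
print_r($counts);
```

With this variant "an" is counted twice and there is no separate "An" key.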

EDIT
Updated to deal with basic punctuation, based on @Jack's comment.

An alternative method that uses built-in functions and also ignores short words:

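function get_word_counts($text)
{
    // str_word_count() with format 1 returns an array of the words found in $text
    $words = str_word_count($text, 1);
    foreach ($words as $k => $v) {
        if (strlen($v) < 4) {
            unset($words[$k]); // ignore short words
        }
    }
    // array_count_values() tallies how often each remaining word occurs
    return array_count_values($words);
}

// Sample input, reusing the phrases from the first snippet as one block of text
$text = "It takes an ocean of water not to break! An ocean is a body of saline water, or so I am told.";
$counts = get_word_counts($text);
arsort($counts);
print_r($counts);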

Note: this assumes a single block of text. If you are processing an array of phrases, wrap the call in foreach ($phrases as $phrase) and merge the counts.
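
Coming back to the original question (words of a fixed length, counted across several texts that may come from HTML), the loop over phrases can be combined with an exact-length check. This is a rough sketch rather than a definitive implementation: count_words_of_length is a hypothetical name, the input is assumed to be ASCII, the second sample text has been wrapped in table markup purely for illustration, and strip_tags() is used only as a crude way to drop markup.

```php
<?php
// Hypothetical helper: count words of exactly $length characters across texts.
function count_words_of_length(array $texts, int $length): array
{
    $words = array();
    foreach ($texts as $text) {
        // Put a space before every tag so adjacent table cells do not merge
        // into one word, then strip the markup (crude; an HTML parser such
        // as DOMDocument would be more robust for real documents).
        $plain = strip_tags(str_replace('<', ' <', $text));
        foreach (str_word_count($plain, 1) as $word) {
            if (strlen($word) === $length) {
                $words[] = $word; // keep only words of the requested length
            }
        }
    }
    return array_count_values($words);
}

// Sample data based on the two texts in the question (markup added as an assumption)
$texts = array(
    "It takes an ocean not to break",
    "<table><tr><td>An ocean is a body</td><td>of saline water</td></tr></table>",
);
$counts = count_words_of_length($texts, 5);
arsort($counts);
print_r($counts);
```

For these two texts, "ocean" gets a count of 2 and "takes", "break" and "water" a count of 1 each, which matches the result asked for in the question. Changing the second argument to 7 would look for seven-character words instead.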
