
Finding repeated words in PHP without specifying the word itself

I've been thinking about something for a project I want to do. I'm not an advanced user and I'm just learning, so I don't know if this is possible:

Suppose we have 100 html documents containing many tables and text inside them.

Question one: is it possible to analyze all this text, find the repeated words, and count them?

Yes, it's possible with some functions, but here's the problem: what if we don't know in advance which words we are going to find? That is, we would have to tell the code what counts as a word.

Suppose, for example, that a word is defined as a run of seven characters. The idea would be to find other runs that match that pattern and report them. What would be the best way to do this?

Thank you very much in advance.

Example:

Search: five-character patterns in the following phrases:

Text one:

"It takes an ocean not to break"

Text two:

"An ocean is a body of saline water"

Result:

takes 1
break 1
water 1
ocean 2

Thanks in advance for your help.

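function get_word_counts($phrases) {
    $counts = array();
    foreach ($phrases as $phrase) {
        // split on spaces, then strip punctuation (keep letters and hyphens)
        $words = explode(' ', $phrase);
        foreach ($words as $word) {
            $word = preg_replace("#[^a-zA-Z\-]#", "", $word);
            if ($word === '') {
                continue; // token contained no letters or hyphens
            }
            // initialise on first sight to avoid an undefined-index notice
            $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
        }
    }
    return $counts;
}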

$phrases = array("It takes an ocean of water not to break!", "An ocean is a body of saline water, or so I am told.");

$counts = get_word_counts($phrases);
arsort($counts);
print_r($counts);

OUTPUT

Array
(
    [of] => 2
    [ocean] => 2
    [water] => 2
    [or] => 1
    [saline] => 1
    [body] => 1
    [so] => 1
    [I] => 1
    [told] => 1
    [a] => 1
    [am] => 1
    [An] => 1
    [an] => 1
    [takes] => 1
    [not] => 1
    [to] => 1
    [It] => 1
    [break] => 1
    [is] => 1
)

EDIT
Updated to deal with basic punctuation, based on @Jack's comment.
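To get closer to the example in the question (words of exactly five characters, with "Ocean" and "ocean" counted together), the same idea can be narrowed with a regular expression. This is only a sketch: the function name count_words_of_length and its parameters are made up for illustration, and it assumes a "word" is a run of ASCII letters.

```php
<?php
// Hypothetical helper (not part of the original answer): counts words that are
// exactly $length ASCII letters long, ignoring case.
function count_words_of_length(array $phrases, int $length): array
{
    $counts = array();
    foreach ($phrases as $phrase) {
        // \b[a-z]{N}\b matches a run of exactly N letters between word boundaries
        preg_match_all('/\b[a-z]{' . $length . '}\b/i', $phrase, $matches);
        foreach ($matches[0] as $word) {
            $word = strtolower($word); // so "Ocean" and "ocean" share one key
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts); // highest counts first
    return $counts;
}

$phrases = array("It takes an ocean not to break", "An ocean is a body of saline water");
print_r(count_words_of_length($phrases, 5));
```

For the two phrases from the question this counts ocean twice and takes, break and water once each. The order of words with equal counts may differ between PHP versions.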

An alternative method using built-in functions that also ignores short words:

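function get_word_counts($text)
{
    // with format 1, str_word_count() returns an array of the words found in $text
    $words = str_word_count($text, 1);
    // ignore short words (fewer than 4 characters)
    foreach ($words as $k => $v) {
        if (strlen($v) < 4) {
            unset($words[$k]);
        }
    }
    return array_count_values($words);
}

// sample input, reusing the phrases from the answer above
$text = "It takes an ocean of water not to break! An ocean is a body of saline water, or so I am told.";
$counts = get_word_counts($text);
arsort($counts);
print_r($counts);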

Note: this assumes a single block of text. If you are processing an array of phrases, add a foreach ($phrases as $phrase) loop around the word extraction.
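A rough sketch of that array version is below. It also runs each phrase through strip_tags(), since the question mentions HTML documents, and lowercases the text so capitalised and lowercase forms are counted together. The function name get_word_counts_for_phrases and the $minLength parameter are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical wrapper: name and parameters are illustrative only.
function get_word_counts_for_phrases(array $phrases, int $minLength = 4): array
{
    $words = array();
    foreach ($phrases as $phrase) {
        // strip_tags() removes HTML markup; strtolower() merges "Ocean" and "ocean"
        $plain = strtolower(strip_tags($phrase));
        foreach (str_word_count($plain, 1) as $word) {
            if (strlen($word) >= $minLength) { // ignore short words
                $words[] = $word;
            }
        }
    }
    $counts = array_count_values($words);
    arsort($counts); // highest counts first
    return $counts;
}

$phrases = array(
    "<p>It takes an ocean of water not to break!</p>",
    "<td>An ocean is a body of saline water, or so I am told.</td>",
);
print_r(get_word_counts_for_phrases($phrases));
```

One caveat: strip_tags() removes tags without inserting spaces, so text from adjacent table cells such as <td>a</td><td>b</td> ends up joined together. For real tables it may be safer to replace tags with a space before counting.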

