简体   繁体   中英

Most used words in text with php

I found the code below on stackoverflow and it works well in finding the most common words in a string. But can I exclude the counting on common words like "a, if, you, have, etc"? Or would I have to remove the elements after counting? How would I do this? Thanks in advance.

<?php

$text = "A very nice to tot to text. Something nice to think about if you're into text.";


$words = str_word_count($text, 1); 

$frequency = array_count_values($words);

arsort($frequency);

echo '<pre>';
print_r($frequency);
echo '</pre>';
?>

This is a function that extract common words from a string. it takes three parameters; string, stop words array and keywords count. you have to get the stop_words from txt file using php function that take txt file into array

$stop_words = file('stop_words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$this->extract_common_words( $text, $stop_words)

You can use this file stop_words.txt as your primary stop words file, or create your own file.

function extract_common_words($string, $stop_words, $max_count = 5) {
      $string = preg_replace('/ss+/i', '', $string);
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too…
      $string = strtolower($string); // make it lowercase

      preg_match_all('/\b.*?\b/i', $string, $match_words);
      $match_words = $match_words[0];

      foreach ( $match_words as $key => $item ) {
          if ( $item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3 ) {
              unset($match_words[$key]);
          }
      }  

      $word_count = str_word_count( implode(" ", $match_words) , 1); 
      $frequency = array_count_values($word_count);
      arsort($frequency);

      //arsort($word_count_arr);
      $keywords = array_slice($frequency, 0, $max_count);
      return $keywords;
}

There's not additional parameters or a native PHP function that you can pass words to exclude. As such, I would just use what you have and ignore a custom set of words returned by str_word_count .

You can do this easily by using array_diff() :

$words = array("if", "you", "do", "this", 'I', 'do', 'that');
$stopwords = array("a", "you", "if");

print_r(array_diff($words, $stopwords));

gives

 Array
(
    [2] => do
    [3] => this
    [4] => I
    [5] => do
    [6] => that
)

But you have to take care of lower and upper case yourself. The easiest way here would be to convert the text to lowercase beforehand.

Here is my solution by using the built-in PHP functions:

most_frequent_words — Find most frequent word(s) appeared in a String

function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string); // Make string lowercase

    $words = str_word_count($string, 1); // Returns an array containing all the words found inside the string
    $words = array_diff($words, $stop_words); // Remove black-list words from the array
    $words = array_count_values($words); // Count the number of occurrence

    arsort($words); // Sort based on count

    return array_slice($words, 0, $limit); // Limit the number of words and returns the word array
}

Returns array contains word(s) appeared most frequently in the string.

Parameters :

string - The input string. - 输入字符串。

array (optional) - List of words which are filtered out from the array, Default empty array. (可选) - 从数组中过滤掉的单词列表,默认为空数组。

string (optional) - Limit the number of words returned, Default . (可选) - 限制返回的单词数,默认值为

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM