Most used words in text with PHP

I found the code below on Stack Overflow, and it works well for finding the most common words in a string. But can I exclude common words like "a", "if", "you", "have", etc. from the count? Or would I have to remove those elements after counting? How would I do this? Thanks in advance.

<?php

$text = "A very nice to tot to text. Something nice to think about if you're into text.";


$words = str_word_count($text, 1); 

$frequency = array_count_values($words);

arsort($frequency);

echo '<pre>';
print_r($frequency);
echo '</pre>';
?>

This is a function that extracts common words from a string. It takes three parameters: the string, an array of stop words, and the maximum number of keywords to return. You can load the stop words from a text file with PHP's file() function, which reads a file into an array:

$stop_words = file('stop_words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// The function below is a plain function, so call it directly
// (use $this->extract_common_words(...) only if you make it a class method).
$keywords = extract_common_words($text, $stop_words);

You can use this file stop_words.txt as your primary stop words file, or create your own file.
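As a rough sketch of the loading step described above, the stop words can be normalized right after reading them, so that entries such as "A" or "if " (with a trailing space) still match lowercased text. The helper name load_stop_words() and the temporary file standing in for stop_words.txt are assumptions for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): load and normalize stop words.
function load_stop_words(string $path): array
{
    // file() returns false on failure; fall back to an empty list in that case.
    $lines = @file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }

    // Trim stray whitespace and lowercase each entry.
    $words = array_map(function ($word) {
        return strtolower(trim($word));
    }, $lines);

    // Drop entries that are empty after trimming, then remove duplicates and reindex.
    $words = array_filter($words, function ($word) {
        return $word !== '';
    });

    return array_values(array_unique($words));
}

// Demo: a temporary file stands in for stop_words.txt (one word per line).
$path = tempnam(sys_get_temp_dir(), 'stop');
file_put_contents($path, "A\nif \nyou\n\nhave\nyou\n");

$stop_words = load_stop_words($path);
unlink($path);

print_r($stop_words);
```

Returning an empty array on a missing file is just one possible design choice; throwing an exception may be preferable if the stop list is required.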

function extract_common_words($string, $stop_words, $max_count = 5) {
      // Collapse every run of whitespace (spaces, tabs, newlines) into one space,
      // so that adjacent words are not glued together
      $string = preg_replace('/\s+/', ' ', $string);
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too
      $string = strtolower($string); // make it lowercase

      // Split the string into candidate tokens at word boundaries
      preg_match_all('/\b.*?\b/i', $string, $match_words);
      $match_words = $match_words[0];

      // Drop empty tokens, stop words, and words of three characters or fewer
      foreach ( $match_words as $key => $item ) {
          if ( $item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3 ) {
              unset($match_words[$key]);
          }
      }

      // Count the remaining words and sort by frequency, highest first
      $word_count = str_word_count( implode(" ", $match_words) , 1);
      $frequency = array_count_values($word_count);
      arsort($frequency);

      // Keep only the $max_count most frequent words
      $keywords = array_slice($frequency, 0, $max_count);
      return $keywords;
}

There is no additional parameter or native PHP function that lets you pass words to exclude. As such, I would just use what you have and ignore a custom set of words from those returned by str_word_count.

You can do this easily by using array_diff():

$words = array("if", "you", "do", "this", 'I', 'do', 'that');
$stopwords = array("a", "you", "if");

print_r(array_diff($words, $stopwords));

gives:

Array
(
    [2] => do
    [3] => this
    [4] => I
    [5] => do
    [6] => that
)

But you have to take care of lower and upper case yourself. The easiest way here would be to convert the text to lowercase beforehand.
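A minimal sketch of that lowercase step, reusing the example above with mixed-case input. The variable $filtered and the altered casing of the sample words are assumptions for illustration.

```php
<?php
$words     = array("If", "you", "do", "this", "I", "do", "that");
$stopwords = array("a", "You", "if");

// Lowercase both lists so that "If" and "if" compare as equal.
$words     = array_map('strtolower', $words);
$stopwords = array_map('strtolower', $stopwords);

// array_diff() preserves the original keys; array_values() reindexes from 0.
$filtered = array_values(array_diff($words, $stopwords));

print_r($filtered);
```

Lowercasing the stop list as well keeps the comparison symmetric, so it does not matter how the stop words were typed.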

Here is my solution using built-in PHP functions:

most_frequent_words - Finds the most frequent word(s) in a string

function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string); // Make string lowercase

    $words = str_word_count($string, 1); // Returns an array containing all the words found inside the string
    $words = array_diff($words, $stop_words); // Remove black-list words from the array
    $words = array_count_values($words); // Count the number of occurrence

    arsort($words); // Sort based on count

    return array_slice($words, 0, $limit); // Limit the number of words and returns the word array
}

Returns an array containing the word(s) that appear most frequently in the string.

Parameters:

string $string - The input string.

array $stop_words (optional) - List of words to filter out of the result. Defaults to an empty array.

int $limit (optional) - Maximum number of words to return. Defaults to 5.
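The question also asks whether the common words could instead be removed after counting. A possible sketch of that alternative uses array_diff_key() on the frequency table; the stop-word list here is an assumption chosen for illustration.

```php
<?php
$text = "A very nice to tot to text. Something nice to think about if you're into text.";

// Assumed stop-word list, for illustration only.
$stop_words = array("a", "if", "you", "have", "to");

// Count every word first, as in the question's code (lowercased for consistent matching).
$frequency = array_count_values(str_word_count(strtolower($text), 1));

// Remove the stop words afterwards: array_flip() turns the words into keys, and
// array_diff_key() drops the entries of $frequency whose keys appear there.
$frequency = array_diff_key($frequency, array_flip($stop_words));

// Sort by count, highest first.
arsort($frequency);

print_r($frequency);
```

Note that "you're" stays in the result: str_word_count() keeps apostrophes inside words, so the stop word "you" does not match it. Filtering before or after counting should give the same counts; filtering afterwards only touches each distinct word once.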
