
TF-IDF algorithm on Chinese text

Original post: 2020-07-23 09:09:01 | Tags: python / tf-idf / tfidfvectorizer

I am running TF-IDF on Chinese text and looking for the top 10 words in the text.
When I get the top 10 words, some of them are meaningless, such as "成为" ("to become") and "表示" ("to indicate").
Is there a way to get only meaningful words?
I am using "jieba" to segment the Chinese sentences into words.

1 solution

Words like "成为" and "表示" are what we refer to as stop words. In many cases, they are commonly used words that carry little meaning within a sentence; think of the words "a" and "the" in English.

It is sometimes necessary to remove these stop words before performing an analysis, especially for TF-IDF, since leaving them in may lead to meaningless results like the ones you have seen.

It seems that Jieba's tokenizer doesn't include functionality to remove stop words, but genediazjr has collected a fairly comprehensive list of stop words for the Chinese language. You can import this list and remove these stop words from your original text before the TF-IDF analysis.
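As a rough illustration of that filtering step, here is a minimal sketch using only the standard library. The tiny `STOP_WORDS` set, the sample documents, and the helper names `remove_stop_words` and `top_tfidf` are all made up for this example; in practice you would load the full stop word list from a file and get the tokens from `jieba.lcut(text)`. The scoring is a simple smoothed TF-IDF, so the exact values will differ from what your own TF-IDF library produces.

```python
import math
from collections import Counter

# Hypothetical mini stop word list for illustration only; in practice,
# load a full list from a file (one word per line).
STOP_WORDS = {"成为", "表示", "的", "了", "是"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words and whitespace-only tokens from a token list."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def top_tfidf(docs, k=10):
    """Return the k highest-scoring (word, score) pairs for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            # Relative term frequency times a smoothed IDF:
            # ln((1 + N) / (1 + df)) + 1
            w: (c / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, c in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        results.append(sorted(scores.items(), key=lambda x: (-x[1], x[0]))[:k])
    return results


# Pre-tokenized sample documents (assumed data). With jieba installed,
# each list would instead come from jieba.lcut(text).
raw_docs = [
    ["公司", "表示", "新", "产品", "成为", "市场", "焦点"],
    ["专家", "表示", "人工智能", "成为", "趋势"],
]

# Filter first, then score, so stop words never reach the ranking.
docs = [remove_stop_words(d) for d in raw_docs]
top_words = top_tfidf(docs, k=10)
```

If you are using scikit-learn's `TfidfVectorizer`, the same list can instead be passed through its `stop_words` parameter, provided the entries match the tokens your tokenizer produces.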