
TF-IDF algorithm on Chinese text

I am running TF-IDF on Chinese text and looking for the top 10 words in the text.
When I get the top 10 words, some of them are meaningless words such as "成为", "表示", and others.
Is there a way to get only meaningful words?
I am using "jieba" to cut the Chinese sentences into words.

Words like "成为" ("to become") and "表示" ("to express") are what we refer to as stop words. In many cases, they are commonly used words that add little meaning to the sentence; think of the words "a" and "the" in English.

It is sometimes necessary to remove these stop words before performing analysis, especially for TF-IDF, because leaving them in may lead to meaningless results like the ones you have seen.

Jieba's tokenizer does not seem to remove stop words on its own, but genediazjr has collected a fairly comprehensive list of stop words for the Chinese language. You can load this list and remove those stop words from your tokenized text before the TF-IDF analysis.
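
As a rough illustration of that filtering step, here is a minimal sketch in plain Python. It assumes the text has already been segmented (for example with jieba.lcut) and that the stop word list is a UTF-8 file with one word per line. The function names, the tiny stop word set, and the sample documents are made up for the example, and the TF-IDF formula is the basic textbook variant rather than whatever your own pipeline uses.

```python
import math
from collections import Counter


def load_stop_words(path):
    """Read one stop word per line from a UTF-8 text file (path is hypothetical)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Drop stop words and whitespace-only tokens from a token list."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def top_tfidf(docs, doc_index, k=10):
    """Return the k highest-scoring (word, tf-idf) pairs for one document."""
    doc = docs[doc_index]
    if not doc:
        return []
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for d in docs for word in set(d))
    tf = Counter(doc)
    scores = {
        w: (count / len(doc)) * math.log(n_docs / df[w])
        for w, count in tf.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Made-up, already segmented documents; in practice: tokens = jieba.lcut(text)
STOP_WORDS = {"成为", "表示", "的", "了"}  # stand-in for a full stop word list
raw_docs = [
    ["公司", "表示", "销售", "增长", "的", "销售"],
    ["他", "成为", "了", "教练"],
    ["公司", "表示", "教练", "的", "合同"],
]
docs = [remove_stop_words(d, STOP_WORDS) for d in raw_docs]
top_words = top_tfidf(docs, 0, k=3)
print(top_words)
```

With a real list, you would replace STOP_WORDS with the result of load_stop_words pointed at wherever you saved the downloaded file. Filtering after segmentation, not before, keeps the tokenizer from seeing sentences with words cut out of them.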
