
TF-IDF algorithm on Chinese text

Original post 2020-07-23 09:09:01 0 1 python/ tf-idf/ tfidfvectorizer

I am running TF-IDF on Chinese text and looking for the top 10 words in the text.
When I get the top 10 words, some of them are meaningless, such as "成为", "表示", and others.
Is there any way to get only meaningful words?
I am using "jieba" to segment the Chinese sentences into words.

1 answer

Words like "成为" ("become") and "表示" ("express" or "indicate") are what we call stop words. In many cases they are commonly used words that contribute little meaning to a sentence; think of "a" and "the" in English.

It is often necessary to remove stop words before analysis, especially for TF-IDF, because leaving them in can produce meaningless results like the ones you have seen.

It seems that Jieba's segmentation does not remove stop words by itself, but genediazjr has collected a fairly comprehensive list of stop words for Chinese. You can load this list and remove those words from your segmented text before running the TF-IDF analysis.
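
As a rough illustration of that suggestion, here is a minimal sketch that filters stop words out of already-segmented text and then ranks the remaining words by a TF-IDF score. Everything named here is an assumption for illustration: `STOP_WORDS` is a tiny hypothetical set standing in for a full list loaded from a file, `remove_stop_words` and `top_tfidf_terms` are made-up helper names, and the token lists are hard-coded stand-ins for what `jieba.lcut` might return. The score uses a simple smoothed IDF and is not meant to reproduce any particular library's output.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word set for illustration only. In practice you
# would load a full list from a file (one word per line) into a set.
STOP_WORDS = {"成为", "表示", "的", "了", "是"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words and whitespace-only tokens from a list of tokens."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def top_tfidf_terms(docs, k=10):
    """Return up to k (term, score) pairs per tokenized document, best first."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        if not doc:
            results.append([])
            continue
        counts = Counter(doc)
        total = len(doc)
        # Term frequency times a smoothed inverse document frequency.
        scores = {
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties by term for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:k])
    return results


# Token lists as they might come out of jieba.lcut(sentence); hard-coded
# here so the sketch runs without jieba installed.
raw_docs = [
    ["他", "表示", "经济", "增长", "成为", "重点"],
    ["经济", "政策", "表示", "支持", "的", "增长"],
]

filtered_docs = [remove_stop_words(doc) for doc in raw_docs]
top_terms = top_tfidf_terms(filtered_docs, k=10)

for terms in top_terms:
    print(terms)
```

If you are using scikit-learn's `TfidfVectorizer`, another option is to pass the same word list through its `stop_words` parameter together with a jieba-based `tokenizer`, instead of filtering by hand.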





solution1 1 2020-07-27 10:06:56
