
Scikit - TF-IDF empty vocabulary

I have to calculate the distance/similarity of two or more texts. Some of the texts are genuinely very short or do not form proper English words, e.g. "A1024515". This means the vectorizer should accept every single word in the list.

As a test case, I have used the following list as a corpus of words.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

words = ['A', 'A', 'A']

vect = TfidfVectorizer(min_df=0)
dtm = vect.fit_transform(words)
df_tf_idf = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

However, I get the following error:

ValueError: empty vocabulary; perhaps the documents only contain stop words

How can I ensure that every item in the list is accepted as a word, and that no words are removed from the corpus?

The problem is not stop words; TfidfVectorizer uses no stop word list by default. The problem is that the documents in your test case are too short (a single character each).

By default, TfidfVectorizer uses the regex r'(?u)\b\w\w+\b' to tokenize the given corpus into words. That pattern only matches tokens of two or more word characters, so it never matches single-character strings.
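You can see this directly with the re module: applying the default pattern to a one-character string yields no tokens at all, which is why the vocabulary ends up empty.

```python
import re

# scikit-learn's default token_pattern: two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"

print(re.findall(default_pattern, "A"))         # single character: no match
print(re.findall(default_pattern, "A1024515"))  # longer token: matched
```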

sklearn.feature_extraction.text.TfidfVectorizer(..., token_pattern='(?u)\b\w\w+\b', ...)

You can pass your own regex via token_pattern, pass a tokenizer callable as a constructor argument (in that case the tokenizer overrides the regex), or use a longer, more realistic test case.

Referring to the answer to the question "CountVectorizer raising error on short words":

from sklearn.feature_extraction.text import TfidfVectorizer

words = ['A', 'A', 'A']

vect = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
dtm = vect.fit_transform(words)

vect.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0

gives the output:

['a']
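Since the original goal was text distance/similarity, the relaxed token_pattern plugs straight into a pairwise cosine-similarity computation. The three sample texts below are made-up illustrations, not from the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents, including a code-like token.
texts = ['A1024515 error log', 'A1024515 warning', 'unrelated text']

# \b\w+\b keeps single-character and code-like tokens such as "A1024515".
vect = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
dtm = vect.fit_transform(texts)

# Pairwise cosine similarity between all documents.
sim = cosine_similarity(dtm)
print(sim.round(2))
```

The first two documents share the token "A1024515", so their similarity is higher than that between either of them and the unrelated third document.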
