
Scikit - TF-IDF empty vocabulary

I have to calculate the distance/similarity of two or more texts. Some of the texts are genuinely very short or do not form proper English words, e.g. "A1024515". This means the vectorizer should accept every single word in the list.

As a test case, I have used the following list as a corpus of words.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

words = ['A', 'A', 'A']

vect = TfidfVectorizer(min_df=0)
dtm = vect.fit_transform(words)
df_tf_idf = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

However, I get the following error:

ValueError: empty vocabulary; perhaps the documents only contain stop words

How can I ensure that every item in the list is accepted as a word, and that no words are removed from the corpus?

The problem is not stop words; TfidfVectorizer uses no stop word list by default. The problem is that the documents in your test case are too short (a single character each).

By default, TfidfVectorizer uses the regex r'(?u)\b\w\w+\b' to tokenize the given corpus into words. That pattern only matches tokens of two or more word characters, so it never matches single-character strings.
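You can see this directly with the re module: applying the default pattern to a one-character string yields no tokens at all, which is why the vocabulary ends up empty.

```python
import re

# scikit-learn's default token_pattern: two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"

print(re.findall(default_pattern, "A"))         # single character: no match
print(re.findall(default_pattern, "A1024515"))  # longer token: matched
```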

sklearn.feature_extraction.text.TfidfVectorizer(..., token_pattern='(?u)\b\w\w+\b', ...)

You can pass your own regex via token_pattern, pass a tokenizer callable as a constructor argument (in that case the tokenizer overrides the regex), or use a longer, more realistic test case.

Referring to the answer to the question "CountVectorizer raising error on short words":

from sklearn.feature_extraction.text import TfidfVectorizer

words = ['A', 'A', 'A']

vect = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
dtm = vect.fit_transform(words)

vect.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0

gives the output:

['a']
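Since the original goal was text distance/similarity, the relaxed token_pattern plugs straight into a pairwise cosine-similarity computation. The three sample texts below are made-up illustrations, not from the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents, including a code-like token.
texts = ['A1024515 error log', 'A1024515 warning', 'unrelated text']

# \b\w+\b keeps single-character and code-like tokens such as "A1024515".
vect = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
dtm = vect.fit_transform(texts)

# Pairwise cosine similarity between all documents.
sim = cosine_similarity(dtm)
print(sim.round(2))
```

The first two documents share the token "A1024515", so their similarity is higher than that between either of them and the unrelated third document.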
