
Scikit - TF-IDF empty vocabulary

I have to calculate the distance/similarity of two or more texts. Some texts are genuinely very small or do not form proper English words, e.g. "A1024515". This means that the vectorizer should accept every single word in the list.

As a test case, I have used the following list as a corpus of words.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

words = ['A', 'A', 'A']

vect = TfidfVectorizer(min_df=0)
dtm = vect.fit_transform(words)
df_tf_idf = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

However, I get the following error:

ValueError: empty vocabulary; perhaps the documents only contain stop words

How can I ensure that the list is accepted as possible words, and that stop words are not removed from the corpus?

The problem is not stop words; there are no stop words by default. The problem is that the documents in your test case are too short (a single character).

By default, TfidfVectorizer uses r'(?u)\b\w\w+\b' to tokenize the given corpus into lists of words, which does not work with single-character strings.

sklearn.feature_extraction.text.TfidfVectorizer(... token_pattern='(?u)\b\w\w+\b', ...)
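To see why a one-character document produces an empty vocabulary, here is a minimal check with Python's re module (an illustration, not part of the original answer; the pattern strings are the ones discussed above):

import re

# The default pattern requires at least two word characters per token,
# so a single-character document yields no tokens at all.
default_pattern = r'(?u)\b\w\w+\b'
print(re.findall(default_pattern, 'A'))         # []
print(re.findall(default_pattern, 'A1024515'))  # ['A1024515']

# A pattern that accepts one or more word characters keeps single letters too.
relaxed_pattern = r'(?u)\b\w+\b'
print(re.findall(relaxed_pattern, 'A'))         # ['A']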

You can use your own regex, pass a tokenizer as a constructor argument (in that case, the given tokenizer overrides the regex), or use a longer, more realistic test case.
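As a sketch of the tokenizer route (assuming a simple whitespace split is acceptable for your data; this is an illustration, not the accepted answer's code):

from sklearn.feature_extraction.text import TfidfVectorizer

words = ['A', 'A', 'A']

# Passing a callable as `tokenizer` overrides `token_pattern` entirely.
# str.split tokenizes each (already lowercased) document on whitespace,
# so a single-character document like 'A' still yields a token.
vect = TfidfVectorizer(tokenizer=str.split)
dtm = vect.fit_transform(words)
print(vect.get_feature_names())  # ['a']; use get_feature_names_out() on scikit-learn >= 1.0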

Referring to the answer for the question "CountVectorizer raising error on short words":

from sklearn.feature_extraction.text import TfidfVectorizer

words = ['A', 'A', 'A']

# A token pattern of one or more word characters keeps single-letter tokens.
vect = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
dtm = vect.fit_transform(words)

vect.get_feature_names()

gives the output:

['a']
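Since the original goal was to measure distance/similarity between texts, here is a minimal follow-up sketch (not from the original post) showing how the resulting tf-idf matrix could be turned into pairwise similarities:

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between the documents in the tf-idf matrix.
# For three identical one-word documents this is trivially a 3x3 matrix of ones.
print(cosine_similarity(dtm))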
