Python: Using a list with TF-IDF

Question

I have the following code, which currently compares all the words in 'Tokens' with each document in 'df'. Is there a way to compare a predefined list of words with the documents instead of 'Tokens'?

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(norm=None)

# join each row's token list into one space-separated string per document
list_contents = []
for index, row in df.iterrows():
    list_contents.append(' '.join(row.Tokens))

# list_contents = df.Content.values

tfidf_matrix = tfidf_vectorizer.fit_transform(list_contents)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
df_tfidf.head(10)

Any help is appreciated. Thank you!

Answer 1

I'm not sure I understand you correctly, but if you want the vectorizer to consider only a fixed list of words, you can use the vocabulary parameter.

from sklearn.feature_extraction.text import TfidfVectorizer

my_words = ["foo", "bar", "baz"]

# set the vocabulary parameter with your list of words
tfidf_vectorizer = TfidfVectorizer(
    norm=None,
    vocabulary=my_words)

# join each row's token list into one string per document (df comes from the question)
list_contents = []
for index, row in df.iterrows():
    list_contents.append(' '.join(row.Tokens))

# this matrix will have only 3 columns because we have forced
# the vectorizer to use just the words foo, bar and baz,
# so it will ignore all other words in the documents.
tfidf_matrix = tfidf_vectorizer.fit_transform(list_contents)

Answer 1 (accepted, 2018-10-21 01:14:57)