
Make tfidf vectorizer return as many features as the number of documents

I am using the sklearn TfidfVectorizer, which I fit on N documents, and then I want to get a vector representation of a word based on its tf-idf score in each document.

Some code could help:

from sklearn.feature_extraction.text import TfidfVectorizer

model = TfidfVectorizer()

corpus = ["first corpus with words like dog and cat", "second corpus with words like car and plane"]

model.fit(corpus)

model.transform(["cat"]).toarray().shape

>> (1, 11)

Why am I getting 11 features? I expected 2 features, since I fitted the model on only two documents.

So what I want is something like:

[0, tfidfscore]

I read the documentation and, with a basic understanding of TF-IDF, arrived at the following conclusion. This is not an expert opinion.

As per the documentation, transform returns a sparse matrix whose dimensions are (n_samples, n_features).

Returns X: sparse matrix, [n_samples, n_features]

Tf-idf-weighted document-term matrix.

Now your n_samples is 1, and n_features comes from the fitted model: it is the size of the vocabulary learned from the corpus, which is 11 here.

What transform returns is the TF-IDF-weighted document-term matrix, where every row corresponds to a document and every column to a feature (a term in the vocabulary).
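To make the row/column layout concrete, here is a small sketch using the corpus from the question. The variable names (doc_term, corpus_shape, query_shape, vocab_size) are only illustrative, and default TfidfVectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "first corpus with words like dog and cat",
    "second corpus with words like car and plane",
]

model = TfidfVectorizer()
# fit_transform learns the vocabulary and returns the document-term matrix
doc_term = model.fit_transform(corpus)

# Rows are documents, columns are vocabulary terms
corpus_shape = doc_term.shape
# A new one-document input keeps the same columns but has a single row
query_shape = model.transform(["cat"]).shape
# vocabulary_ maps each term to its column index
vocab_size = len(model.vocabulary_)

print(corpus_shape, query_shape, vocab_size)
```

So the number of columns follows the number of distinct terms, while the number of rows follows the number of documents passed to transform.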

You can list your features with "print(model.get_feature_names())" (in scikit-learn 1.0 and later the method is get_feature_names_out(); get_feature_names() was removed in 1.2). In your case the features are:

['and', 'car', 'cat', 'corpus', 'dog', 'first', 'like', 'plane', 'second', 'with', 'words']

As you can see, there are 11 features. "cat" is the third one, so the third column is where the non-zero weight must appear. If you run "print(model.transform(["cat"]).toarray())" you will see the entire matrix. As said before, there is one row (because you passed in one document, "cat") and 11 columns (for the reason above). As you can see below, the third column holds the only non-zero value, 1.00.

[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
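The value is exactly 1.0 because, with the default settings (assumed here: norm="l2", no sublinear tf), each row is rescaled to unit length, and a row with a single non-zero entry can only become 1. A rough sketch that rebuilds the row by hand; raw, manual and cat_col are illustrative names, not part of the original code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "first corpus with words like dog and cat",
    "second corpus with words like car and plane",
]
model = TfidfVectorizer().fit(corpus)

# Column index assigned to "cat" in the learned vocabulary
cat_col = model.vocabulary_["cat"]
# Row produced by the vectorizer for the one-word document
row = model.transform(["cat"]).toarray()[0]

# Before normalization, the only non-zero entry is tf * idf for "cat" (tf = 1)
raw = np.zeros(len(model.vocabulary_))
raw[cat_col] = 1 * model.idf_[cat_col]
# L2 normalization divides the row by its Euclidean length
manual = raw / np.linalg.norm(raw)

print(cat_col, row[cat_col], manual[cat_col])
```

In other words, the 1.00 says nothing about how frequent "cat" is in the corpus; it only reflects that "cat" is the sole known term in the input.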

I have made minimal changes to your code in the hope that it helps.

from sklearn.feature_extraction.text import TfidfVectorizer

model = TfidfVectorizer()

corpus = ["first corpus with words like dog and cat", "second corpus with words like car and plane"]

model.fit(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+
Returned_Features = model.get_feature_names_out()
Returned_TF_IDF_DTM = model.transform(["cat"]).toarray()

print(Returned_Features)
print(Returned_TF_IDF_DTM)

I hope it helps. All the best.

It seems that you want to do something like this:

from sklearn.feature_extraction.text import TfidfVectorizer

model = TfidfVectorizer()

corpus = ["first corpus with words like dog and cat", "second corpus with words like car and plane"]

X = model.fit_transform(corpus)

words = model.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
word2idx = dict(zip(words, range(len(words))))

print(X[:, word2idx['cat']].todense())

This will give you the column of the tf-idf matrix that corresponds to the word "cat", i.e. one tf-idf score per document.
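The same idea can be wrapped in a small helper that returns a flat vector with one entry per fitted document. This is only a sketch: word_vector is a hypothetical function name, not part of scikit-learn, and returning all zeros for an unknown word is an assumption about the desired behaviour.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "first corpus with words like dog and cat",
    "second corpus with words like car and plane",
]
model = TfidfVectorizer()
X = model.fit_transform(corpus)  # shape: (n_documents, n_terms)


def word_vector(word, vectorizer, doc_term_matrix):
    """Return the tf-idf score of `word` in each fitted document."""
    n_docs = doc_term_matrix.shape[0]
    idx = vectorizer.vocabulary_.get(word)
    if idx is None:
        # Word never seen during fit: assume an all-zero vector
        return np.zeros(n_docs)
    # Select the word's column and flatten it to a 1-D array
    return doc_term_matrix[:, [idx]].toarray().ravel()


cat_vec = word_vector("cat", model, X)
unknown_vec = word_vector("zebra", model, X)
print(cat_vec.shape, cat_vec)
```

Here cat_vec has length 2, with a positive score for the first document (which contains "cat") and 0 for the second.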

