scikit-learn: tfidf model representation

tl;dr: what does the tf-idf sparse matrix look like?

Assume I have the following.

from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = ["he liked dogs", "she liked cats", "she hated cars"]
tfidf = TfidfVectorizer()
trained_model = tfidf.fit_transform(descriptions)

Now I want to combine the tfidf scores with other features of the documents, and give them different weights. For example, I want to add length_document and num_words as features of a document. So each document should be represented as

d = [tfidf_score, length_document, num_words]

And then I will try to figure out the best weights for these three features to return the most similar document.

But first, I need to figure out what exactly trained_model looks like.

(Pdb) trained_model
<5801x8954 sparse matrix of type '<type 'numpy.float64'>'
    with 48369 stored elements in Compressed Sparse Row format>
(Pdb) trained_model[0]
<1x8954 sparse matrix of type '<type 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
(Pdb) trained_model[1]
<1x8954 sparse matrix of type '<type 'numpy.float64'>'
    with 11 stored elements in Compressed Sparse Row format>

There are 5801 documents in total, and they are represented by 8954 words in the corpus. Then what do x stored elements represent?

If you have time:

I assume that each document is represented by a vector whose length is 8954 in this case. If I just add two features at the end and make the vector length 8956, it wouldn't make sense to weight them all equally. I want the first 8954 features to take 1/3 of the weight, and the last two 2/3. Does that make sense?

Each row in the matrix corresponds to a document. The rows are stored according to the Compressed Sparse Row format: only non-zero terms are included.

So trained_model[0], which returns the tf-idf vector for the first document, has four stored entries: one tf-idf score for each of the four distinct terms in that document. And the second document has 11 tf-idf scores for its 11 distinct terms.
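To see what those stored elements are, here's a minimal sketch using the toy descriptions from the question (with the toy data each row holds three stored elements; the 4 and 11 above come from your real corpus). Note get_feature_names_out is the current scikit-learn method name; older versions call it get_feature_names.

from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = ["he liked dogs", "she liked cats", "she hated cars"]
tfidf = TfidfVectorizer()
trained_model = tfidf.fit_transform(descriptions)

terms = tfidf.get_feature_names_out()  # get_feature_names() on older versions
row = trained_model[0]                 # the CSR row for the first document
for col, score in zip(row.indices, row.data):
    print(terms[col], score)
# Prints one (term, tf-idf score) pair per stored element:
# here "dogs", "he" and "liked" with their scores.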

Regarding your weighting: if you want to measure the similarity of documents, you should probably use a distance metric such as cosine similarity on the tf-idf vectors. Having 2/3 of the similarity assigned to essentially the length of the document may not be what you want.
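For instance, a minimal sketch with scikit-learn's cosine_similarity, reusing trained_model from the snippet above:

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities between all documents; works on the
# sparse tf-idf matrix directly, no densifying needed.
sims = cosine_similarity(trained_model)
print(sims[0])  # similarity of the first document to every document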

The tf-idf matrix is a sparse representation of a matrix that has your documents as the rows - let's call those D - and the individual terms (i.e. the vocabulary of words contained in those documents) as the columns - let's call that T.

In a normal (dense) matrix representation (like an array) the machine will reserve D x T blocks of data, and populate empty cells with zero, NaN, or whatever, depending on your data type (tf-idf matrices are likely to contain float-type data, so you'd see zeros). It's just a big rectangular memory block that can be referenced quickly by supplying a coordinate reference.

The sparse representation of a matrix saves space by assuming that most of the matrix is zeros, and writing in only the non-zero values as belonging to some (x, y) tuple-like index. The x stored elements part means there are x non-zero elements within that matrix.
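You can inspect those index/value pairs directly; a small sketch, again assuming the trained_model built earlier (with a real corpus this prints one line per stored element, so keep it to a small matrix):

coo = trained_model.tocoo()  # COO format stores explicit (row, col) pairs
for r, c, v in zip(coo.row, coo.col, coo.data):
    print((r, c), v)         # one line per stored (non-zero) element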

You can perform matrix math on a sparse matrix directly as long as you keep it simple, or, if you've got the memory, you can convert the matrix into a dense representation using the S.todense() function. That allows you more flexibility in what you can do, at the cost of hosting (in your example) a 5801 x 8954 x datatype_size block: if your datatype is np.float64 then that's 8 bytes per element (i.e. np.dtype(np.float64).itemsize), giving you 5801 x 8954 x 8 = 415,537,232 bytes, which by rough calculation is 400MB. That's probably manageable, as long as you're not juggling thousands of these at any one time.

Compare that to the size of your sparse matrix: it contains 48369 values of 8 bytes each, plus probably another 4-8 bytes each for the indexing, which comes to roughly 3MB. A considerable memory saving!
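Both figures are easy to check; a rough sketch, using trained_model from the snippets above (on your 5801 x 8954 corpus this should land near the ~400MB and ~3MB estimates):

import numpy as np

n_docs, n_terms = trained_model.shape

# Dense estimate: every cell would hold an 8-byte float64.
dense_bytes = n_docs * n_terms * np.dtype(np.float64).itemsize

# Actual CSR storage: the non-zero values plus the column-index and
# row-pointer arrays that record where they live.
sparse_bytes = (trained_model.data.nbytes
                + trained_model.indices.nbytes
                + trained_model.indptr.nbytes)

print(dense_bytes, sparse_bytes)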

If you're happy to work within the tf-idf representation (sparse or dense), you might be able to squeeze in these additional metrics by injecting a couple of reserved keywords, like zzz_length_document or zzz_num_words, into your vocabulary (the simplest way would be to append them to the end of your documents prior to tf-idf-ing them) and then tinkering with the associated D x T cell values to adjust the weightings accordingly. You might find you need to tone down (normalise) the numbers so that they don't dominate any vectorisation you perform on the final matrix, but a bit of experimentation should help reveal some suitable parameters.
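A minimal sketch of that idea, assuming the toy descriptions from the question; the zzz_* token names come from this answer, and the scaling constants are arbitrary placeholders to be tuned:

from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = ["he liked dogs", "she liked cats", "she hated cars"]

# Append the reserved tokens so they enter the vocabulary.
augmented = [d + " zzz_length_document zzz_num_words" for d in descriptions]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(augmented).tolil()  # LIL makes per-cell edits cheap

len_col = tfidf.vocabulary_["zzz_length_document"]
num_col = tfidf.vocabulary_["zzz_num_words"]

for i, doc in enumerate(descriptions):
    # Overwrite the placeholder tf-idf scores with (normalised) raw
    # features; these scaling constants are arbitrary placeholders.
    X[i, len_col] = len(doc) / 100.0
    X[i, num_col] = len(doc.split()) / 10.0

X = X.tocsr()  # back to CSR for efficient math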
