
scikit-learn: tfidf model representation

tl;dr: what does the tf-idf sparse matrix look like?

Assume I have the following.

from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = ["he liked dogs", "she liked cats", "she hated cars"]
tfidf = TfidfVectorizer()
trained_model = tfidf.fit_transform(descriptions)

Now I want to combine the tf-idf scores with other features of the documents and give them different weights. For example, I want to add length_document and num_words as features of a document. So each document should be represented as

d = [tfidf_score, length_document, num_words]

And then I will try to figure out the best weights for these three features to return the most similar document.

But first, I need to figure out what exactly trained_model looks like.

(Pdb) trained_model
<5801x8954 sparse matrix of type '<type 'numpy.float64'>'
    with 48369 stored elements in Compressed Sparse Row format>
(Pdb) trained_model[0]
<1x8954 sparse matrix of type '<type 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
(Pdb) trained_model[1]
<1x8954 sparse matrix of type '<type 'numpy.float64'>'
    with 11 stored elements in Compressed Sparse Row format>

There are 5801 documents in total, and they are represented by 8954 words in the corpus. Then what do the x stored elements represent?

If you have time:

I assume that each document is represented by a vector whose length is 8954 in this case. If I just add two features at the end and make the vector length 8956, it wouldn't make sense to weigh them equally. I want to make the first 8954 features take 1/3 of the weight, and the last two 2/3. Does it make sense?

Each row in the matrix corresponds to a document. The rows are stored in Compressed Sparse Row (CSR) format, and only the non-zero terms are included.

So trained_model[0], which returns the tf-idf vector for the first document, has four stored entries: one tf-idf score for each of the four distinct terms in that document. The second document has 11 stored entries, one for each of its 11 distinct terms.
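To make that concrete, here is a minimal sketch using the three-document example from the question that prints which vocabulary term each stored element of a row corresponds to (get_feature_names_out is assumed, as in recent scikit-learn versions; older versions call it get_feature_names):

from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = ["he liked dogs", "she liked cats", "she hated cats"]
tfidf = TfidfVectorizer()
trained_model = tfidf.fit_transform(descriptions)

# Vocabulary terms in column order (get_feature_names_out in recent
# scikit-learn; older versions expose get_feature_names instead).
terms = tfidf.get_feature_names_out()

# The stored elements of row 0 are the non-zero tf-idf scores for the
# distinct terms that actually appear in the first document.
row = trained_model[0]
for col, score in zip(row.indices, row.data):
    print(terms[col], score)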

Regarding your weighting: if you want to measure the similarity of documents, you should probably use a distance metric such as cosine similarity on the tf-idf vectors. Assigning 2/3 of the similarity to what is essentially the length of the document may not be what you want.
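Continuing from the sketch above, a rough example of ranking documents against a query by cosine similarity (cosine_similarity lives in sklearn.metrics.pairwise; the query string here is made up):

from sklearn.metrics.pairwise import cosine_similarity

# Reuses tfidf and trained_model from the sketch above.
# Transform a query with the same fitted vectorizer, then rank documents
# by cosine similarity of their tf-idf vectors (this works on sparse input).
query_vec = tfidf.transform(["she liked dogs"])
scores = cosine_similarity(query_vec, trained_model).ravel()
print(scores.argmax(), scores.max())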

The tf-idf matrix is a sparse representation of a matrix that has your documents as the rows - let's call those D - and the individual terms (i.e. the vocabulary of words contained in those documents) as the columns - let's call that T.

In a normal (dense) matrix representation (like an array) the machine will reserve D x T blocks of data and populate empty cells with zero, NaN or whatever, depending on your data type (tf-idf matrices are likely to contain float-type data, so you'd see zeros). It's just a big rectangular memory block that can be referenced quickly by supplying a coordinate reference.

The sparse representation of a matrix saves space by assuming that most of the matrix is zeros, and writing in the non-zero values as belonging to some (x,y) tuple-like index. The x stored elements part means there are x non-zero elements within that matrix.
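As an illustration, converting the small example matrix from the sketch above to COO ("coordinate") format makes those (row, column, value) triples explicit:

# COO format keeps one (row, col, value) triple per non-zero element,
# which is exactly what the "x stored elements" count refers to.
coo = trained_model.tocoo()
for r, c, v in zip(coo.row, coo.col, coo.data):
    print((r, c), v)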

You can perform matrix math on a sparse matrix directly as long as you keep it simple, or, if you've got the memory, you can convert it into a dense representation using the S.todense() method. That gives you more flexibility in what you can do, at the cost of holding (in your example) 5801 x 8954 x datatype_size bytes in memory; if your datatype is np.float64 then that's 8 bytes per element (i.e. np.dtype(np.float64).itemsize), giving 5801 x 8954 x 8 = 415537232 bytes, which by rough calculation is about 400MB. That's probably manageable, as long as you're not juggling thousands of these at any one time.

Compare that to the size of your sparse matrix: 48369 values at 8 bytes each, plus probably another 4-8 bytes each for the indexing, comes to well under 1MB - a considerable memory saving!
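If you want to check those numbers on your own matrix, here is a rough sketch (data, indices and indptr are the CSR buffers; nbytes is NumPy's byte count for each array):

import numpy as np

# Dense cost: rows x cols x 8 bytes for float64.
dense_bytes = trained_model.shape[0] * trained_model.shape[1] * np.dtype(np.float64).itemsize

# Sparse (CSR) cost: the value buffer plus the two index buffers.
sparse_bytes = (trained_model.data.nbytes
                + trained_model.indices.nbytes
                + trained_model.indptr.nbytes)

print(dense_bytes, sparse_bytes)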

If you're happy to work within the tf-idf representation (sparse or dense), you might be able to squeeze in these additional metrics by injecting a couple of reserved keywords, such as zzz_length_document or zzz_num_words, into your vocabulary (the simplest way would be to append them to the end of your documents prior to tf-idf-ing them) and then tinkering with the associated D-T cell values to adjust the weightings. You might find you need to tone down (normalise) the numbers so they don't dominate any vectorisation you perform on the final matrix, but a bit of experimentation should help reveal suitable parameters.
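An alternative to the reserved-keyword trick, sketched here only as one possible approach, is to leave the tf-idf block alone and append the extra columns with scipy.sparse.hstack, scaling each block to control its weight (the 1/3 vs 2/3 split from the question is used purely as an example, and the normalisation is a guess you would want to tune):

import numpy as np
from scipy.sparse import hstack, csr_matrix

# Extra per-document features: character length and word count,
# normalised so they don't swamp the unit-length tf-idf rows.
extras = np.array([[len(d), len(d.split())] for d in descriptions], dtype=np.float64)
extras /= extras.max(axis=0)

# Weight the two blocks: 1/3 for the tf-idf columns, 2/3 for the extras.
combined = hstack([trained_model * (1.0 / 3.0),
                   csr_matrix(extras) * (2.0 / 3.0)]).tocsr()

print(combined.shape)  # (number of documents, vocabulary size + 2)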
