Create a matrix of tf-idf values

I have a set of documents like:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

and a set of words like:

"sky","land","sea","water","sun","moon"

I want to create a matrix like this:

   x        D1           D2         D3
sky         tf-idf       0          tf-idf
land        0            0          0
sea         0            0          0
water       0            0          0
sun         0            tf-idf     tf-idf
moon        0            0          0

Something like the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html. The example at that link builds its vocabulary from the documents themselves, but I need to use the fixed set of words listed above.

If the particular word is present in the document then I put the tf-idf values, else I put a 0 in the matrix.

Any idea how I might build a matrix like this? Python would be best, but an R solution is also appreciated.

I am using the following code but am not sure whether I am doing the right thing or not. My code is:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords


train_set = "The sky is blue.", "The sun is bright.", "The sun in the sky is bright." #Documents
test_set = ["sky","land","sea","water","sun","moon"] #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
#print 'Fit Vectorizer to train set', trainVectorizerArray
#print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
#print
#print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
#print 
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

I am getting absurd results like this (the values are only 0 and 1, while I expect values between 0 and 1):

[[ 0.  0.  1.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]]   

I am also open to other libraries for calculating tf-idf. I just want a correct matrix like the one I described above.

An R solution could look like this:

library(tm)
docs <- c(D1 = "The sky is blue.",
          D2 = "The sun is bright.",
          D3 = "The sun in the sky is bright.")
dict <- c("sky","land","sea","water","sun","moon")
mat <- TermDocumentMatrix(Corpus(VectorSource(docs)), 
                          control=list(weighting =  weightTfIdf, 
                                       dictionary = dict))
as.matrix(mat)[dict, ]
#         Docs
# Terms          D1        D2        D3
#   sky   0.5849625 0.0000000 0.2924813
#   land  0.0000000 0.0000000 0.0000000
#   sea   0.0000000 0.0000000 0.0000000
#   water 0.0000000 0.0000000 0.0000000
#   sun   0.0000000 0.5849625 0.2924813
#   moon  0.0000000 0.0000000 0.0000000
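
For readers who want to check these numbers by hand, here is a minimal pure-Python sketch. It assumes the weighting is term frequency divided by the number of dictionary terms in the document, multiplied by log2(N / df); that assumption is inferred from the values above rather than taken from the tm source, and the helper names (tokenize, tf_idf) are made up for this illustration.

```python
import math
import re

docs = {
    "D1": "The sky is blue.",
    "D2": "The sun is bright.",
    "D3": "The sun in the sky is bright.",
}
vocab = ["sky", "land", "sea", "water", "sun", "moon"]


def tokenize(text):
    # Lowercase and keep alphabetic runs only (a simplification).
    return re.findall(r"[a-z]+", text.lower())


# Raw counts of each dictionary word in each document.
counts = {
    name: {word: tokenize(text).count(word) for word in vocab}
    for name, text in docs.items()
}
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df = {word: sum(1 for c in counts.values() if c[word] > 0) for word in vocab}


def tf_idf(word, doc):
    # Assumed weighting: (count / dictionary terms in doc) * log2(N / df).
    total = sum(counts[doc].values())
    if total == 0 or df[word] == 0:
        return 0.0
    return (counts[doc][word] / total) * math.log2(n_docs / df[word])


matrix = {word: {doc: tf_idf(word, doc) for doc in docs} for word in vocab}

# Print a term-by-document table in the same layout as the question.
print("{:<8}".format("x") + "".join("{:>12}".format(d) for d in docs))
for word in vocab:
    print("{:<8}".format(word)
          + "".join("{:>12.7f}".format(matrix[word][d]) for d in docs))
```

Under that assumption, "sky" in D1 works out to 1 * log2(3/2) and "sky" in D3 to 1/2 * log2(3/2), which agrees with the first row of the table above.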

I believe what you want is

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stopWords, vocabulary=test_set)
matrix = vectorizer.fit_transform(train_set)

(Note that test_set here is not a test set; it is a vocabulary.)
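
A self-contained sketch of the same idea, laid out as the term-by-document table from the question. It leaves out stop_words, on the grounds that a fixed vocabulary already restricts counting to the listed words, and the variable names (documents, vocabulary, term_doc) are chosen just for this example. The code in the question instead passed the word list to transform(), so each word was vectorized as its own one-word document; that is most likely why every row there held at most a single 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
]
vocabulary = ["sky", "land", "sea", "water", "sun", "moon"]

# A fixed vocabulary pins the feature columns to these words, in this order.
vectorizer = TfidfVectorizer(vocabulary=vocabulary)

# fit_transform returns a sparse documents-by-terms matrix.
doc_term = vectorizer.fit_transform(documents)

# Transpose to get one row per word and one column per document.
term_doc = doc_term.T.toarray()

print("{:<8}{:>8}{:>8}{:>8}".format("x", "D1", "D2", "D3"))
for word, row in zip(vocabulary, term_doc):
    print("{:<8}".format(word) + "".join("{:>8.4f}".format(v) for v in row))
```

With the default settings each document vector is L2-normalized, so a document that contains only one vocabulary word (D1, D2) still gets a weight of 1.0 for it, while D3 splits its weight between "sky" and "sun". Passing norm=None to TfidfVectorizer should give unnormalized tf-idf values if that is what you are after; the numbers will differ from the R output either way, because the two libraries use different idf formulas.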
