Create a matrix of tf-idf values

Question

I have a set of documents like:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

and a set of words like:

"sky","land","sea","water","sun","moon"

I want to create a matrix like this:

   x        D1           D2         D3
sky         tf-idf       0          tf-idf
land        0            0          0
sea         0            0          0
water       0            0          0
sun         0            tf-idf     tf-idf
moon        0            0          0

This is similar to the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html . That example builds its vocabulary from the words in the documents themselves, whereas I need to use the fixed set of words listed above.

If a word is present in a document, its cell holds the tf-idf value; otherwise the cell is 0.

Any idea how I might build a matrix like this? Python would be best, but R is also appreciated.

I am using the following code but am not sure whether I am doing the right thing or not. My code is:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords


train_set = "The sky is blue.", "The sun is bright.", "The sun in the sky is bright." #Documents
test_set = ["sky","land","sea","water","sun","moon"] #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
#print 'Fit Vectorizer to train set', trainVectorizerArray
#print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
#print
#print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
#print 
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

I am getting absurd results like this (the values are only 0 and 1, while I expected values between 0 and 1):

[[ 0.  0.  1.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]]

I am also open to other libraries for calculating tf-idf. I just want the correct matrix described above.

Answer 1

An R solution could look like this:

library(tm)
docs <- c(D1 = "The sky is blue.",
          D2 = "The sun is bright.",
          D3 = "The sun in the sky is bright.")
dict <- c("sky","land","sea","water","sun","moon")
mat <- TermDocumentMatrix(Corpus(VectorSource(docs)), 
                          control=list(weighting =  weightTfIdf, 
                                       dictionary = dict))
as.matrix(mat)[dict, ]
#         Docs
# Terms          D1        D2        D3
#   sky   0.5849625 0.0000000 0.2924813
#   land  0.0000000 0.0000000 0.0000000
#   sea   0.0000000 0.0000000 0.0000000
#   water 0.0000000 0.0000000 0.0000000
#   sun   0.0000000 0.5849625 0.2924813
#   moon  0.0000000 0.0000000 0.0000000
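
For readers who want to check the numbers by hand, here is a minimal Python sketch of the arithmetic that appears to be behind the table above. It assumes the weighting is (count of the term in the document / count of all dictionary terms in the document) * log2(number of documents / number of documents containing the term). That formula is inferred from the output, not taken from the tm source, and the names `docs`, `terms`, `counts` and `tfidf` are only illustrative.

```python
import math
import re

docs = {
    "D1": "The sky is blue.",
    "D2": "The sun is bright.",
    "D3": "The sun in the sky is bright.",
}
terms = ["sky", "land", "sea", "water", "sun", "moon"]

# Per document: how often each dictionary term occurs (other words are ignored).
counts = {
    name: [re.findall(r"[a-z]+", text.lower()).count(term) for term in terms]
    for name, text in docs.items()
}

n_docs = len(docs)
tfidf = {}
for i, term in enumerate(terms):
    # Document frequency: number of documents that contain the term.
    df = sum(1 for name in docs if counts[name][i] > 0)
    # Assumed idf: log2(N / df); terms that never occur get 0.
    idf = math.log2(n_docs / df) if df else 0.0
    tfidf[term] = {}
    for name in docs:
        total = sum(counts[name])
        # Assumed tf: term count normalised by the dictionary terms in the document.
        tf = counts[name][i] / total if total else 0.0
        tfidf[term][name] = tf * idf

# Print one row per term, one column per document.
for term in terms:
    print(term.ljust(6), " ".join("%.7f" % tfidf[term][name] for name in docs))
```

Under that assumption, "sky" in D1 is 1 * log2(3/2) and "sky" in D3 is 1/2 * log2(3/2), which is where the halved value in the D3 column comes from.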

Answer 2

I believe what you want is

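from sklearn.feature_extraction.text import TfidfVectorizer
# stopWords, test_set and train_set are the variables defined in the question
vectorizer = TfidfVectorizer(stop_words=stopWords, vocabulary=test_set)
matrix = vectorizer.fit_transform(train_set)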

(As I said earlier, this is not a test set, this is a vocabulary.)
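
To get the layout from the question (words as rows, documents as columns), the result can be transposed. The following is a self-contained sketch, not a definitive implementation: it leaves out the NLTK stop-word list (with a fixed vocabulary of single words it should not change the result), and the names `documents`, `vocabulary`, `doc_term` and `term_doc` are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
]
vocabulary = ["sky", "land", "sea", "water", "sun", "moon"]

# A list passed as `vocabulary` fixes the columns, in the order given.
vectorizer = TfidfVectorizer(vocabulary=vocabulary)

# fit_transform returns a sparse documents x terms matrix.
doc_term = vectorizer.fit_transform(documents)

# Transpose to terms x documents and convert to a dense array.
term_doc = doc_term.T.toarray()

# Print one row per vocabulary word, one column per document.
for term, row in zip(vocabulary, term_doc):
    print(term.ljust(6), " ".join("%.4f" % value for value in row))
```

Two things to keep in mind. The values will not match the R output, because scikit-learn by default uses a smoothed natural-log idf and then scales each document vector to unit length (`norm='l2'`); a document that contains only one vocabulary word therefore gets the weight 1.0 for that word. That same normalisation is the likely reason the code in the question printed only 0 and 1: each word in `test_set` was transformed as if it were a one-word document.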
