
Creating a TF-IDF Matrix in Python 3.6

I have 100 documents (each document is a simple list of the words in that document). I want to create a TF-IDF matrix so that I can build a small word search that ranks results. I tried using TfidfVectorizer but got lost in the syntax. Any help would be much appreciated. Regards.

Edit: I converted each list into a string and collected the strings in a parent list:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(vocabulary=word_set)
matrix = vectorizer.fit_transform(doc_strings)
print(matrix)

Here word_set is the set of all distinct words and doc_strings is a list that contains each document as a string. However, when I print the matrix I get the output below:

  (0, 839)  0.299458532286
  (0, 710)  0.420878518454
  (0, 666)  0.210439259227
  (0, 646)  0.149729266143
  (0, 550)  0.210439259227
  (0, 549)  0.210439259227
  (0, 508)  0.210439259227
  (0, 492)  0.149729266143
  (0, 479)  0.149729266143
  (0, 425)  0.149729266143
  (0, 401)  0.210439259227
  (0, 332)  0.210439259227
  (0, 310)  0.210439259227
  (0, 253)  0.149729266143
  (0, 216)  0.210439259227
  (0, 176)  0.149729266143
  (0, 122)  0.149729266143
  (0, 119)  0.210439259227
  (0, 111)  0.149729266143
  (0, 46)   0.210439259227
  (0, 26)   0.210439259227
  (0, 11)   0.149729266143
  (0, 0)    0.210439259227
  (1, 843)  0.0144007295367
  (1, 842)  0.0288014590734
  (1, 25)   0.0144007295367
  (1, 24)   0.0144007295367
  (1, 23)   0.0432021886101
  (1, 22)   0.0144007295367
  (1, 21)   0.0288014590734
  (1, 20)   0.0288014590734
  (1, 19)   0.0288014590734
  (1, 18)   0.0432021886101
  (1, 17)   0.0288014590734
  (1, 16)   0.0144007295367
  (1, 15)   0.0144007295367
  (1, 14)   0.0432021886101
  (1, 13)   0.0288014590734
  (1, 12)   0.0144007295367
  (1, 11)   0.0102462376715
  (1, 10)   0.0144007295367
  (1, 9)    0.0288014590734
  (1, 8)    0.0288014590734
  (1, 7)    0.0144007295367
  (1, 6)    0.0144007295367
  (1, 5)    0.0144007295367
  (1, 4)    0.0144007295367
  (1, 3)    0.0144007295367
  (1, 2)    0.0288014590734
  (1, 1)    0.0144007295367

Is this correct? If so, how can I look up the rank of a given word in a particular document?

Your code is working fine. Here is an example with a couple of sentences, where each sentence plays the role of one document. Hopefully this will help you.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend", 
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)
print(matrix)

fit_transform() returns a tf-idf-weighted document-term matrix.

The print() statement outputs the following:

  (0, 2)    0.379303492809
  (0, 6)    0.379303492809
  (0, 7)    0.379303492809
  (0, 8)    0.533097824526
  (0, 9)    0.533097824526
  (1, 3)    0.342619853089
  (1, 5)    0.342619853089
  (1, 4)    0.342619853089
  (1, 0)    0.342619853089
  (1, 11)   0.342619853089
  (1, 10)   0.342619853089
  (1, 1)    0.342619853089
  (1, 2)    0.243776847332
  (1, 6)    0.243776847332
  (1, 7)    0.243776847332

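So, how can we interpret this matrix? It is stored in sparse form, so only the non-zero entries are printed. Each row shows a tuple (x, y) followed by a value: x is the document no. (in this case the sentence no.), y is the feature no., and the value is the tf-idf weight of that feature in that document.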
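If it helps, the same (document no., feature no., weight) triples can be pulled out programmatically instead of read off the printed output. This is only a minimal sketch on the two-sentence corpus from above; the names coo and entries are my own, not anything scikit-learn defines.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# One row per document, one column per vocabulary word.
print(matrix.shape)

# COO format exposes parallel arrays of row indices, column indices and values.
coo = matrix.tocoo()
entries = [(int(d), int(f), float(w)) for d, f, w in zip(coo.row, coo.col, coo.data)]
for doc_no, feature_no, weight in entries:
    print(doc_no, feature_no, round(weight, 4))
```

Any (document, feature) pair that is not listed has weight zero.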

To understand this better, let's print the list of features (in our case, the features are words) together with their indices.

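# On scikit-learn 1.0 or later use get_feature_names_out(); get_feature_names() was removed in 1.2.
for i, feature in enumerate(vectorizer.get_feature_names()):
    print(i, feature)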

It outputs:

0 can
1 don
2 friend
3 from
4 get
5 help
6 my
7 stackoverflow
8 to
9 welcome
10 worry
11 you
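Going the other way, from a word to its index, does not require scanning that list: the fitted vectorizer keeps a word-to-column dictionary in its vocabulary_ attribute. Below is a small sketch of a lookup built on it; tfidf_weight is a hypothetical helper name of mine, and returning 0.0 for unknown words is just one possible convention.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps each word to its column number in the matrix.
col = vectorizer.vocabulary_["friend"]
print(col)

def tfidf_weight(word, doc_no):
    """Hypothetical helper: weight of `word` in document `doc_no`, 0.0 if the word is unknown."""
    idx = vectorizer.vocabulary_.get(word)
    if idx is None:
        return 0.0
    return float(matrix[doc_no, idx])

print(tfidf_weight("friend", 0))
print(tfidf_weight("help", 0))  # "help" does not occur in sentence 0, so this is 0.0
```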

So the sentence "welcome to stackoverflow my friend" is transformed into the following entries.

(0, 2)  0.379303492809
(0, 6)  0.379303492809
(0, 7)  0.379303492809
(0, 8)  0.533097824526
(0, 9)  0.533097824526

For example, the first two rows can be interpreted as follows.

0 = sentence no.
2 = word index (index of the word `friend`)
0.379303492809 = tf-idf weight

0 = sentence no.
6 = word index (index of the word `my`)
0.379303492809 = tf-idf weight
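In case you wonder where a number like 0.379303492809 comes from, it can be reproduced by hand. The sketch below assumes the default TfidfVectorizer settings (smoothed idf, raw term counts, L2 normalisation of each row) and imitates the default tokenisation with a regular expression; tfidf_row is a name I made up for illustration.

```python
import math
import re

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]

# Lowercased tokens of two or more word characters (imitates the default token pattern).
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
n_docs = len(docs)
vocab = sorted({w for doc in docs for w in doc})

# Document frequency and smoothed idf: ln((1 + n) / (1 + df)) + 1
df = {w: sum(w in doc for doc in docs) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Hypothetical helper: tf * idf for each word in one document, scaled to unit L2 norm."""
    raw = {w: doc.count(w) * idf[w] for w in set(doc)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

row0 = tfidf_row(docs[0])
print(row0["friend"], row0["welcome"])
```

The two printed values should agree with the weights shown for (0, 2) and (0, 9) up to rounding.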

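From the tf-idf values you can see that the words welcome and to rank higher than the other words in the first sentence (sentence no. 0). Both occur only in that sentence, whereas friend, my and stackoverflow occur in both sentences and therefore get a lower weight.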
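To get that ranking without eyeballing the numbers, one option is to sort the non-zero weights of a row. This is only a rough sketch: rank_words and index_to_word are names I chose, ties are broken alphabetically (an arbitrary choice), and the row is converted to a dense array, which is fine for a small vocabulary but wasteful for a very large one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# Invert the word -> column mapping so column numbers can be turned back into words.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

def rank_words(doc_no):
    """Hypothetical helper: (word, weight) pairs of one document, highest weight first."""
    row = matrix[doc_no].toarray().ravel()
    pairs = [(index_to_word[i], float(w)) for i, w in enumerate(row) if w > 0]
    # Sort by descending weight; equal weights fall back to alphabetical order.
    return sorted(pairs, key=lambda p: (-p[1], p[0]))

ranked = rank_words(0)
for position, (word, weight) in enumerate(ranked, start=1):
    print(position, word, round(weight, 4))
```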

You can extend this example to look up the rank of a given word in a particular sentence or document, as your search requires.
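For the word search itself, one possible extension is to take the column that belongs to the query word and sort the documents by their weight in it. Treat this as a sketch under my own assumptions rather than the only way to do it: search is a hypothetical function name, the query is a single word, and it is lowercased because the vectorizer lowercases text by default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

def search(word):
    """Hypothetical helper: (doc_no, weight) pairs for documents containing `word`, best first."""
    col = vectorizer.vocabulary_.get(word.lower())
    if col is None:
        return []  # the word never occurred in the corpus
    column = matrix[:, col].toarray().ravel()
    hits = [(doc_no, float(w)) for doc_no, w in enumerate(column) if w > 0]
    return sorted(hits, key=lambda h: -h[1])

print(search("friend"))  # both sentences, sentence 0 first because its weight is higher
print(search("help"))    # only sentence 1
print(search("python"))  # not in the vocabulary, so no hits
```

With your 100 documents you would pass doc_strings instead of corpus; the rest stays the same.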
