
Convert a tf-idf matrix into a pandas DataFrame

I have the following dataset:

test_set = ("The sun in the sky", "The sun in the light", "Do not blame it on moonlight", "Do not blame it on sunshine")

Now I use the following code to create a tf-idf matrix:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit_transform(test_set)

smatrix = vectorizer.transform(test_set)
smatrix.todense()

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(smatrix)
tf_idf_matrix = tfidf.transform(smatrix)

What I would like to do now is 'feed' this matrix to a k-means clustering algorithm, for example like this:

import pandas as pd
from sklearn import cluster

df = pd.DataFrame([[0.2, 0.3, 0.4], [0.2, 0.3, 0.41], [0.2, 0.1, 0.05], [0.1, 0.1, 0.08]], columns=('column1', 'column2', 'column3'))

k_means = cluster.KMeans(n_clusters=2)
k_means.fit(df)
print(k_means.labels_)

However, I can't seem to convert the matrix into a DataFrame. If I do:

df = pd.DataFrame(tf_idf_matrix)

I get

Traceback (most recent call last):
File "/Users/marcvanderpeet/PycharmProjects/untitled/test.py", line 47, in <module>
df = pd.DataFrame(tf_idf_matrix)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 345, in __init__
raise PandasError('DataFrame constructor not properly called!')

pandas.core.common.PandasError: DataFrame constructor not properly called!

Any thoughts on how I can convert this?

tf_idf_matrix is a SciPy sparse matrix of type scipy.sparse.csr.csr_matrix; you can check this with type(tf_idf_matrix). According to the pandas documentation, pd.DataFrame can be constructed from a NumPy ndarray (structured or homogeneous), a dict, or another DataFrame, but not directly from a sparse matrix. To get a dense NumPy representation, call tf_idf_matrix = tf_idf_matrix.todense(). This converts the scipy.sparse.csr.csr_matrix into a numpy.matrixlib.defmatrix.matrix, which pd.DataFrame accepts; tf_idf_matrix.toarray() does the same job and returns a plain ndarray. After that you can build df and pass it to k_means.fit(). Keep in mind that the dense copy stores every zero explicitly, so it can get large when the vocabulary is big.

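Note that from pandas 0.20 onward you could create a SparseDataFrame directly from a SciPy sparse matrix. SparseDataFrame was deprecated in pandas 0.25 and removed in 1.0, so the snippet below only works on older versions; on current versions pd.DataFrame.sparse.from_spmatrix serves the same purpose.

from scipy.sparse import csr_matrix

sp_arr = csr_matrix(arr)
sdf = pd.SparseDataFrame(sp_arr)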
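On pandas 1.0 and later, where SparseDataFrame no longer exists, a rough equivalent uses pd.DataFrame.sparse.from_spmatrix. This is only a sketch: the small arr and the column names "a", "b", "c" are made-up placeholders, not data from the question.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data standing in for a real tf-idf matrix.
arr = [[0.0, 0.5, 0.0],
       [0.3, 0.0, 0.0]]
sp_arr = csr_matrix(arr)

# Build a DataFrame whose columns are sparse arrays, without densifying first.
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr, columns=["a", "b", "c"])

print(sdf.dtypes)          # each column has a Sparse dtype
print(sdf.sparse.density)  # fraction of stored (non-zero) values

# Convert to an ordinary dense DataFrame only when it is actually needed.
dense_df = sdf.sparse.to_dense()
```

Keeping the frame sparse avoids allocating memory for the zeros, which is usually most of a tf-idf matrix.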

We can also use a scikit-learn Pipeline:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer   
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import KMeans

test_set = ["The sun in the sky", "The sun in the light",
            "Do not blame it on moonlight", "Do not blame it on sunshine"]

df = pd.DataFrame(test_set, columns =['sent'])
print(df)
                           sent
0            The sun in the sky
1          The sun in the light
2  Do not blame it on moonlight
3   Do not blame it on sunshine

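model = Pipeline([('vectorizer', CountVectorizer()), ('tf_trans', TfidfTransformer()), ('k_means', KMeans(n_clusters=2))])

# Now we can pass the raw text directly to the model. Pass the 'sent' column
# rather than the whole DataFrame: iterating over a DataFrame yields its column names.
model.fit(df['sent'])

# To predict the cluster of a new comment, just pass it in a list
print(model.predict(['enjoy sunshine ']))
# Output: [0] (cluster numbering is arbitrary, so this may also be [1])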
