I have the following dataset:
test_set = ("The sun in the sky", "The sun in the light", "Do not blame it on moonlight", "Do not blame it on sunshine")
Now I use the following code to create a tf-idf matrix:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns the sparse count matrix in one step
smatrix = vectorizer.fit_transform(test_set)
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(smatrix)
tf_idf_matrix = tfidf.transform(smatrix)
What I would like to do now is to 'feed' this matrix to a k-means clustering algorithm. So for example like this:
import pandas as pd
from sklearn import cluster
df = pd.DataFrame([[0.2, 0.3, 0.4], [0.2, 0.3, 0.41], [0.2, 0.1, 0.05], [0.1, 0.1, 0.08]], columns=('column1', 'column2', 'column3'))
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(df)
print(k_means.labels_)
I can't seem to convert the matrix into a DataFrame, however. If I do:
df = pd.DataFrame(tf_idf_matrix)
I get
Traceback (most recent call last):
File "/Users/marcvanderpeet/PycharmProjects/untitled/test.py", line 47, in <module>
df = pd.DataFrame(tf_idf_matrix)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 345, in __init__
raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!
Any thoughts on how I can convert this?
tf_idf_matrix has type scipy.sparse.csr.csr_matrix; you can check this with type(tf_idf_matrix). According to the pandas documentation, pd.DataFrame can be constructed from a numpy ndarray (structured or homogeneous), a dict, or another DataFrame, but not directly from a scipy sparse matrix. To convert tf_idf_matrix to a dense numpy representation, you can do the following: tf_idf_matrix = tf_idf_matrix.todense(). This turns the scipy.sparse.csr.csr_matrix into a numpy.matrixlib.defmatrix.matrix, which pd.DataFrame can work with. After that you can build df and pass it to the k_means.fit() method.
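The conversion described above can be sketched end to end as follows. This is a minimal sketch, not the original poster's exact code: it uses toarray() instead of todense() (it returns a plain numpy.ndarray rather than a numpy.matrix), and the column names, n_init and random_state values are assumptions added for readability and reproducibility.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

test_set = ("The sun in the sky", "The sun in the light",
            "Do not blame it on moonlight", "Do not blame it on sunshine")

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(test_set)  # sparse matrix of term counts
tf_idf_matrix = TfidfTransformer(norm="l2").fit_transform(counts)

# vocabulary_ maps each term to its column index; sort terms by that index
# to get readable column names (an illustrative choice, not required).
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# toarray() densifies the sparse matrix into a numpy.ndarray,
# which the DataFrame constructor accepts.
df = pd.DataFrame(tf_idf_matrix.toarray(), columns=columns)

# n_init and random_state are assumptions, set only to make runs repeatable.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0)
k_means.fit(df)
print(k_means.labels_)
```

If the DataFrame is not needed for anything else, KMeans.fit() also accepts the sparse tf_idf_matrix directly, which avoids densifying a large vocabulary.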
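Note that since pandas 0.20 you can create a pandas SparseDataFrame directly from a scipy sparse matrix (SparseDataFrame was later deprecated and then removed in pandas 1.0, so this only works on older versions):
from scipy.sparse import csr_matrix
sp_arr = csr_matrix(arr)  # arr is any 2-D array-like
sdf = pd.SparseDataFrame(sp_arr)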
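On newer pandas versions, where SparseDataFrame is no longer available, the same idea can be sketched with pd.DataFrame.sparse.from_spmatrix. This assumes pandas 0.25 or later; the small arr below is made-up illustrative data, not taken from the question.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data, mostly zeros
arr = [[0.0, 0.3, 0.0],
       [0.2, 0.0, 0.0]]
sp_arr = csr_matrix(arr)

# Builds a DataFrame whose columns are stored as sparse arrays
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
print(sdf.dtypes)

# The .sparse accessor can convert back to an ordinary dense DataFrame
dense = sdf.sparse.to_dense()
print(dense)
```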
We can also use a scikit-learn Pipeline:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import KMeans
test_set = ["The sun in the sky", "The sun in the light", "Do not blame it on moonlight", "Do not blame it on sunshine"]
df = pd.DataFrame(test_set, columns =['sent'])
print(df)
sent
0 The sun in the sky
1 The sun in the light
2 Do not blame it on moonlight
3 Do not blame it on sunshine
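model = Pipeline([('vectorizer', CountVectorizer()), ('tf_trans', TfidfTransformer()), ('k_means', KMeans(n_clusters=2))])
# Now we can pass the raw sentences directly to the model.
# Pass the column df['sent'], not the whole DataFrame: iterating over a DataFrame yields its column names, not its rows.
model.fit(df['sent'])
# To predict the cluster of a new comment, we just pass it in a list
print(model.predict(['enjoy sunshine ']))
Output (the cluster index may be 0 or 1, depending on the k-means initialization):
[0]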
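To see which cluster each training sentence ended up in, the fitted KMeans step can be read back out of the pipeline. This is a minimal sketch: the 'cluster' column name and the n_init and random_state values are assumptions added for illustration and repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

df = pd.DataFrame({"sent": ["The sun in the sky", "The sun in the light",
                            "Do not blame it on moonlight",
                            "Do not blame it on sunshine"]})

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tf_trans", TfidfTransformer()),
    # n_init and random_state are assumptions, set only for repeatable runs
    ("k_means", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# Fit on the column of strings, one document per row
model.fit(df["sent"])

# named_steps gives access to the fitted estimator inside the pipeline
labels = model.named_steps["k_means"].labels_
df["cluster"] = labels  # hypothetical column name
print(df)

# A new sentence is assigned to the nearest cluster centre
new_label = model.predict(["enjoy sunshine"])[0]
print(new_label)
```

The numeric labels themselves are arbitrary; what matters is which sentences share a label.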