I'm doing some basic machine learning and have a sparse matrix resulting from TFIDF as follows:
<983x33599 sparse matrix of type '<type 'numpy.float64'>'
with 232944 stored elements in Compressed Sparse Row format>
Then I have a DataFrame with a title column. I want to combine these into one DataFrame, but when I try to use concat, I get an error saying that I can't combine a DataFrame with a non-DataFrame object.
How do I get around this?
Thanks!
Consider the following demo:
Source DF:
In [2]: df
Out[2]:
                     text
0        is it good movie
1  wooow is it very goode
2               bad movie
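Solution: let's create a SparseDataFrame out of the TF-IDF sparse matrix:
# Note: pd.SparseDataFrame was removed in pandas 1.0 and get_feature_names()
# was removed in scikit-learn 1.2, so this snippet needs older versions of both.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
# fit_transform() returns a scipy CSR matrix; wrap it without densifying
sdf = pd.SparseDataFrame(vect.fit_transform(df['text']),
                         columns=vect.get_feature_names(),
                         default_fill_value=0)
# attach the original text column to the sparse frame
sdf['text'] = df['text']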
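On current library versions the same idea can be sketched with pd.DataFrame.sparse.from_spmatrix, which builds a regular DataFrame whose columns have a sparse dtype. This is only a sketch, not the original answer's code: the demo frame is rebuilt from the output shown above, and the getattr fallback for the feature-name method is my own addition to cover both old and new scikit-learn.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Demo frame rebuilt from the "Source DF" shown in the answer
df = pd.DataFrame({'text': ['is it good movie',
                            'wooow is it very goode',
                            'bad movie']})

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                       analyzer='word', stop_words='english')
X = vect.fit_transform(df['text'])  # scipy CSR matrix

# Newer scikit-learn exposes get_feature_names_out(); older releases only
# have get_feature_names(). Pick whichever is available.
get_names = getattr(vect, 'get_feature_names_out', None) or vect.get_feature_names

# Columns keep a sparse dtype, so the matrix is never densified
sdf = pd.DataFrame.sparse.from_spmatrix(X, index=df.index, columns=get_names())
sdf['text'] = df['text']

print(sdf.dtypes)
```

Passing index=df.index keeps the rows aligned with the original frame, which matters if df does not use a default 0..n-1 index.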
Result:
In [13]: sdf
Out[13]:
   bad  good     goode     wooow                    text
0  0.0   1.0  0.000000  0.000000        is it good movie
1  0.0   0.0  0.707107  0.707107  wooow is it very goode
2  1.0   0.0  0.000000  0.000000               bad movie
In [14]: sdf.memory_usage()
Out[14]:
Index    80
bad       8
good      8
goode     8
wooow     8
text     24
dtype: int64
PS: pay attention to the output of .memory_usage(): we didn't lose the sparseness. If we used pd.concat, join, merge, etc., we would lose the sparseness, as all these methods generate a new regular (non-sparse) copy of the merged DataFrames.
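If you want to verify this on your own pandas version rather than take it on faith, a quick check is to compare memory_usage() and look at the dtypes after combining. The sketch below uses a random scipy matrix as a stand-in for the TF-IDF output (the shape, density and the 'title' column are made up for illustration), and it prints the number of sparse columns after pd.concat instead of assuming either outcome, since the behaviour has changed between pandas releases.

```python
import pandas as pd
from scipy import sparse

# Stand-in for a TF-IDF matrix: 1000 x 200 with about 1% non-zero entries
X = sparse.random(1000, 200, density=0.01, format='csr', random_state=0)

sdf = pd.DataFrame.sparse.from_spmatrix(X)  # sparse-dtype columns
dense_df = sdf.sparse.to_dense()            # same data, regular float columns

sparse_bytes = int(sdf.memory_usage().sum())
dense_bytes = int(dense_df.memory_usage().sum())
print('sparse:', sparse_bytes, 'bytes; dense:', dense_bytes, 'bytes')

# Combine with a hypothetical title column and count the columns that are
# still sparse afterwards
titles = pd.DataFrame({'title': ['some title'] * len(sdf)})
combined = pd.concat([titles, sdf], axis=1)
n_sparse = sum(isinstance(t, pd.SparseDtype) for t in combined.dtypes)
print(n_sparse, 'of', combined.shape[1], 'columns are sparse after concat')
```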
Maybe you can try using to_dense() on the sparse matrix before doing the concatenation, and later convert back to a sparse matrix with to_sparse(). Hope it helps.
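A rough sketch of that round trip, under a few assumptions: the scipy matrix is densified with toarray(), and because to_sparse() was removed in pandas 1.0, the conversion back uses astype with pd.SparseDtype instead. The small matrix and feature names are copied from the demo output in the other answer. Keep in mind that densifying the 983x33599 matrix from the question would need roughly 264 MB as float64, so this only makes sense if that fits in memory.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Stand-in for the TF-IDF result, using the values from the demo above
X = sparse.csr_matrix(np.array([[0.0, 1.0, 0.0, 0.0],
                                [0.0, 0.0, 0.707107, 0.707107],
                                [1.0, 0.0, 0.0, 0.0]]))
features = ['bad', 'good', 'goode', 'wooow']
df = pd.DataFrame({'text': ['is it good movie',
                            'wooow is it very goode',
                            'bad movie']})

# 1. Densify and wrap in a regular DataFrame aligned with df
dense = pd.DataFrame(X.toarray(), columns=features, index=df.index)

# 2. Concatenate column-wise; both inputs are now DataFrames
combined = pd.concat([df, dense], axis=1)

# 3. Convert only the numeric feature columns back to a sparse dtype
combined = combined.astype({name: pd.SparseDtype('float64', 0.0)
                            for name in features})
print(combined.dtypes)
```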