
One-hot encoded sparse matrix in Python

I want to create one-hot encoded features as a sparse matrix. I am trying to use pd.get_dummies with the sparse flag set to True, as shown below.

X = df.iloc[:, :2]
y = df.iloc[:, -1]
X = pd.get_dummies(X, columns = ['id', 'video_id'], sparse=True)

But this does not seem to give the expected result. All I get is a one-hot encoded DataFrame, not a CSR matrix. What is the correct way to create a one-hot encoded sparse matrix?

Thanks in advance

To get the sparse matrix you can use scipy.sparse.csr_matrix as described here: Convert Pandas dataframe to Sparse Numpy Matrix directly

import numpy as np
import pandas as pd
import scipy.sparse

test_df = pd.DataFrame(np.arange(10), columns=['category'])

# Build a CSR matrix from the dense array returned by get_dummies
scipy.sparse.csr_matrix(pd.get_dummies(test_df).values)

Output

<10x1 sparse matrix of type '<class 'numpy.longlong'>'
    with 9 stored elements in Compressed Sparse Row format>
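
Note that the result above is 10x1, not 10x10: when no columns argument is given, pd.get_dummies only encodes object, string, or category columns, so the integer column is passed through unchanged. A minimal sketch of the same example with the column named explicitly, assuming a fixed uint8 dtype so the result does not depend on the pandas version (the variable names passthrough, dummies and X_csr are only illustrative):

```python
import numpy as np
import pandas as pd
import scipy.sparse

test_df = pd.DataFrame(np.arange(10), columns=['category'])

# Without `columns`, the integer column is not encoded at all.
passthrough = pd.get_dummies(test_df)

# Naming the column forces one indicator column per distinct value.
dummies = pd.get_dummies(test_df, columns=['category'], dtype=np.uint8)

# Wrap the dense indicator array in a CSR matrix.
X_csr = scipy.sparse.csr_matrix(dummies.values)
print(passthrough.shape, dummies.shape, X_csr.nnz)
```

This still builds the dense array first, so it is only suitable when the dense one-hot matrix fits in memory.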

Setting sparse = True only changes the type of object ( np.array vs SparseArray ) used internally to back the output columns; the result is still a pandas DataFrame ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html ):

sparse : Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
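
Because the columns are backed by SparseArray, the DataFrame can be converted to a SciPy matrix without going through a dense array, using the DataFrame.sparse.to_coo() accessor. A rough sketch, assuming every column of the frame is sparse (true here because both columns are encoded) and using a small made-up DataFrame as a stand-in for the asker's data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame.
df = pd.DataFrame({
    'id': [1, 2, 1, 3],
    'video_id': ['a', 'b', 'b', 'a'],
    'label': [0, 1, 0, 1],
})
X = df.iloc[:, :2]
y = df.iloc[:, -1]

# Every resulting column is sparse, since both input columns are encoded.
X_sparse_df = pd.get_dummies(X, columns=['id', 'video_id'],
                             sparse=True, dtype=np.uint8)

# to_coo() returns a SciPy COO matrix; tocsr() converts it to CSR.
X_csr = X_sparse_df.sparse.to_coo().tocsr()
print(X_csr.shape, X_csr.nnz)
```

If some columns were left dense (for example, numeric features that are not encoded), to_coo() would not apply directly and those columns would need to be handled separately.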

Note that setting sparse = True does not speed up get_dummies. On this small example it is roughly five times slower:

  • Getting dummies with sparse = True
%timeit pd.get_dummies(test_df.category, sparse=True)

Output

2.21 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  • Getting dummies with sparse = False
%timeit pd.get_dummies(test_df.category, sparse=False)

Output

454 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
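
The timings above only measure speed; the usual reason to ask for sparse output is memory. A minimal sketch comparing the two, assuming a Series with 1,000 distinct values (the size and the variable names are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

n = 1000
s = pd.Series(np.arange(n), name='category')

dense = pd.get_dummies(s, sparse=False, dtype=np.uint8)
sparse = pd.get_dummies(s, sparse=True, dtype=np.uint8)

# Bytes used by the column data, ignoring the row index.
dense_bytes = dense.memory_usage(index=False).sum()
sparse_bytes = sparse.memory_usage(index=False).sum()

# Fraction of entries actually stored in the sparse frame.
density = sparse.sparse.density

print(dense_bytes, sparse_bytes, density)
```

With one non-zero entry per row, the sparse frame stores only a small fraction of the values the dense frame holds, and the gap grows with the number of categories.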
