
Python Applymap taking time to run

I have a matrix of data (55K x 8.5K) containing counts. Most entries are zero, but some hold small counts. Say something like this:

 a  b  c
0  4  3  3
1  1  2  1
2  2  1  0
3  2  0  1
4  2  0  4

I want to binarize the cell values (1 if the count is positive, 0 otherwise).

I did the following:

df_preference=df_recommender.applymap(lambda x: np.where(x >0, 1, 0))

The code works, but it takes a very long time to run.

Why is that?

Is there a faster way?

Thanks

Edit:

I get an error when doing df.to_pickle:

df_preference.to_pickle('df_preference.pickle')

I get this:

---------------------------------------------------------------------------
SystemError                               Traceback (most recent call last)
<ipython-input-16-3fa90d19520a> in <module>()
      1 # Pickling the data to the disk
      2 
----> 3 df_preference.to_pickle('df_preference.pickle')

\\dwdfhome01\Anaconda\lib\site-packages\pandas\core\generic.pyc in to_pickle(self, path)
   1170         """
   1171         from pandas.io.pickle import to_pickle
-> 1172         return to_pickle(self, path)
   1173 
   1174     def to_clipboard(self, excel=None, sep=None, **kwargs):

\\dwdfhome01\Anaconda\lib\site-packages\pandas\io\pickle.pyc in to_pickle(obj, path)
     13     """
     14     with open(path, 'wb') as f:
---> 15         pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
     16 
     17 

SystemError: error return without exception set

UPDATE:

Read this topic and this issue regarding your error.

Try saving your DF as HDF5 instead - it's much more convenient.

You may also want to read this comparison ...
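A round-trip through HDF5 might look like this (a sketch; the filename and key are illustrative, and `to_hdf` requires the optional PyTables dependency, `pip install tables`):

```python
import numpy as np
import pandas as pd

# A small stand-in for the binarized preference matrix
df_preference = pd.DataFrame(
    np.random.randint(0, 2, size=(100, 3)).astype(np.int8),
    columns=list('abc'),
)

# Write to HDF5 and read it back (filename/key are illustrative)
df_preference.to_hdf('df_preference.h5', key='df_preference', mode='w')
restored = pd.read_hdf('df_preference.h5', 'df_preference')
```

Unlike pickle, HDF5 also supports compression and partial reads via the `where` argument, which matters at the 55K x 8.5K scale.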

OLD answer:

try this:

In [110]: (df>0).astype(np.int8)
Out[110]:
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  0
3  1  0  1
4  1  0  1

.applymap() is one of the slowest methods, because it visits every cell individually (essentially a nested Python-level loop).

df>0 is a vectorized operation, so it runs much faster.

.apply() is faster than .applymap() because it works column by column, but it is still much slower than df>0.

UPDATE2: a time comparison on a smaller DF (1000 x 1000), since applymap() would take ages on a (55K x 9K) DF:

In [5]: df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 1000)))

In [6]: %timeit df.applymap(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 3.75 s per loop

In [7]: %timeit df.apply(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 256 ms per loop

In [8]: %timeit (df>0).astype(np.int8)
100 loops, best of 3: 2.95 ms per loop
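All three approaches produce the same binary matrix, which a quick sanity check can confirm (a sketch on a small random frame; the column-wise np.where mirrors the .apply() timing above):

```python
import numpy as np
import pandas as pd

# Small count matrix standing in for the 55K x 8.5K one
df = pd.DataFrame(np.random.randint(0, 10, size=(50, 50)))

# Column-wise np.where, as in the .apply() timing
via_apply = df.apply(lambda col: np.where(col > 0, 1, 0))

# Vectorized boolean mask cast to a compact integer dtype
via_mask = (df > 0).astype(np.int8)
```

The int8 cast also shrinks memory eight-fold versus the default int64, which is significant for a matrix this large.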

You could use a scipy sparse matrix. Calculations would then touch only the entries that are actually present, instead of operating on all the zeros.
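Binarizing in sparse form might look like this (a sketch; since CSR stores only nonzero entries, overwriting the stored values with ones is equivalent to the x > 0 test):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Mostly-zero count matrix standing in for the real data
df = pd.DataFrame(np.random.randint(0, 3, size=(100, 50)))

# CSR stores only the nonzero counts
mat = sparse.csr_matrix(df.values)

# Binarize: every stored (nonzero) count becomes 1
mat.data = np.ones_like(mat.data, dtype=np.int8)
```

For a 55K x 8.5K matrix that is mostly zeros, the sparse representation also cuts memory use dramatically.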
