I have a matrix of count data (55K x 8.5K). Most of the values are zeros, but a few of them can be any positive count. Let's say something like this:
a b c
0 4 3 3
1 1 2 1
2 2 1 0
3 2 0 1
4 2 0 4
I want to binarize the cell values.
I did the following:
df_preference=df_recommender.applymap(lambda x: np.where(x >0, 1, 0))
The code works fine, but it takes a lot of time to run.
Why is that?
Is there a faster way?
Thanks
Edit:
Error when doing df.to_pickle
df_preference.to_pickle('df_preference.pickle')
I get this:
---------------------------------------------------------------------------
SystemError Traceback (most recent call last)
<ipython-input-16-3fa90d19520a> in <module>()
1 # Pickling the data to the disk
2
----> 3 df_preference.to_pickle('df_preference.pickle')
\\dwdfhome01\Anaconda\lib\site-packages\pandas\core\generic.pyc in to_pickle(self, path)
1170 """
1171 from pandas.io.pickle import to_pickle
-> 1172 return to_pickle(self, path)
1173
1174 def to_clipboard(self, excel=None, sep=None, **kwargs):
\\dwdfhome01\Anaconda\lib\site-packages\pandas\io\pickle.pyc in to_pickle(obj, path)
13 """
14 with open(path, 'wb') as f:
---> 15 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
16
17
SystemError: error return without exception set
UPDATE:
Read this topic and this issue regarding your error.
Try to save your DF as HDF5 - it's much more convenient.
You may also want to read this comparison ...
OLD answer:
try this:
In [110]: (df>0).astype(np.int8)
Out[110]:
a b c
0 1 1 1
1 1 1 1
2 1 1 0
3 1 0 1
4 1 0 1
.applymap() is one of the slowest methods, because it calls the function on each cell (basically it performs nested loops inside).
df>0 works on vectorized data, so it is much faster.
.apply() will work faster than .applymap() as it works on whole columns, but it is still much slower than df>0.
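As a minimal sketch of the vectorized approach, the snippet below rebuilds the sample frame from the question (the name `df` and the literal values are just that example) and checks the result against a plain cell-by-cell Python loop, which is roughly what a per-cell method has to do internally:

```python
import numpy as np
import pandas as pd

# Sample data from the question (illustrative only).
df = pd.DataFrame({
    "a": [4, 1, 2, 2, 2],
    "b": [3, 2, 1, 0, 0],
    "c": [3, 1, 0, 1, 4],
})

# Vectorized: one boolean comparison over the whole frame, then a cast.
binary = (df > 0).astype(np.int8)

# Cell-by-cell reference built with plain Python loops, for comparison only.
expected = [[1 if v > 0 else 0 for v in row] for row in df.to_numpy().tolist()]

print(binary)
print(binary.to_numpy().tolist() == expected)
```

Casting to `np.int8` rather than the default integer type also keeps the result small, which matters for a frame of this size.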
UPDATE2: time comparison on a smaller DF (1000 x 1000), as applymap() will take ages on a (55K x 9K) DF:
In [5]: df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 1000)))
In [6]: %timeit df.applymap(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 3.75 s per loop
In [7]: %timeit df.apply(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 256 ms per loop
In [8]: %timeit (df>0).astype(np.int8)
100 loops, best of 3: 2.95 ms per loop
You could use a scipy sparse matrix. That way the calculations only touch the data that is actually there, instead of operating on all the zeros.
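A rough sketch of that idea, assuming the counts fit in memory as a dense array at least once so they can be converted (the variable names and the sample values are illustrative, taken from the question's example):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Sample data from the question (illustrative only).
df = pd.DataFrame({
    "a": [4, 1, 2, 2, 2],
    "b": [3, 2, 1, 0, 0],
    "c": [3, 1, 0, 1, 4],
})

# CSR format stores only the non-zero entries of the matrix.
counts = sparse.csr_matrix(df.to_numpy())

# The comparison is evaluated on the stored entries only; the zeros stay implicit.
binary_sparse = (counts > 0).astype(np.int8)

print(binary_sparse.nnz)        # number of stored (non-zero) entries
print(binary_sparse.toarray())  # dense view, only sensible for small matrices
```

Whether this pays off depends on how sparse the real data is and on whether the later steps can work with a sparse matrix instead of a DataFrame.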