I'm working with ~300 MB of financial data in Pandas, corresponding to the limit orders in an auction. The data is multi-dimensional and looks like this:
bid ask
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity
2014-05-13 08:47:16.180000 102.298 1000000 102.297 1500000 102.296 6500000 102.295 8000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 6100000 102.304 14700000 102.305 3500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.003000 102.298 1000000 102.297 2500000 102.296 6500000 102.295 7000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.005000 102.298 3000000 102.297 3500000 102.296 6000000 102.295 9300000 102.294 4000000 102.293 17500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9000000 102.308 16300000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.006000 102.299 1000000 102.298 3000000 102.297 6500000 102.296 5000000 102.295 5300000 102.294 4000000 102.293 15500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 5100000 102.304 11700000 102.305 7500000 102.307 9000000 102.308 11300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.007000 102.299 1000000 102.298 3000000 102.297 8500000 102.296 4000000 102.295 4300000 102.294 5000000 102.293 14500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 4100000 102.304 13700000 102.305 7500000 102.307 8000000 102.308 12300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
(Note that the first column level changes from bid to ask after position 20. Sorry about the wide format of the table ...)
There are a number of pivot operations I need to do to work with the data. For example, instead of having 0, 1, 2, 3, ... (the relative position of an order in the queue) as the index, I want the price of the order (102.297, 102.296, ...). Here's an example of such an operation:
x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
yielding:
quantity
side price
2014-05-13 08:47:17.003000 ask 102.300 0
102.301 0
102.302 0
102.303 -1000000
102.304 0
This can be achieved by a combination of stack/unstack/reset_index, but it appears to be really inefficient. I haven't looked at the code, but I'm guessing a copy of the table is made on each stack/unstack, causing my 8 GB system to run out of memory and start hitting the page file. I don't think I can use pivot in this case either, because the required columns are in a MultiIndex. Any suggestions as to how I can speed this up?
Here is an example input CSV file, as requested in the comments:
side,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask
level,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19
value,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.296,6500000.0,102.295,8000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,6100000.0,102.304,1.47E7,102.305,3500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.296,6500000.0,102.295,7000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.005,102.298,3000000.0,102.297,3500000.0,102.296,6000000.0,102.295,9300000.0,102.294,4000000.0,102.293,1.75E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9000000.0,102.308,1.63E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.006,102.299,1000000.0,102.298,3000000.0,102.297,6500000.0,102.296,5000000.0,102.295,5300000.0,102.294,4000000.0,102.293,1.55E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.17E7,102.305,7500000.0,102.307,9000000.0,102.308,1.13E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.007,102.299,1000000.0,102.298,3000000.0,102.297,8500000.0,102.296,4000000.0,102.295,4300000.0,102.294,5000000.0,102.293,1.45E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,4100000.0,102.304,1.37E7,102.305,7500000.0,102.307,8000000.0,102.308,1.23E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
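Assuming the three header rows above (side, level, value), the CSV can be read back into a DataFrame with a three-level column MultiIndex via `header=[0,1,2]`. A minimal sketch on a trimmed-down version of the sample, with only two price levels per side instead of twenty:

```python
import io
import pandas as pd

# Trimmed-down version of the sample CSV; the real file has the same
# three header rows (side, level, value) but twenty levels per side.
csv = """side,bid,bid,bid,bid,ask,ask,ask,ask
level,0,0,1,1,0,0,1,1
value,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.302,2000000.0,102.303,6100000.0
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.302,2000000.0,102.303,5100000.0
"""

# header=[0,1,2] builds the three-level column MultiIndex; index_col=0 plus
# parse_dates=True turns the first column into a DatetimeIndex; na_values
# maps the N/A markers in the full file to NaN.
df = pd.read_csv(io.StringIO(csv), header=[0, 1, 2], index_col=0,
                 parse_dates=True, na_values="N/A")
```

Note that the level labels come back as strings (`'0'`, `'1'`, ...) unless you convert them afterwards.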
unstack essentially enumerates the full cross-product of index × columns, so it can allocate a huge amount of memory when you have a lot of columns and rows.
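A tiny illustration of that blow-up: four observed values spread over a sparse set of (row, column) pairs still force unstack to allocate the full cross-product.

```python
import pandas as pd

# 4 values on a sparse set of (row, column) label pairs
s = pd.Series([1, 2, 3, 4],
              index=pd.MultiIndex.from_tuples(
                  [('a', 'x'), ('a', 'y'), ('b', 'x'), ('c', 'z')]))

u = s.unstack()
# unstack materializes the full 3 x 3 grid: 9 cells for 4 values,
# with the 5 unobserved combinations filled with NaN
```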
Here is a solution that is slower, but should have a much lower peak memory usage (I think). It gives a slightly smaller total result, in that the original may have some zero entries that are not here (but you could always reindex and fill to fix that).
Define this function (it could probably be optimized for this case, since we are already grouping on level):
In [79]: def f(x):
   ....:     try:
   ....:         # pivot one level's columns to a price index, diff against the
   ....:         # previous row, and keep only the non-zero changes
   ....:         y = x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
   ....:         return y[y!=0].dropna()
   ....:     except Exception:
   ....:         return None
   ....:
Group by 'level' on the columns and apply f; don't use apply directly, just concat the results as rows (this is the 'unstacking' part). However, this creates duplicates (on the price level), so we need to aggregate them:
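The concat-then-aggregate step can be sketched in isolation. Here `a` and `b` stand in for two per-level results whose (time, price) rows overlap; the series and labels are made up for illustration:

```python
import pandas as pd

# two per-level results with an overlapping (time, price) row
a = pd.Series([1.0, 2.0],
              index=pd.MultiIndex.from_tuples(
                  [('t1', 102.303), ('t1', 102.305)], names=['time', 'price']))
b = pd.Series([3.0, 4.0],
              index=pd.MultiIndex.from_tuples(
                  [('t1', 102.303), ('t2', 102.305)], names=['time', 'price']))

# concat keeps the duplicate ('t1', 102.303) rows; the groupby-sum
# collapses them back into a single row per (time, price)
combined = pd.concat([a, b]).groupby(level=[0, 1]).sum()
```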
In [76]: concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
Out[76]:
value quantity
side price
2014-05-13 08:47:17.003 ask 102.303 -1000000
102.305 1000000
bid 102.295 -1000000
102.297 1000000
2014-05-13 08:47:17.005 ask 102.307 -800000
102.308 800000
bid 102.288 -5000000
102.291 3000000
102.292 -4000000
102.293 -6800000
102.294 1000000
102.295 2300000
102.296 -500000
102.297 1000000
102.298 2000000
2014-05-13 08:47:17.006 ask 102.304 -3000000
102.305 3000000
102.308 -5000000
102.309 5000000
102.310 0
102.312 0
102.313 0
102.315 0
102.316 0
102.320 0
102.321 0
102.324 0
102.325 0
bid 102.265 -9500000
102.275 0
102.276 0
102.280 0
102.284 0
102.286 0
102.287 0
102.288 0
102.289 0
102.290 0
102.291 0
102.292 0
102.293 -2000000
102.294 0
102.295 -4000000
102.296 -1000000
102.297 3000000
102.298 0
102.299 1000000
2014-05-13 08:47:17.007 ask 102.303 -1000000
102.304 2000000
102.307 -1000000
102.308 1000000
bid 102.293 -1000000
102.294 1000000
102.295 -1000000
102.296 -1000000
102.297 2000000
Timings (I think that optimizing f will make this quite a bit faster)
In [77]: %timeit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
1 loops, best of 3: 319 ms per loop
In [78]: %memit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
maximum of 1: 67.515625 MB per loop
Original method
In [7]: %timeit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
10 loops, best of 3: 56.4 ms per loop
In [8]: %memit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
maximum of 1: 61.187500 MB per loop