
Pandas Pivot MultiIndex efficiently

I'm working with about 300 MB of financial data in pandas that represents the limit orders in an auction. The data is multi-dimensional and looks like this:

                                bid                                                                                                                                                                                                                                                                                                                                                                                       ask                                                                                                                                                                                                                                                                                                                                                                           
                                  0                  1                  2                  3                  4                  5                  6                  7                  8                  9                 10                 11                 12                 13                 14                 15                 16                 17                 18               19                  0                  1                  2                  3                  4                  5                  6                  7                  8                  9                 10                 11                 12                 13                 14                 15               16               17               18               19          
                              price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity  price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity  price  quantity  price  quantity  price  quantity  price  quantity
2014-05-13 08:47:16.180000  102.298   1000000  102.297   1500000  102.296   6500000  102.295   8000000  102.294   3000000  102.293  24300000  102.292   6000000  102.291   1000000  102.290   1000000  102.289   2500000  102.288  11000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000      NaN       NaN    NaN       NaN  102.302   2000000  102.303   6100000  102.304  14700000  102.305   3500000  102.307   9800000  102.308  15500000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000      NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.003000  102.298   1000000  102.297   2500000  102.296   6500000  102.295   7000000  102.294   3000000  102.293  24300000  102.292   6000000  102.291   1000000  102.290   1000000  102.289   2500000  102.288  11000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000      NaN       NaN    NaN       NaN  102.302   2000000  102.303   5100000  102.304  14700000  102.305   4500000  102.307   9800000  102.308  15500000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000      NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.005000  102.298   3000000  102.297   3500000  102.296   6000000  102.295   9300000  102.294   4000000  102.293  17500000  102.292   2000000  102.291   4000000  102.290   1000000  102.289   2500000  102.288   6000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000      NaN       NaN    NaN       NaN  102.302   2000000  102.303   5100000  102.304  14700000  102.305   4500000  102.307   9000000  102.308  16300000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000      NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.006000  102.299   1000000  102.298   3000000  102.297   6500000  102.296   5000000  102.295   5300000  102.294   4000000  102.293  15500000  102.292   2000000  102.291   4000000  102.290   1000000  102.289   2500000  102.288   6000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000    NaN       NaN  102.302   2000000  102.303   5100000  102.304  11700000  102.305   7500000  102.307   9000000  102.308  11300000  102.309   5000000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.007000  102.299   1000000  102.298   3000000  102.297   8500000  102.296   4000000  102.295   4300000  102.294   5000000  102.293  14500000  102.292   2000000  102.291   4000000  102.290   1000000  102.289   2500000  102.288   6000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000    NaN       NaN  102.302   2000000  102.303   4100000  102.304  13700000  102.305   7500000  102.307   8000000  102.308  12300000  102.309   5000000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN

(Note that the first column level changes from bid to ask after position 19. Sorry about the width of the table.)

There are a number of pivot operations I need to do to work with the data. For example, instead of indexing by 0, 1, 2, 3, ... (the relative position of an order in the queue), I want to index by 102.297, 102.296, ..., i.e. the price of the order. Here's an example of such an operation:

x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])

yielding:

                                         quantity
                           side price            
2014-05-13 08:47:17.003000 ask  102.300         0
                                102.301         0
                                102.302         0
                                102.303  -1000000
                                102.304         0
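To make the intent of that chain concrete, here is a minimal sketch on a cut-down book: two levels per side and three of the timestamps from the sample data. It builds a long table with NumPy and then pivots it, which is a substitution for illustration rather than the stack/unstack chain itself, and the names `prices`, `qty`, `long` and `book` are made up for this example. Because only two levels are kept, the changes differ from those of the full 20-level book.

```python
import numpy as np
import pandas as pd

# Toy book: same column layout as the real data, but only levels 0 and 1.
cols = pd.MultiIndex.from_product(
    [["bid", "ask"], [0, 1], ["price", "quantity"]],
    names=["side", "level", "value"],
)
idx = pd.to_datetime([
    "2014-05-13 08:47:16.180",
    "2014-05-13 08:47:17.003",
    "2014-05-13 08:47:17.006",
])
data = [
    [102.298, 1e6, 102.297, 1.5e6, 102.302, 2e6, 102.303, 6.1e6],
    [102.298, 1e6, 102.297, 2.5e6, 102.302, 2e6, 102.303, 5.1e6],
    [102.299, 1e6, 102.298, 3.0e6, 102.302, 2e6, 102.303, 5.1e6],
]
df = pd.DataFrame(data, index=idx, columns=cols)

# Split into two frames whose columns are (side, level) in the same order.
prices = df.xs("price", axis=1, level="value")
qty = df.xs("quantity", axis=1, level="value")

# One row per (time, side, queue position); the position itself is discarded.
n_slots = prices.shape[1]
long = pd.DataFrame({
    "time": df.index.repeat(n_slots),
    "side": np.tile(prices.columns.get_level_values("side").to_numpy(), len(df)),
    "price": prices.to_numpy().ravel(),
    "quantity": qty.to_numpy().ravel(),
}).dropna()

# Pivot so that each (side, price) pair is a column; a missing price means 0.
book = (
    long.set_index(["time", "side", "price"])["quantity"]
    .unstack(["side", "price"])
    .fillna(0)
    .sort_index(axis=1)
)

# Change in resting quantity at each price between consecutive snapshots.
changes = book.diff().iloc[1:]
print(changes.stack(["side", "price"]) if hasattr(changes, "stack") else changes)
```

The -1000000 at ask 102.303 for 08:47:17.003 is the same entry as in the output shown above.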

This can be achieved with a combination of stack/unstack/reset_index, but it appears to be really inefficient. I haven't looked at the code, but I'm guessing a copy of the table is made on each stack/unstack, which causes my 8 GB system to run out of memory and start hitting the page file. I don't think I can use pivot in this case either, because the required columns are in a MultiIndex.

Any suggestions as to how I can speed this up?

Here is an example input CSV file, as requested in a comment:

side,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask
level,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19
value,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.296,6500000.0,102.295,8000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,6100000.0,102.304,1.47E7,102.305,3500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.296,6500000.0,102.295,7000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.005,102.298,3000000.0,102.297,3500000.0,102.296,6000000.0,102.295,9300000.0,102.294,4000000.0,102.293,1.75E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9000000.0,102.308,1.63E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.006,102.299,1000000.0,102.298,3000000.0,102.297,6500000.0,102.296,5000000.0,102.295,5300000.0,102.294,4000000.0,102.293,1.55E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.17E7,102.305,7500000.0,102.307,9000000.0,102.308,1.13E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.007,102.299,1000000.0,102.298,3000000.0,102.297,8500000.0,102.296,4000000.0,102.295,4300000.0,102.294,5000000.0,102.293,1.45E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,4100000.0,102.304,1.37E7,102.305,7500000.0,102.307,8000000.0,102.308,1.23E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
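For anyone who wants to load a file in this layout, a minimal sketch follows. It assumes the three header rows are side, level and value and that the first column holds the timestamps. The inline text is a cut-down stand-in for the file above (two levels per side, with one N/A pair placed by hand for illustration), and `csv_text` is a made-up name.

```python
import io
import pandas as pd

# Cut-down stand-in for the sample file: three header rows, then data rows.
csv_text = """side,bid,bid,bid,bid,ask,ask,ask,ask
level,0,0,1,1,0,0,1,1
value,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.302,2000000.0,102.303,6100000.0
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.302,2000000.0,N/A,N/A
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    header=[0, 1, 2],   # three header rows become a three-level column MultiIndex
    index_col=0,        # first column is the timestamp index
    parse_dates=True,   # parse the index as datetimes
)

# Name the column levels explicitly so later code can refer to them by name.
df.columns = df.columns.set_names(["side", "level", "value"])

print(df.shape)
print(df.isna().sum().sum())  # "N/A" is one of read_csv's default NA strings
```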

unstack essentially enumerates the full index × columns product, so it can allocate a huge amount of memory when you have many rows and columns.
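A rough sketch of that blow-up, using made-up data rather than the order book above: each of 1000 timestamps quotes 20 prices, but the price window drifts by one tick per timestamp, so the wide table needs a column for every price ever seen.

```python
import numpy as np
import pandas as pd

n_rows, n_prices = 1000, 20

# Long form: one entry per (time, price); the price window drifts over time.
times = np.repeat(np.arange(n_rows), n_prices)
prices = times + np.tile(np.arange(n_prices), n_rows)
long = pd.Series(
    1.0,
    index=pd.MultiIndex.from_arrays([times, prices], names=["time", "price"]),
)

# Wide form: one column per distinct price, NaN wherever a price was not quoted.
wide = long.unstack("price")

print(long.size)                # number of cells actually holding data
print(wide.shape, wide.size)    # cells allocated after unstack
print(wide.notna().sum().sum()) # how many of those are filled
```

Here the long form holds 20,000 values while the wide form allocates 1000 × 1019 cells, so only about 2% of the unstacked table is real data.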

Here is a solution that is slower but should have a much lower peak memory usage (I think). It also produces a slightly smaller result, because some zero entries in the original output do not appear here (you could always reindex and fill to restore them).

Define this function. It could probably be optimized for this case, since the data is already grouped on level.

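In [79]: def f(x):
   ....:     # x: the columns belonging to a single 'level'
   ....:     try:
   ....:         y = x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
   ....:         # keep only the non-zero quantity changes
   ....:         return y[y!=0].dropna()
   ....:     except Exception:
   ....:         # concat drops None entries, so a group that fails is skipped
   ....:         return None
   ....:     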

Group the columns by 'level' and call f on each group. Don't use apply directly; instead, concat the results as rows (this is the 'unstacking' part).
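On recent pandas versions, groupby with axis=1 is deprecated and sortlevel has been replaced by sort_index, so the column grouping may need to be written differently. A minimal sketch of one way to iterate over the column groups, assuming the column levels are named side, level and value; `column_groups` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# One-row toy frame with the same three column levels as the real data.
cols = pd.MultiIndex.from_product(
    [["bid", "ask"], [0, 1], ["price", "quantity"]],
    names=["side", "level", "value"],
)
df = pd.DataFrame(
    [[102.298, 1e6, 102.297, 1.5e6, 102.302, 2e6, 102.303, 6.1e6]],
    index=pd.to_datetime(["2014-05-13 08:47:16.180"]),
    columns=cols,
)

def column_groups(frame, level):
    """Yield (key, sub-frame) for each distinct value of a column level."""
    for key in frame.columns.get_level_values(level).unique():
        # drop_level=False keeps all three column levels in the sub-frame
        yield key, frame.xs(key, axis=1, level=level, drop_level=False)

pieces = {key: grp for key, grp in column_groups(df, "level")}
for key, grp in pieces.items():
    print(key, list(grp.columns))
```

Each sub-frame can then be passed to f in place of the groups produced by groupby.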

However, this creates duplicates (on the price level), so they need to be aggregated.

In [76]: concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
Out[76]: 
value                                 quantity
                        side price            
2014-05-13 08:47:17.003 ask  102.303  -1000000
                             102.305   1000000
                        bid  102.295  -1000000
                             102.297   1000000
2014-05-13 08:47:17.005 ask  102.307   -800000
                             102.308    800000
                        bid  102.288  -5000000
                             102.291   3000000
                             102.292  -4000000
                             102.293  -6800000
                             102.294   1000000
                             102.295   2300000
                             102.296   -500000
                             102.297   1000000
                             102.298   2000000
2014-05-13 08:47:17.006 ask  102.304  -3000000
                             102.305   3000000
                             102.308  -5000000
                             102.309   5000000
                             102.310         0
                             102.312         0
                             102.313         0
                             102.315         0
                             102.316         0
                             102.320         0
                             102.321         0
                             102.324         0
                             102.325         0
                        bid  102.265  -9500000
                             102.275         0
                             102.276         0
                             102.280         0
                             102.284         0
                             102.286         0
                             102.287         0
                             102.288         0
                             102.289         0
                             102.290         0
                             102.291         0
                             102.292         0
                             102.293  -2000000
                             102.294         0
                             102.295  -4000000
                             102.296  -1000000
                             102.297   3000000
                             102.298         0
                             102.299   1000000
2014-05-13 08:47:17.007 ask  102.303  -1000000
                             102.304   2000000
                             102.307  -1000000
                             102.308   1000000
                        bid  102.293  -1000000
                             102.294   1000000
                             102.295  -1000000
                             102.296  -1000000
                             102.297   2000000
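The explicit zeros in that output (for example bid 102.298 at 08:47:17.006) appear to come from the aggregation step. f drops zeros within each level, but when an order moves from one queue position to another, one level reports the quantity leaving and another reports it arriving, and the two cancel in the sum. A small sketch using the level 0 and level 1 bid changes at that timestamp, typed in by hand from the sample data; `piece` is a made-up helper. Only two levels are included, so the 102.297 value is incomplete.

```python
import pandas as pd

ts = pd.Timestamp("2014-05-13 08:47:17.006")
names = ["time", "side", "price"]

def piece(rows):
    """Build one per-level result: quantity changes indexed by (time, side, price)."""
    index = pd.MultiIndex.from_tuples([(ts, "bid", p) for p, _ in rows], names=names)
    return pd.Series([q for _, q in rows], index=index, name="quantity")

# Level 0 went from 102.298 x 3,000,000 to 102.299 x 1,000,000.
level0 = piece([(102.298, -3000000.0), (102.299, 1000000.0)])
# Level 1 went from 102.297 x 3,500,000 to 102.298 x 3,000,000.
level1 = piece([(102.297, -3500000.0), (102.298, 3000000.0)])

# Concatenating the pieces leaves 102.298 in the index twice.
stacked = pd.concat([level0, level1])

# Summing over the full key merges the duplicates; 102.298 nets out to zero.
combined = stacked.groupby(level=[0, 1, 2]).sum()

# Optional: drop the cancelled entries.
nonzero = combined[combined != 0]
print(combined)
print(nonzero)
```

Filtering with `combined[combined != 0]` after the sum would remove those rows if they are not wanted.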

Timings (I think that optimizing f will make this quite a bit faster)

In [77]: %timeit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
1 loops, best of 3: 319 ms per loop

In [78]: %memit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
maximum of 1: 67.515625 MB per loop

Original method

In [7]: %timeit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
10 loops, best of 3: 56.4 ms per loop

In [8]: %memit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
maximum of 1: 61.187500 MB per loop
