
Compare column values of DataFrames with non-unique indices of different lengths

I have two data frames that look as follows:

import datetime as dt
import numpy as np
import pandas as pd

dataOB = pd.DataFrame({'Time':
                 [dt.datetime(2013,4,17,9,0,1),
                  dt.datetime(2013,4,17,9,0,1),
                  dt.datetime(2013,4,17,9,0,2),
                  dt.datetime(2013,4,17,9,0,2),
                  dt.datetime(2013,4,17,9,0,2),
                  dt.datetime(2013,4,17,9,0,2),
                  dt.datetime(2013,4,17,9,0,3),
                  dt.datetime(2013,4,17,9,0,3)],
                 'hsec': [2,54,0,42,60,89,0,10], 'val': [4,5,5,3,2,4,4,7]})

and

dfEq = pd.DataFrame({'Time': [dt.datetime(2013,4,17,9,0,1),
                          dt.datetime(2013,4,17,9,0,1),
                          dt.datetime(2013,4,17,9,0,1),
                          dt.datetime(2013,4,17,9,0,2),
                          dt.datetime(2013,4,17,9,0,2),
                          dt.datetime(2013,4,17,9,0,3),
                          dt.datetime(2013,4,17,9,0,3),
                          dt.datetime(2013,4,17,9,0,3),
                          dt.datetime(2013,4,17,9,0,3)],
                 'price': [4,4,5,3,3,4,5,4,5],
                 'flag': ['K','V','V','V','K','K','V','K','V']})

I need to assign to each row in dfEq a value that depends on whether or not the price in that row is present in the values of 'val' in dataOB at the same timestamp.

My first solution, shown below, gives the desired result, but it has a drawback that I describe afterwards.

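dataOB.set_index('Time', inplace=True)
dfEq.set_index('Time', inplace=True)

# Object dtype so that string labels can be written into the column later.
dfEq['type'] = np.zeros(len(dfEq.index), dtype=object)

# .ix has been removed from pandas; .loc with a list key always returns a Series.
tmpOB = pd.DataFrame([dataOB.loc[[trTime], 'val'].tolist() for trTime in dfEq.index],
                     index=dfEq.index)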
>>> tmpOB
                     0  1   2   3
Time                             
2013-04-17 09:00:01  4  5 NaN NaN
2013-04-17 09:00:01  4  5 NaN NaN
2013-04-17 09:00:01  4  5 NaN NaN
2013-04-17 09:00:02  5  3   2   4
2013-04-17 09:00:02  5  3   2   4
2013-04-17 09:00:03  4  7 NaN NaN
2013-04-17 09:00:03  4  7 NaN NaN
2013-04-17 09:00:03  4  7 NaN NaN
2013-04-17 09:00:03  4  7 NaN NaN

[9 rows x 4 columns]

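# True where the row's price occurs among the 'val' entries at the same timestamp.
matched = tmpOB.eq(dfEq.price, axis=0).any(axis=1)
# A single .loc assignment avoids chained indexing.
dfEq.loc[matched & (dfEq.flag == 'K'), 'type'] = 'MBO'
dfEq.loc[matched & (dfEq.flag == 'V'), 'type'] = 'LSO'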

>>> dfEq
                     price  flag type
Time                                 
2013-04-17 09:00:01      4     K  MBO
2013-04-17 09:00:01      4     V  LSO
2013-04-17 09:00:01      5     V  LSO
2013-04-17 09:00:02      3     V  LSO
2013-04-17 09:00:02      3     K  MBO
2013-04-17 09:00:03      4     K  MBO
2013-04-17 09:00:03      5     V    0
2013-04-17 09:00:03      4     K  MBO
2013-04-17 09:00:03      5     V    0

[9 rows x 3 columns]

The problem is that I have many such data frames, all of them rather large, so building tmpOB with a list comprehension is not feasible in terms of either memory or computation time.

So my question is: is there a way to achieve the same result without a list comprehension or a loop? Perhaps there is a more direct way to compare the price in each row with the contemporaneous elements of 'val'?

(I also tried pd.merge(), applied before setting the index in both data frames:

mergedDf = pd.merge(dfEq, dataOB, on='Time')

mergedDf['type'] = np.zeros(len(mergedDf.index), dtype=object)

matched = mergedDf.price == mergedDf.val
mergedDf.loc[matched & (mergedDf.flag == 'K'), 'type'] = 'MBO'
mergedDf.loc[matched & (mergedDf.flag == 'V'), 'type'] = 'LSO'

But the merge produces one row for every pairing of a dfEq row with a dataOB row at the same timestamp, and I do not know how to get rid of the extra rows afterwards.)
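One possible way to keep the merge idea without creating extra rows is to merge on both 'Time' and the value, rather than on 'Time' alone. The sketch below is not from the original post; it assumes that only the presence of the price matters (not how often it occurs), and the names ob_keys, matched and result are illustrative.

```python
import datetime as dt
import pandas as pd

dataOB = pd.DataFrame({
    'Time': [dt.datetime(2013, 4, 17, 9, 0, s) for s in (1, 1, 2, 2, 2, 2, 3, 3)],
    'hsec': [2, 54, 0, 42, 60, 89, 0, 10],
    'val': [4, 5, 5, 3, 2, 4, 4, 7],
})
dfEq = pd.DataFrame({
    'Time': [dt.datetime(2013, 4, 17, 9, 0, s) for s in (1, 1, 1, 2, 2, 3, 3, 3, 3)],
    'price': [4, 4, 5, 3, 3, 4, 5, 4, 5],
    'flag': ['K', 'V', 'V', 'V', 'K', 'K', 'V', 'K', 'V'],
})

# Distinct (Time, val) pairs, renamed so that the merge key is (Time, price).
ob_keys = (dataOB[['Time', 'val']]
           .drop_duplicates()
           .rename(columns={'val': 'price'}))

# Left merge against unique right-hand keys: one output row per dfEq row,
# in dfEq's order. The '_merge' column records whether a partner was found.
merged = dfEq.merge(ob_keys, on=['Time', 'price'], how='left', indicator=True)
matched = (merged['_merge'] == 'both').to_numpy()

# Label matched rows by flag; unmatched rows keep 0, as in the question.
labels = dfEq['flag'].map({'K': 'MBO', 'V': 'LSO'}).astype(object)
result = dfEq.copy()
result['type'] = labels.where(matched, 0)
print(result)
```

Because the right-hand side is deduplicated first, the merged frame never grows beyond len(dfEq), which should help with the memory concern.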

I discovered that I can use pandas' unstack() to create tmpOB without a loop, which makes the code much faster.

First, dataOB has to be indexed by a MultiIndex, giving

                          val
Time                hsec     
2013-04-17 09:00:01 0       4
                    1       5
2013-04-17 09:00:02 0       5
                    1       3  
                    2       2
                    3       4
2013-04-17 09:00:03 0       4
                    1       7

(Getting the 'hsec' level into this form requires some manipulation; see the question "pandas - change values of second level index to display position within first level index".)
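A minimal sketch of one way to do that manipulation, assuming groupby(...).cumcount() is an acceptable substitute for whatever the linked question uses (the post itself does not show this step):

```python
import datetime as dt
import pandas as pd

dataOB = pd.DataFrame({
    'Time': [dt.datetime(2013, 4, 17, 9, 0, s) for s in (1, 1, 2, 2, 2, 2, 3, 3)],
    'hsec': [2, 54, 0, 42, 60, 89, 0, 10],
    'val': [4, 5, 5, 3, 2, 4, 4, 7],
})

# Replace 'hsec' by the row's position within its timestamp: 0, 1, 2, ...
dataOB['hsec'] = dataOB.groupby('Time').cumcount()

# (Time, position) is now unique, so it can serve as a MultiIndex.
dataOB = dataOB.set_index(['Time', 'hsec'])
print(dataOB)
```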

Then, tmpOB is obtained by

dataOB.unstack('hsec') 
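Unstacking yields one row per unique timestamp, so one more step is needed to get one row per dfEq row before the prices can be compared. The following is a sketch under that reading, not the author's exact code: the reindex step and the NumPy comparison are assumptions, and dataOB is built directly in the multi-indexed form shown above.

```python
import datetime as dt
import pandas as pd

times = [dt.datetime(2013, 4, 17, 9, 0, s) for s in (1, 2, 3)]

# dataOB already indexed by (Time, position within Time), as in the table above.
index = pd.MultiIndex.from_arrays(
    [[times[0]] * 2 + [times[1]] * 4 + [times[2]] * 2,
     [0, 1, 0, 1, 2, 3, 0, 1]],
    names=['Time', 'hsec'],
)
dataOB = pd.DataFrame({'val': [4, 5, 5, 3, 2, 4, 4, 7]}, index=index)

dfEq = pd.DataFrame({
    'Time': [times[0]] * 3 + [times[1]] * 2 + [times[2]] * 4,
    'price': [4, 4, 5, 3, 3, 4, 5, 4, 5],
    'flag': ['K', 'V', 'V', 'V', 'K', 'K', 'V', 'K', 'V'],
})

# One row per unique timestamp, one column per position; shorter groups
# are padded with NaN.
wide = dataOB['val'].unstack('hsec')

# Repeat each timestamp's row once per dfEq row. wide has a unique index,
# so reindexing onto a target with duplicate labels is allowed.
tmpOB = wide.reindex(dfEq['Time'])

# Row-wise membership test on the raw arrays; NaN never compares equal.
matched = (tmpOB.to_numpy() == dfEq['price'].to_numpy()[:, None]).any(axis=1)
print(matched)
```

Comparing the underlying arrays sidesteps index alignment on the duplicated 'Time' labels; the resulting mask can then be combined with the flag conditions as in the question.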
