I have two data frames that look as follows:
import datetime as dt
import numpy as np
import pandas as pd

dataOB = pd.DataFrame({'Time': [
    dt.datetime(2013,4,17,9,0,1),
    dt.datetime(2013,4,17,9,0,1),
    dt.datetime(2013,4,17,9,0,2),
    dt.datetime(2013,4,17,9,0,2),
    dt.datetime(2013,4,17,9,0,2),
    dt.datetime(2013,4,17,9,0,2),
    dt.datetime(2013,4,17,9,0,3),
    dt.datetime(2013,4,17,9,0,3)],
    'hsec': [2,54,0,42,60,89,0,10], 'val': [4,5,5,3,2,4,4,7]})
and
dfEq = pd.DataFrame({'Time': [
    dt.datetime(2013,4,17,9,0,1),
    dt.datetime(2013,4,17,9,0,1),
    dt.datetime(2013,4,17,9,0,1),
    dt.datetime(2013,4,17,9,0,2),
    dt.datetime(2013,4,17,9,0,2),
    dt.datetime(2013,4,17,9,0,3),
    dt.datetime(2013,4,17,9,0,3),
    dt.datetime(2013,4,17,9,0,3),
    dt.datetime(2013,4,17,9,0,3)],
    'price': [4,4,5,3,3,4,5,4,5],
    'flag': ['K','V','V','V','K','K','V','K','V']})
I need to assign to each row in dfEq a value that depends on whether or not the price in that row is present in the values of 'val' in dataOB at the same timestamp.
My first solution, shown below, gives the desired result. (The catch follows further down.)
dataOB.set_index('Time', inplace=True)
dfEq.set_index('Time', inplace=True)
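# object dtype, so that string labels can be assigned to this column later
dfEq['type'] = np.zeros(len(dfEq.index), dtype=object)
# .ix has been removed from pandas; .loc with a list of labels always returns a Series
tmpOB = pd.DataFrame([dataOB.loc[[trTime],'val'].tolist() for trTime in dfEq.index],
                     index = dfEq.index)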
>>> tmpOB
                     0  1    2    3
Time
2013-04-17 09:00:01  4  5  NaN  NaN
2013-04-17 09:00:01  4  5  NaN  NaN
2013-04-17 09:00:01  4  5  NaN  NaN
2013-04-17 09:00:02  5  3    2    4
2013-04-17 09:00:02  5  3    2    4
2013-04-17 09:00:03  4  7  NaN  NaN
2013-04-17 09:00:03  4  7  NaN  NaN
2013-04-17 09:00:03  4  7  NaN  NaN
2013-04-17 09:00:03  4  7  NaN  NaN
[9 rows x 4 columns]
# use .loc instead of chained indexing, which may assign to a temporary copy
matched = tmpOB.eq(dfEq.price,axis=0).any(axis=1)
dfEq.loc[matched & (dfEq.flag=='K'), 'type'] = 'MBO'
dfEq.loc[matched & (dfEq.flag=='V'), 'type'] = 'LSO'
>>> dfEq
                     price flag type
Time
2013-04-17 09:00:01      4    K  MBO
2013-04-17 09:00:01      4    V  LSO
2013-04-17 09:00:01      5    V  LSO
2013-04-17 09:00:02      3    V  LSO
2013-04-17 09:00:02      3    K  MBO
2013-04-17 09:00:03      4    K  MBO
2013-04-17 09:00:03      5    V    0
2013-04-17 09:00:03      4    K  MBO
2013-04-17 09:00:03      5    V    0
[9 rows x 3 columns]
The problem is that I have many such data frames, all of them rather large, so building tmpOB with a list comprehension is not feasible in terms of either memory or computation time.
So my question is: is there a way to achieve the same result without a list comprehension or a loop? Maybe there is a more direct way to compare the price in each row with the contemporaneous elements of 'val'?
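One possible direct comparison, sketched here as an assumption rather than as the accepted approach, is to treat each (Time, price) pair as a single key and test it for membership among the (Time, val) pairs with MultiIndex.isin(). The names obKeys, eqKeys and inBook are made up for this illustration, the frames are reduced copies of the ones above ('hsec' is left out because it is not needed), and the string '0' stands in for the unmatched rows.

```python
import datetime as dt
import pandas as pd

t1, t2, t3 = (dt.datetime(2013, 4, 17, 9, 0, s) for s in (1, 2, 3))

dataOB = pd.DataFrame({'Time': [t1, t1, t2, t2, t2, t2, t3, t3],
                       'val': [4, 5, 5, 3, 2, 4, 4, 7]})
dfEq = pd.DataFrame({'Time': [t1, t1, t1, t2, t2, t3, t3, t3, t3],
                     'price': [4, 4, 5, 3, 3, 4, 5, 4, 5],
                     'flag': ['K', 'V', 'V', 'V', 'K', 'K', 'V', 'K', 'V']})

# Treat each (Time, value) combination as a single key (hypothetical names).
obKeys = pd.MultiIndex.from_arrays([dataOB['Time'], dataOB['val']])
eqKeys = pd.MultiIndex.from_arrays([dfEq['Time'], dfEq['price']])

# True where the row's price occurs among the 'val' entries with the same Time.
inBook = eqKeys.isin(obKeys)

dfEq['type'] = '0'  # placeholder for rows without a match
dfEq.loc[inBook & (dfEq['flag'] == 'K').to_numpy(), 'type'] = 'MBO'
dfEq.loc[inBook & (dfEq['flag'] == 'V').to_numpy(), 'type'] = 'LSO'
```

No wide intermediate frame is built here, so memory use should grow with the number of rows rather than with rows times the largest group size.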
(I also tried pd.merge(), applied before setting the index in either data frame:
mergedDf = pd.merge(dfEq,dataOB,on='Time')
mergedDf['type'] = np.zeros(len(mergedDf.index), dtype=object)
# use .loc instead of chained indexing, which may assign to a temporary copy
samePrice = mergedDf.price==mergedDf.val
mergedDf.loc[samePrice & (mergedDf.flag=='K'), 'type'] = 'MBO'
mergedDf.loc[samePrice & (mergedDf.flag=='V'), 'type'] = 'LSO'
But then I don't know how to get rid of the redundant rows that the merge creates.)
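A merge does not have to create extra rows. The sketch below is one way this might be done, not necessarily the best one: it merges on both the timestamp and the value against a de-duplicated list of (Time, val) pairs, so every row of dfEq finds at most one partner, and indicator=True records whether a partner was found. The names pairs, merged, found and result are invented for this example, and '0' again marks unmatched rows.

```python
import datetime as dt
import numpy as np
import pandas as pd

t1, t2, t3 = (dt.datetime(2013, 4, 17, 9, 0, s) for s in (1, 2, 3))

dataOB = pd.DataFrame({'Time': [t1, t1, t2, t2, t2, t2, t3, t3],
                       'hsec': [2, 54, 0, 42, 60, 89, 0, 10],
                       'val': [4, 5, 5, 3, 2, 4, 4, 7]})
dfEq = pd.DataFrame({'Time': [t1, t1, t1, t2, t2, t3, t3, t3, t3],
                     'price': [4, 4, 5, 3, 3, 4, 5, 4, 5],
                     'flag': ['K', 'V', 'V', 'V', 'K', 'K', 'V', 'K', 'V']})

# One row per distinct (Time, val) pair, so the left merge cannot duplicate dfEq rows.
pairs = dataOB[['Time', 'val']].drop_duplicates()

# how='left' keeps every dfEq row in its original order;
# the '_merge' column says whether a matching pair existed.
merged = dfEq.merge(pairs, how='left',
                    left_on=['Time', 'price'], right_on=['Time', 'val'],
                    indicator=True)
found = (merged['_merge'] == 'both').to_numpy()

isK = (dfEq['flag'] == 'K').to_numpy()
isV = (dfEq['flag'] == 'V').to_numpy()

result = dfEq.copy()
result['type'] = np.select([found & isK, found & isV], ['MBO', 'LSO'], default='0')
```

Because the right-hand keys are unique, merged has exactly as many rows as dfEq, and nothing needs to be dropped afterwards.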
I discovered that I can use pandas' unstack() to create tmpOB without a loop, which makes the code much faster.
First, dataOB has to be indexed by a MultiIndex, which gives
                          val
Time                hsec
2013-04-17 09:00:01 0       4
                    1       5
2013-04-17 09:00:02 0       5
                    1       3
                    2       2
                    3       4
2013-04-17 09:00:03 0       4
                    1       7
(Getting the 'hsec' level of the index into this form requires some manipulation; see the question "pandas - change values of second level index to display position within first level index".)
Then, tmpOB is obtained by
dataOB.unstack('hsec')
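The whole route can be sketched end to end as follows. This is only an illustration under assumptions: the within-timestamp position is computed with groupby().cumcount(), which may differ from the manipulation in the linked question, and the names pos, wide and matched are invented here. The final comparison uses plain NumPy broadcasting instead of DataFrame.eq.

```python
import datetime as dt
import pandas as pd

t1, t2, t3 = (dt.datetime(2013, 4, 17, 9, 0, s) for s in (1, 2, 3))

dataOB = pd.DataFrame({'Time': [t1, t1, t2, t2, t2, t2, t3, t3],
                       'hsec': [2, 54, 0, 42, 60, 89, 0, 10],
                       'val': [4, 5, 5, 3, 2, 4, 4, 7]})
dfEq = pd.DataFrame({'Time': [t1, t1, t1, t2, t2, t3, t3, t3, t3],
                     'price': [4, 4, 5, 3, 3, 4, 5, 4, 5],
                     'flag': ['K', 'V', 'V', 'V', 'K', 'K', 'V', 'K', 'V']})

# Position of each row within its timestamp (0, 1, ...), used in place of the raw 'hsec'.
pos = dataOB.groupby('Time').cumcount()

# One row per timestamp, one column per position; shorter groups are padded with NaN.
wide = dataOB.set_index(['Time', pos])['val'].unstack()

# Repeat the wide rows so that they line up one-to-one with the rows of dfEq.
tmpOB = wide.reindex(index=dfEq['Time'])

# Compare each price with every value in its row; NaN never compares equal,
# so the padding cannot produce a false match.
matched = (tmpOB.to_numpy() == dfEq['price'].to_numpy()[:, None]).any(axis=1)
```

The array matched can then be combined with the 'flag' column to assign 'MBO' and 'LSO' as before. Note that tmpOB still has one column per position in the largest group, so this removes the loop but not the wide intermediate frame.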