
Find closest preceding value in different columns of Pandas DataFrame

Given a specific column value, I am trying to find a matching value in the nearest preceding rows of two separate columns of a Pandas DataFrame, and then flag 1 in a new column if the value is found there, else 0.

The DataFrame index is not sorted.

Data:

df = pd.DataFrame({
  'datetime': [
      '2020-11-16 01:39:06.22021017', '2020-11-16 01:39:06.22021020', '2020-11-16 01:39:06.22021022',
      '2020-11-16 01:39:06.22021031', '2020-11-16 01:39:06.22021033', '2020-11-16 01:39:06.22021036'],
  'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
  'price': ['NaN', 7026.5, 7026.5, np.NaN, np.NaN, 7024.0], 
  'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5], 
  'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5]})

What I need:

When type == 'Trade', I need to look back through bid_price and ask_price and find the first (nearest preceding) value that matches the price column. In the same row as the trade, I want two separate columns indicating whether the price was found in the nearest bid_price or ask_price column.

Expected Output:

df = pd.DataFrame({
  'datetime': [
      '2020-11-16 01:39:06.22021017', '2020-11-16 01:39:06.22021020', '2020-11-16 01:39:06.22021022',
      '2020-11-16 01:39:06.22021033', '2020-11-16 01:39:06.22021034', '2020-11-16 01:39:06.22021033'],
  'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
  'price': ['NaN', 7026.5, 7026.5, np.NaN, np.NaN, 7024.0], 
  'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5], 
  'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5],
  'is_bid_trade': [0, 0, 0, 0, 0, 1],
  'is_ask_trade': [1, 1, 0, 0, 0, 0]})

You can see that the first trade matches the quote in the preceding row in the ask_price column. The final trade matches in the bid_price column, but the matching quote is two rows before the trade.

I have tried (and have been kindly helped by SO) but have yet to find a solution here.

The datetime column is sadly not 100% accurate, so it cannot be relied upon to sort chronologically. I have also attempted to find the minimum index using df.index.get_loc(), but am unsure how to apply this to search within two columns.

All help very gratefully received.

Here ya go. Note that in your input dataset I changed the string 'NaN' to np.nan to be consistent, and I think your output dataset had a misplaced 1: it's inconsistent as to whether the 1 should go where the trade occurred or on the preceding row. Nonetheless, I think this works the way you want with the data provided. See the comments in the code. If the 1s are supposed to be at the trade row, you can modify the indexing to get the right row.
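On the 'NaN' point, here is a quick check of why the string matters before getting to the full code. This is only a small illustration; the series names with_string and with_nan are made up for the example. A string 'NaN' turns the column into object dtype and counts as a non-null value, so any step that filters on notnull() would pick it up as if it were a price.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-columns: one with the string 'NaN', one with a real NaN
with_string = pd.Series(['NaN', 7026.5, np.nan])
with_nan = pd.Series([np.nan, 7026.5, np.nan])

print(with_string.dtype)               # object
print(with_string.notnull().tolist())  # [True, True, False]
print(with_nan.dtype)                  # float64
print(with_nan.notnull().tolist())     # [False, True, False]
```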

import numpy as np
import pandas as pd

df = pd.DataFrame({
  'datetime': [
      '2020-11-16 01:39:06.22021017', '2020-11-16 01:39:06.22021020', '2020-11-16 01:39:06.22021022',
      '2020-11-16 01:39:06.22021031', '2020-11-16 01:39:06.22021033', '2020-11-16 01:39:06.22021036'],
  'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
  'price': [np.nan, 7026.5, 7026.5, np.nan, np.nan, 7024.0],
  'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5],
  'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5]})
# you don't have to sort, but reset the index so labels equal row positions
df.reset_index(drop=True, inplace=True)

# collect the indices where Trade occurred
trade_indices = df.loc[df['type'] == 'Trade'].index.tolist()
# collect corresponding trade prices (assumes only Trade rows carry a price)
prices = df['price'].loc[df['price'].notnull()].tolist()
# pair each trade row with its price
test_tuples = list(zip(trade_indices, prices))
print(test_tuples)
dfo = df.copy()  # copy, so the input df really is left as-is
dfo['is_bid_trade'] = 0  # create your new columns with zeroes
dfo['is_ask_trade'] = 0

# iterate over tuples; take the full range from 0 up to the row the trade occurred,
# look for price in either the ask or bid price columns, then take the last row (tail(1)).
# tail(1) will be the most recent matching row before the trade
for (tradei, price) in test_tuples:
    print(tradei, price)
    prior = df[0:tradei]
    dftemp = prior[(prior[['ask_price', 'bid_price']] == price).any(axis=1)].tail(1)
    # print(dftemp)
    if dftemp.empty:
        continue  # no earlier row quotes this price; leave both flags at 0
    dfindex = dftemp.index[0]  # row of the matching quote
    # test if in ask or bid then write to dfo
    if dftemp['ask_price'].iat[0] == price:
        #dfo.at[dfindex, 'is_ask_trade'] = 1
        dfo.at[tradei, 'is_ask_trade'] = 1
    else:
        #dfo.at[dfindex, 'is_bid_trade'] = 1
        dfo.at[tradei, 'is_bid_trade'] = 1

Output:

In [4]: dfo
Out[4]:
datetime                     type  price  ask_price bid_price is_bid_trade is_ask_trade
2020-11-16 01:39:06.22021017 Quote NaN    7026.5    7024.0    0    0
2020-11-16 01:39:06.22021020 Trade 7026.5 7026.5    7024.5    0    1
2020-11-16 01:39:06.22021022 Trade 7026.5 7026.0    7024.5    0    1
2020-11-16 01:39:06.22021031 Quote NaN    7026.5    7024.0    0    0
2020-11-16 01:39:06.22021033 Quote NaN    7026.0    7024.5    0    0
2020-11-16 01:39:06.22021036 Trade 7024.0 7026.5    7024.5    1    0
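
If you want to reuse the same look-back, it could be wrapped in a function along these lines. This is just a sketch of the approach above, not a tested general solution: the name flag_trade_side is made up, it assumes the rows are already in the order you want to search, and it assumes that when the nearest matching row has the price on both sides, the ask side wins.

```python
import numpy as np
import pandas as pd


def flag_trade_side(df):
    """Return a copy of df with is_bid_trade / is_ask_trade flags.

    Hypothetical helper (not a pandas API). Rows are scanned in their
    existing order, so the frame must already be in the intended sequence.
    """
    out = df.reset_index(drop=True).copy()
    out['is_bid_trade'] = 0
    out['is_ask_trade'] = 0
    ask = out['ask_price'].to_numpy()
    bid = out['bid_price'].to_numpy()

    # After reset_index, each label is also the row position
    trades = out.loc[out['type'] == 'Trade', 'price']
    for pos, price in trades.items():
        # Positions of earlier rows quoting this price on either side
        hits = np.flatnonzero((ask[:pos] == price) | (bid[:pos] == price))
        if hits.size == 0:
            continue  # no earlier quote at this price: both flags stay 0
        nearest = hits[-1]
        # Assumption: ask takes priority if both sides match on that row
        side = 'is_ask_trade' if ask[nearest] == price else 'is_bid_trade'
        out.at[pos, side] = 1
    return out


sample = pd.DataFrame({
    'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
    'price': [np.nan, 7026.5, 7026.5, np.nan, np.nan, 7024.0],
    'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5],
    'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5]})
result = flag_trade_side(sample)
print(result[['type', 'price', 'is_bid_trade', 'is_ask_trade']])
```

It still loops once per trade, so it is no faster in principle than the loop above; the gain is only that the input frame is never modified and the no-match case is handled explicitly.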


 