Pandas: Find closest group from another dataframe

Below, I have two dataframes. I need to update df_mapped using df_original. For each x_time in df_mapped, I need to find the 3 closest rows in df_original (closest meaning the smallest absolute difference in x_price) and add them to df_mapped.

import io
import pandas as pd

d = """
x_time    expiration    x_price    p_price
 60          4           10                  20
 60          5           11                  30
 60          6           12                  40
 60          7           13                  50
 60          8           14                  60
 70          5           10                  20
 70          6           11                  30
 70          7           12                  40
 70          8           13                  50
 70          9           14                  60
 80          1           10                  20
 80          2           11                  30
 80          3           12                  40
 80          4           13                  50
 80          5           14                  60
"""

df_original = pd.read_csv(io.StringIO(d), sep=r"\s+")

to_mapped = """
x_time    expiration    x_price
 50          4          15
 60          5          15
 70          6          13
 80          7          20
 90          8          20
"""

df_mapped = pd.read_csv(io.StringIO(to_mapped), sep=r"\s+")

# Left merge keeps every df_mapped row, including x_time values with no match.
df_mapped = df_mapped.merge(df_original, on='x_time', how='left')
df_mapped['x_price_delta'] = abs(df_mapped['x_price_x'] - df_mapped['x_price_y'])

**Intermediate output: from this, I need to select the 3 rows with the smallest x_price_delta for each x_time**

int_out = """    
x_time  expiration_x    x_price_x   expiration_y    x_price_y   p_price x_price_delta
50  4   15              
60  5   15  6   12  40  3
60  5   15  7   13  50  2
60  5   15  8   14  60  1
70  6   13  7   12  40  1
70  6   13  8   13  50  0
70  6   13  9   14  60  1
80  7   20  3   12  40  8
80  7   20  4   13  50  7
80  7   20  5   14  60  6
90  8   20              
"""
df_int_out = pd.read_csv(io.StringIO(int_out), sep=r"\s+")
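
The selection described above can be sketched roughly as follows. This is a minimal illustration and not code from the original post: it rebuilds the two frames inline, and the names `merged`, `ranked` and `closest` are made up for the example. It assumes ties in x_price_delta may be broken by original row order.

```python
import pandas as pd

# Same values as df_original and df_mapped in the question, built inline.
df_original = pd.DataFrame({
    "x_time": [60] * 5 + [70] * 5 + [80] * 5,
    "expiration": [4, 5, 6, 7, 8, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5],
    "x_price": [10, 11, 12, 13, 14] * 3,
    "p_price": [20, 30, 40, 50, 60] * 3,
})
df_mapped = pd.DataFrame({
    "x_time": [50, 60, 70, 80, 90],
    "expiration": [4, 5, 6, 7, 8],
    "x_price": [15, 15, 13, 20, 20],
})

merged = df_mapped.merge(df_original, on="x_time", how="left")
merged["x_price_delta"] = (merged["x_price_x"] - merged["x_price_y"]).abs()

# Sort by delta (stable, NaN rows go last), keep the first 3 rows of each
# x_time group, then restore the original row order of the merge.
ranked = merged.sort_values("x_price_delta", kind="stable")
closest = ranked.groupby("x_time").head(3).sort_index()

print(closest)
```

Unmatched x_time values (50 and 90 here) survive as single rows with NaN in the columns coming from df_original, because the left merge keeps them and `head(3)` does not filter on NaN.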

**Final step: keeping x_time fixed, I need to flatten the dataframe so that the 3 closest rows end up in a single row**

final_out = """
x_time  expiration_original x_price_original    expiration_1    x_price_1   p_price_1   expiration_2    x_price_2   p_price_2   expiration_3    x_price_3   p_price_3
50  4   15                                  
60  5   15  6   12  40  7   13  50  8   14  60
70  6   13  7   12  40  8   13  50  9   14  60
80  7   20  3   12  40  4   13  50  5   14  60
90  8   20                                  
"""
df_out = pd.read_csv(io.StringIO(final_out), sep=r"\s+")

I am stuck between the intermediate and the final step. I can't think of a way out; what could be done to massage the dataframe?

This is not a complete solution, but it might help you get unstuck.

At the end we get the correct data, although the columns still need renaming.

In [1]: df = (df_int_out.groupby("x_time")
     ...:       .apply(lambda x: x.sort_values(by="x_price_delta", ascending=False))
     ...:       .set_index(["x_time", "expiration_x"])
     ...:       .drop(["x_price_delta", "x_price_x"], axis=1))

In [2]: df1 = df.iloc[1:-1]

In [3]: df1.groupby(df1.index).apply(
     ...:     lambda x: pd.concat([pd.DataFrame(d) for d in x.values], axis=1).unstack())
Out[3]:
           0
           0     1     2    0     1     2    0     1     2
(60, 5)  6.0  12.0  40.0  7.0  13.0  50.0  8.0  14.0  60.0
(70, 6)  7.0  12.0  40.0  9.0  14.0  60.0  8.0  13.0  50.0
(80, 7)  3.0  12.0  40.0  4.0  13.0  50.0  5.0  14.0  60.0

I am sure there are much better ways of handling this case.
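
One possible alternative for the flattening step, offered as a sketch rather than a definitive answer: number the rows within each x_time with `groupby(...).cumcount()` and pivot that number into the columns with `unstack`. The helper column `rank` and the names `wide`, `ordered` and `short` are illustrative only, and the sketch assumes the frame already holds at most 3 rows per x_time in the desired order.

```python
import pandas as pd

nan = float("nan")
# The intermediate frame from the question, built inline.
df_int_out = pd.DataFrame(
    [
        (50, 4, 15, nan, nan, nan, nan),
        (60, 5, 15, 6, 12, 40, 3),
        (60, 5, 15, 7, 13, 50, 2),
        (60, 5, 15, 8, 14, 60, 1),
        (70, 6, 13, 7, 12, 40, 1),
        (70, 6, 13, 8, 13, 50, 0),
        (70, 6, 13, 9, 14, 60, 1),
        (80, 7, 20, 3, 12, 40, 8),
        (80, 7, 20, 4, 13, 50, 7),
        (80, 7, 20, 5, 14, 60, 6),
        (90, 8, 20, nan, nan, nan, nan),
    ],
    columns=["x_time", "expiration_x", "x_price_x",
             "expiration_y", "x_price_y", "p_price", "x_price_delta"],
)

keys = ["x_time", "expiration_x", "x_price_x"]
value_cols = ["expiration_y", "x_price_y", "p_price"]
short = {"expiration_y": "expiration", "x_price_y": "x_price", "p_price": "p_price"}

# Number the matches 1, 2, 3 within each x_time, then move that number
# from the row index into the columns.
numbered = df_int_out.assign(rank=df_int_out.groupby("x_time").cumcount() + 1)
wide = numbered.set_index(keys + ["rank"])[value_cols].unstack("rank")

# Reorder so the columns for match 1 come first, then match 2, then match 3.
ordered = [(col, r) for r in (1, 2, 3) for col in value_cols]
wide = wide.reindex(columns=pd.MultiIndex.from_tuples(ordered))
wide.columns = [f"{short[col]}_{r}" for col, r in ordered]

wide = wide.reset_index().rename(
    columns={"expiration_x": "expiration_original", "x_price_x": "x_price_original"}
)
print(wide)
```

Rows for x_time 50 and 90 have no match, so their `*_1`, `*_2` and `*_3` columns stay NaN, which mirrors the blank cells in the desired output.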
