Pandas: Find closest group from another dataframe
Below, I have two dataframes. I need to update df_mapped using df_original: for each x_time in df_mapped, I need to find the 3 closest rows in df_original (closest meaning the smallest absolute difference in x_price) and add them to df_mapped.
import io
import pandas as pd
d = """
x_time expiration x_price p_price
60 4 10 20
60 5 11 30
60 6 12 40
60 7 13 50
60 8 14 60
70 5 10 20
70 6 11 30
70 7 12 40
70 8 13 50
70 9 14 60
80 1 10 20
80 2 11 30
80 3 12 40
80 4 13 50
80 5 14 60
"""
df_original = pd.read_csv(io.StringIO(d), delim_whitespace=True)
to_mapped = """
x_time expiration x_price
50 4 15
60 5 15
70 6 13
80 7 20
90 8 20
"""
df_mapped = pd.read_csv(io.StringIO(to_mapped), delim_whitespace=True)
df_mapped = df_mapped.merge(df_original, on='x_time', how='left')
df_mapped['x_price_delta'] = abs(df_mapped['x_price_x'] - df_mapped['x_price_y'])
**Intermediate output: here, the 3 rows with the smallest x_price_delta need to be selected for each x_time**
int_out = """
x_time expiration_x x_price_x expiration_y x_price_y p_price x_price_delta
50 4 15
60 5 15 6 12 40 3
60 5 15 7 13 50 2
60 5 15 8 14 60 1
70 6 13 7 12 40 1
70 6 13 8 13 50 0
70 6 13 9 14 60 1
80 7 20 3 12 40 8
80 7 20 4 13 50 7
80 7 20 5 14 60 6
90 8 20
"""
df_int_out = pd.read_csv(io.StringIO(int_out), delim_whitespace=True)
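The selection of the 3 smallest x_price_delta rows per x_time could be sketched as follows. This is only one possible approach, assuming the two frames from the question rebuilt inline; the names `merged` and `closest` are illustrative and do not appear in the question.

```python
import pandas as pd

# Same data as df_original and df_mapped in the question, rebuilt inline
df_original = pd.DataFrame({
    "x_time":     [60] * 5 + [70] * 5 + [80] * 5,
    "expiration": [4, 5, 6, 7, 8, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5],
    "x_price":    [10, 11, 12, 13, 14] * 3,
    "p_price":    [20, 30, 40, 50, 60] * 3,
})
df_mapped = pd.DataFrame({
    "x_time":     [50, 60, 70, 80, 90],
    "expiration": [4, 5, 6, 7, 8],
    "x_price":    [15, 15, 13, 20, 20],
})

# Left merge keeps x_time values (50, 90) that have no match in df_original
merged = df_mapped.merge(df_original, on="x_time", how="left")
merged["x_price_delta"] = (merged["x_price_x"] - merged["x_price_y"]).abs()

# Sort by delta (NaN goes last), keep the first 3 rows of each x_time,
# then restore the original row order of the merged frame
closest = (
    merged.sort_values("x_price_delta", kind="stable")
          .groupby("x_time")
          .head(3)
          .sort_index()
)
print(closest)
```

Unmatched x_time values survive as a single row of NaN values. If several rows tie at the cut-off, the stable sort keeps whichever came first in df_original, which may or may not be the tie-breaking rule you want.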
**Final step: keeping x_time fixed, the dataframe needs to be flattened so that the 3 closest rows end up in a single row**
final_out = """
x_time expiration_original x_price_original expiration_1 x_price_1 p_price_1 expiration_2 x_price_2 p_price_2 expiration_3 x_price_3 p_price_3
50 4 15
60 5 15 6 12 40 7 13 50 8 14 60
70 6 13 7 12 40 8 13 50 9 14 60
80 7 20 3 12 40 4 13 50 5 14 60
90 8 20
"""
df_out = pd.read_csv(io.StringIO(final_out), delim_whitespace=True)
I am stuck between the intermediate and the final step. I can't think of a way forward; how could I reshape the dataframe?
This is not a complete solution, but it might help you get unstuck.
At the end we get the correct data, though the columns are not yet labelled.
In [1]: df = (
   ...:     df_int_out.groupby("x_time")
   ...:     .apply(lambda x: x.sort_values(ascending=False, by="x_price_delta"))
   ...:     .set_index(["x_time", "expiration_x"])
   ...:     .drop(["x_price_delta", "x_price_x"], axis=1)
   ...: )
In [2]: df1 = df.iloc[1:-1]
In [3]: df1.groupby(df1.index).apply(
   ...:     lambda x: pd.concat([pd.DataFrame(d) for d in x.values], axis=1).unstack()
   ...: )
Out[3]:
0
0 1 2 0 1 2 0 1 2
(60, 5) 6.0 12.0 40.0 7.0 13.0 50.0 8.0 14.0 60.0
(70, 6) 7.0 12.0 40.0 9.0 14.0 60.0 8.0 13.0 50.0
(80, 7) 3.0 12.0 40.0 4.0 13.0 50.0 5.0 14.0 60.0
I am sure there are much better ways of handling this case.
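One such alternative for the flattening step, offered as a sketch rather than a definitive answer: number the rows within each x_time with `groupby().cumcount()`, then `unstack` that number into the columns. The frame below is typed in by hand to mirror the question's intermediate output; the helper names `rank`, `value_cols`, `names`, `ordered` and `wide` are my own.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hand-typed copy of the intermediate output from the question
df_int_out = pd.DataFrame(
    [
        [50, 4, 15, nan, nan, nan, nan],
        [60, 5, 15, 6, 12, 40, 3],
        [60, 5, 15, 7, 13, 50, 2],
        [60, 5, 15, 8, 14, 60, 1],
        [70, 6, 13, 7, 12, 40, 1],
        [70, 6, 13, 8, 13, 50, 0],
        [70, 6, 13, 9, 14, 60, 1],
        [80, 7, 20, 3, 12, 40, 8],
        [80, 7, 20, 4, 13, 50, 7],
        [80, 7, 20, 5, 14, 60, 6],
        [90, 8, 20, nan, nan, nan, nan],
    ],
    columns=["x_time", "expiration_x", "x_price_x",
             "expiration_y", "x_price_y", "p_price", "x_price_delta"],
)

value_cols = ["expiration_y", "x_price_y", "p_price"]
names = {"expiration_y": "expiration", "x_price_y": "x_price", "p_price": "p_price"}

# Position of each row within its x_time group: 1, 2, 3
df_int_out["rank"] = df_int_out.groupby("x_time").cumcount() + 1

# Move the rank from the row index into the columns
wide = (
    df_int_out.set_index(["x_time", "expiration_x", "x_price_x", "rank"])[value_cols]
              .unstack("rank")
)

# Reorder so each match's columns sit together, then flatten the column names
ordered = [(col, rank) for rank in (1, 2, 3) for col in value_cols]
wide = wide[ordered]
wide.columns = [f"{names[col]}_{rank}" for col, rank in wide.columns]

wide = wide.reset_index().rename(
    columns={"expiration_x": "expiration_original", "x_price_x": "x_price_original"}
)
print(wide)
```

The matches keep the row order of the intermediate frame, so sort by x_price_delta before computing `rank` if match 1 should always be the closest one.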