Speed up search for nearest upper and lower value in large pandas dataframe
My dataframe looks similar to the example below, just with many more entries. For each group (defined by column a), I want to obtain the nearest values above and below a given value.
a b
600 10
600 12
600 15
600 17
700 8
700 11
700 19
For example, for a value of 13 I would like to obtain a new dataframe similar to:
a b
600 12
600 15
700 11
700 19
I already tried the solution from Ivo Merchiers in "How do I find the closest values in a Pandas series to an input number?", using groupby and apply to run it for the different groups.
def find_neighbours(group, value):
    # groupby.apply passes each group as the first argument
    exactmatch = group[group.b == value]
    if not exactmatch.empty:
        return exactmatch.index
    else:
        # Index labels of the closest values below and above `value`
        lowerneighbour_ind = group[group.b < value].b.idxmax()
        upperneighbour_ind = group[group.b > value].b.idxmin()
        return [lowerneighbour_ind, upperneighbour_ind]

df = df.groupby('a').apply(find_neighbours, 13)
But since my dataset has around 16 million rows, this procedure takes extremely long. Is there a faster way to obtain a solution?
Edit: Thanks for your answers. I forgot to add some info. If a nearest number appears multiple times, I would like all of its rows transferred to the new dataframe. And when a group has only an upper (or only a lower) neighbour, its rows should be ignored.
a b
600 10
600 12
600 15
600 17
700 8
700 11
700 19
800 14
800 15
900 12
900 14
900 14
For a value of 13, this leads to:
a b
600 12
600 15
700 11
700 19
900 12
900 14
900 14
Thanks for your help!
Yes, we can speed it up:
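import numpy as np

v = 13
s = df.b - v  # signed distance of each row from v
# Smallest absolute distance per group and per side (sign) of v
t = s.abs().groupby([df.a, np.sign(s)]).transform('min')
df1 = df.loc[s.abs() == t]
# Keep only groups whose selected rows lie on more than one side of v
df1 = df1[df1.b.sub(v).groupby(df1.a).transform('nunique') > 1]
df1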
Out[102]:
a b
1 600 12
2 600 15
5 700 11
6 700 19
9 900 12
10 900 14
11 900 14
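For anyone who wants to check the approach above end to end, here is a minimal self-contained sketch that wraps the same idea in a function and runs it on the sample data from the question. The function name `nearest_neighbours` and the parameters `group_col` and `value_col` are made up for illustration and are not part of the original answer. Rows equal to the searched value are not treated specially here, since the question does not say how they should be handled.

```python
import numpy as np
import pandas as pd


def nearest_neighbours(df, value, group_col="a", value_col="b"):
    """Sketch: per group, keep all rows holding the nearest value on each side of `value`.

    Groups whose selected rows all lie on one side are dropped.
    `nearest_neighbours`, `group_col` and `value_col` are hypothetical names.
    """
    diff = df[value_col] - value      # signed distance from the target
    side = np.sign(diff)              # -1 below, +1 above, 0 for an exact match
    dist = diff.abs()
    # Smallest distance within each (group, side) pair, broadcast back to the rows.
    nearest = dist.groupby([df[group_col], side]).transform("min")
    mask = dist == nearest
    picked = df[mask]
    # Count how many distinct sides were picked in each group.
    sides = side[mask].groupby(picked[group_col]).transform("nunique")
    return picked[sides > 1]


# Sample data from the question's edit.
df = pd.DataFrame({
    "a": [600, 600, 600, 600, 700, 700, 700, 800, 800, 900, 900, 900],
    "b": [10, 12, 15, 17, 8, 11, 19, 14, 15, 12, 14, 14],
})
result = nearest_neighbours(df, 13)
print(result)
```

Everything here is vectorised over the whole frame, so no Python-level function is called once per group, which is presumably where the groupby/apply version loses most of its time.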
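Try this:
def neighbours(x):
    d = df.b - x  # signed distance of each row from x
    # First row at the smallest positive distance, then first row at the largest negative distance.
    # Note: this searches the whole dataframe and does not group by column a.
    return df.loc[[d[d == d[d > 0].min()].index[0], d[d == d[d < 0].max()].index[0]]]

neighbours(13)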
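The function above looks at the dataframe as a whole and returns only one row per side. A grouped variant that also keeps ties and drops groups missing one side could be sketched as follows. This is a rough sketch under the assumptions of the question's edit, not the answerer's code, and the names `grouped_neighbours`, `group_col` and `value_col` are hypothetical.

```python
import pandas as pd


def grouped_neighbours(df, x, group_col="a", value_col="b"):
    """Sketch: per group, all rows equal to the nearest value below or above `x`.

    `grouped_neighbours`, `group_col` and `value_col` are hypothetical names.
    """
    below = df[df[value_col] < x]
    above = df[df[value_col] > x]
    # Nearest value on each side, one entry per group that has such a value.
    lower = below.groupby(group_col)[value_col].max()
    upper = above.groupby(group_col)[value_col].min()
    # Map each row's group to its bounds; a group missing a side maps to NaN.
    lo = df[group_col].map(lower)
    hi = df[group_col].map(upper)
    has_both = lo.notna() & hi.notna()
    is_neighbour = (df[value_col] == lo) | (df[value_col] == hi)
    return df[has_both & is_neighbour]


# Sample data from the question's edit.
df = pd.DataFrame({
    "a": [600, 600, 600, 600, 700, 700, 700, 800, 800, 900, 900, 900],
    "b": [10, 12, 15, 17, 8, 11, 19, 14, 15, 12, 14, 14],
})
result = grouped_neighbours(df, 13)
print(result)
```

In this sketch, rows exactly equal to `x` count as neither a lower nor an upper neighbour, so they are never returned. That is an assumption, as the question's edit does not cover exact matches.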