
Speed up search for nearest upper and lower value in large pandas dataframe

My dataframe looks similar to the example below, just with many more entries. For each group (each value of a), I want to obtain the nearest upper and lower number for a given value.

a    b  
600  10
600  12
600  15
600  17
700   8
700  11
700  19

For example, for a value of 13 I would like to obtain a new dataframe similar to:

a    b  
600  12
600  15
700  11
700  19

I already tried the solution from Ivo Merchiers in "How do I find the closest values in a Pandas series to an input number?", using groupby and apply to run it for the different groups.

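def find_neighbours(group, value):
    # groupby().apply() passes each group as the first argument
    exactmatch = group[group.b == value]
    if not exactmatch.empty:
        return exactmatch.index
    else:
        # raises ValueError if the group has no value below (or above) `value`
        lowerneighbour_ind = group[group.b < value].b.idxmax()
        upperneighbour_ind = group[group.b > value].b.idxmin()
        return [lowerneighbour_ind, upperneighbour_ind]

df = df.groupby('a').apply(find_neighbours, 13)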

But since my dataset has around 16 million lines, this procedure takes extremely long. Is there a faster way to obtain a solution?

Edit: Thanks for your answers. I forgot to add some info. If a close number appears multiple times, I would like to have all of those lines transferred to the new dataframe. And when a group has only an upper (lower) neighbour and no lower (upper) neighbour, its lines should be ignored.

a    b  
600  10
600  12
600  15
600  17
700   8
700  11
700  19
800  14
800  15
900  12
900  14
900  14

For 13, this leads to:

a    b  
600  12
600  15
700  11
700  19
900  12
900  14
900  14

Thanks for your help!

Yes, we can speed it up:

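import numpy as np

v = 13

# signed distance of every b from the target value
s = df.b - v
# smallest absolute distance per group and per side (below / equal / above)
t = s.abs().groupby([df.a, np.sign(s)]).transform('min')
# keep the rows at that minimum distance; ties are all kept
df1 = df.loc[s.abs() == t]
# drop groups where only one distinct value survived (neighbour on one side only)
df1 = df1[df1.b.sub(v).groupby(df1.a).transform('nunique') > 1]
df1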
Out[102]: 
      a   b
1   600  12
2   600  15
5   700  11
6   700  19
9   900  12
10  900  14
11  900  14
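
To make the steps above easier to reuse and check, here is a minimal sketch that wraps the same idea in a function and runs it on the extended example from the question. The function name nearest_per_group and the parameters group_col and value_col are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def nearest_per_group(df, v, group_col="a", value_col="b"):
    """Rows holding the closest value below and above v in each group.

    Hypothetical helper: same logic as the snippet above, wrapped in a function.
    """
    s = df[value_col] - v              # signed distance to v
    side = np.sign(s)                  # -1 below, 0 equal, 1 above
    # smallest absolute distance within each (group, side) pair
    t = s.abs().groupby([df[group_col], side]).transform("min")
    out = df.loc[s.abs() == t]         # ties are all kept
    # keep only groups where more than one distinct value survived
    n = out[value_col].groupby(out[group_col]).transform("nunique")
    return out[n > 1]


# extended example data from the question
df = pd.DataFrame({
    "a": [600, 600, 600, 600, 700, 700, 700, 800, 800, 900, 900, 900],
    "b": [10, 12, 15, 17, 8, 11, 19, 14, 15, 12, 14, 14],
})
result = nearest_per_group(df, 13)
print(result)
```

Everything here is a vectorised groupby/transform call, so no Python function is invoked per group. Whether that is fast enough for 16 million rows would have to be measured on the real data.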

Try this:

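def neighbours(x):
    # signed distance of every b from x (whole dataframe, no grouping by a)
    d = df.b - x
    # first row at the smallest positive distance, then first row at the largest negative distance
    return df.loc[[d[d == d[d > 0].min()].index[0], d[d == d[d < 0].max()].index[0]]]

neighbours(13)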
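
The function above looks at the dataframe as a whole and returns one row per side. A possible per-group variant, sketched under the assumption that rows exactly equal to x should not count as neighbours (the question does not say), could use masked groupby transforms. The name neighbours_by_group and its parameters are hypothetical.

```python
import pandas as pd


def neighbours_by_group(df, x, group_col="a", value_col="b"):
    """Per group, rows at the nearest value below x and the nearest above x.

    Assumption: values equal to x are ignored; groups lacking a neighbour
    on either side are dropped entirely.
    """
    b = df[value_col]
    g = df[group_col]
    # largest value strictly below x and smallest strictly above x, per group
    lower = b.where(b < x).groupby(g).transform("max")
    upper = b.where(b > x).groupby(g).transform("min")
    # a group qualifies only if it has a neighbour on both sides
    both_sides = lower.notna() & upper.notna()
    # keep every row equal to one of the two neighbours (ties included)
    return df[both_sides & (b.eq(lower) | b.eq(upper))]


# extended example data from the question
df = pd.DataFrame({
    "a": [600, 600, 600, 600, 700, 700, 700, 800, 800, 900, 900, 900],
    "b": [10, 12, 15, 17, 8, 11, 19, 14, 15, 12, 14, 14],
})
per_group = neighbours_by_group(df, 13)
print(per_group)
```

On the example data this keeps both rows with b equal to 14 in group 900 and drops group 800, which has no value below 13.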
