
Filtering pandas dataframe by min-max kwargs

I have a function that takes optional kwargs (8 in total) based on min and max values entered by the user.

For example GR_min, GR_max, GR_N_min, GR_N_max, Hi_min, Hi_max, etc., where the dataframe columns are GR, GR_N, Hi, etc.

I'd like the dataframe to be filtered on the given min and max values, but if one or more of the values are not given in the function call, the defaults should be just the min and max of the corresponding columns.

For example, some pseudocode:

df = pd.DataFrame({'GR': [1, 2, 3, 4, 2, 3], 
'GR_N': [0.8, 0, 1, 0.6, 0.9, 1], 'Hi':[3, 6, 2, 5, 22, 7]})

This gives me:

    GR  GR_N    Hi
0   1   0.8     3
1   2   0.0     6
2   3   1.0     2
3   4   0.6     5
4   2   0.9     22
5   3   1.0     7

I want a function that does something like this:

def picker(data, **kwargs):

      data_filtered = data[data['GR'].between(GR_min, GR_max) &
                         data['GR_N'].between(GR_N_min, GR_N_max) &
                         data['Hi'].between(Hi_min, Hi_max)]

      return data_filtered

With the output after the call being:

picker(data=df, GR_min=2, GR_max=3, Hi_min=1, Hi_max=6)

    GR  GR_N    Hi
1   2   0.0     6
2   3   1.0     2

Except that instead of explicitly naming each column of the dataframe, we use the **kwargs themselves to filter on.

Is there any way to do this?

DataFrame.query can be handy here because it parses a string containing conditions, so it is enough to build a condition string from the keyword parameters.

Each individual condition can be built as K<=val for a K_max=val parameter and K>=val for a K_min=val parameter. To build the full condition string, each individual condition must be enclosed in parentheses ( () ) and then joined with & .

Code could be:

def picker(data, **kwargs):
    def make_cond(k, v):
        # turn one keyword argument into a query condition such as '(GR>=2)'
        if len(k) < 5:
            raise ValueError('Arg too short {}'.format(k))
        if k.endswith('_min'):
            return '({}>={})'.format(k[:-4], v)
        elif k.endswith('_max'):
            return '({}<={})'.format(k[:-4], v)
        else:
            raise ValueError('Unknown arg {}'.format(k))
    if not kwargs:
        return data    # no bounds given: nothing to filter
    strcond = '&'.join(make_cond(k, v) for k, v in kwargs.items())
    # print(strcond)     # uncomment for traces
    return data.query(strcond)
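
A variant that avoids building a query string is to accumulate a boolean mask directly from the keyword arguments. The sketch below is only an illustration: the name picker_mask is hypothetical, and it assumes every keyword is a column name followed by _min or _max. It is run on the dataframe from the question.

```python
import pandas as pd

def picker_mask(data, **kwargs):
    """Filter `data` using <column>_min / <column>_max keyword arguments."""
    mask = pd.Series(True, index=data.index)  # start by keeping every row
    for key, value in kwargs.items():
        if key.endswith('_min'):
            mask &= data[key[:-4]] >= value   # inclusive lower bound
        elif key.endswith('_max'):
            mask &= data[key[:-4]] <= value   # inclusive upper bound
        else:
            raise ValueError('Unknown arg {}'.format(key))
    return data[mask]

# dataframe taken from the question
df = pd.DataFrame({'GR': [1, 2, 3, 4, 2, 3],
                   'GR_N': [0.8, 0, 1, 0.6, 0.9, 1],
                   'Hi': [3, 6, 2, 5, 22, 7]})

result = picker_mask(df, GR_min=2, GR_max=3, Hi_min=1, Hi_max=6)
print(result)  # keeps the rows with index 1 and 2
```

Because the column is looked up with data[...], this form does not depend on the column name being something that a query string can parse, and a call with no keyword arguments simply returns every row.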

You could have a default dictionary for your kwargs specifying the min and max as -infinity and +infinity, and then just override these with the user input. Something like this:

import numpy as np
def picker(data, **kwargs):
    d = dict(GR_min=-np.inf, GR_max=np.inf) # ... etc
    kwargs = {**d, **kwargs}
    data_filtered = data[data['GR'].between(kwargs["GR_min"], kwargs["GR_max"])] # ... etc
    return data_filtered
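
To avoid writing out the "... etc" parts by hand, the default dictionary can be generated from the dataframe's columns. This is a minimal sketch under stated assumptions: the name picker_defaults is hypothetical, all columns are assumed to be numeric, and rows containing NaN would be dropped because between treats NaN as outside any range.

```python
import numpy as np
import pandas as pd

def picker_defaults(data, **kwargs):
    # one (-inf, +inf) pair per column, so an omitted bound never excludes a row
    bounds = {}
    for col in data.columns:
        bounds[col + '_min'] = -np.inf
        bounds[col + '_max'] = np.inf

    # reject keywords that do not correspond to any column
    unknown = set(kwargs) - set(bounds)
    if unknown:
        raise ValueError('Unknown args {}'.format(sorted(unknown)))

    bounds.update(kwargs)  # user-supplied values override the defaults

    mask = pd.Series(True, index=data.index)
    for col in data.columns:
        mask &= data[col].between(bounds[col + '_min'], bounds[col + '_max'])
    return data[mask]

# dataframe taken from the question
df = pd.DataFrame({'GR': [1, 2, 3, 4, 2, 3],
                   'GR_N': [0.8, 0, 1, 0.6, 0.9, 1],
                   'Hi': [3, 6, 2, 5, 22, 7]})

filtered = picker_defaults(df, GR_min=2, GR_max=3, Hi_min=1, Hi_max=6)
print(filtered)  # keeps the rows with index 1 and 2
```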

I'm a bit perplexed by this: filtering on the min-max values of the columns would amount to not filtering at all, no? Why not filter only on the arguments that are provided? Regardless, this sounds like a case for default arguments.

import pandas as pd

# create the DataFrame
df = pd.DataFrame({'GR': [1, 2, 3, 4, 2, 3],
                   'GR_N': [0.8, 0, 1, 0.6, 0.9, 1],
                   'Hi': [3, 6, 2, 5, 22, 7]})

def picker(df, GR_min=None, GR_max=None, GR_N_min=None, GR_N_max=None,
           Hi_min=None, Hi_max=None):  # use default arguments
    # fall back to the column's own min/max when a bound is not given
    if GR_min is None:
        GR_min = df['GR'].min()
    if GR_max is None:
        GR_max = df['GR'].max()
    if GR_N_min is None:
        GR_N_min = df['GR_N'].min()
    if GR_N_max is None:
        GR_N_max = df['GR_N'].max()
    if Hi_min is None:
        Hi_min = df['Hi'].min()
    if Hi_max is None:
        Hi_max = df['Hi'].max()

    # filter the DataFrame with inclusive masks so rows on a bound are kept
    df_out = df.loc[(df['GR'] >= GR_min) & (df['GR'] <= GR_max) &
                    (df['GR_N'] >= GR_N_min) & (df['GR_N'] <= GR_N_max) &
                    (df['Hi'] >= Hi_min) & (df['Hi'] <= Hi_max)]
    return df_out
