
Selecting data from Pandas dataframe based on criteria stored in a dict

I have a Pandas dataframe that contains a large number of variables. This can be simplified as:

import pandas as pd

tempDF = pd.DataFrame({ 'var1': [12,12,12,12,45,45,45,51,51,51],
                        'var2': ['a','a','b','b','b','b','b','c','c','d'],
                        'var3': ['e','f','f','f','f','g','g','g','g','g'],
                        'var4': [1,2,3,3,4,5,6,6,6,7]})

If I wanted to select a subset of the dataframe (e.g. var2='b' and var4=3), I would use:

tempDF.loc[(tempDF['var2']=='b') & (tempDF['var4']==3),:]

However, is it possible to select a subset of the dataframe if the matching criteria are stored within a dict, such as:

tempDict = {'var2': 'b','var4': 3}

It's important that the variable names are not predefined and that the number of variables included in the dict can change.

I've been puzzling over this for a while, so any suggestions would be greatly appreciated.

You can evaluate a series of conditions. They don't have to be equality tests.

df = tempDF
d = tempDict

# `repr` returns the string representation of an object.    
>>> df[eval(" & ".join(["(df['{0}'] == {1})".format(col, repr(cond))
       for col, cond in d.items()]))]
   var1 var2 var3  var4
2    12    b    f     3
3    12    b    f     3

Looking at what eval does here:

conditions = " & ".join(["(df['{0}'] == {1})".format(col, repr(cond))
       for col, cond in d.items()])

>>> conditions
"(df['var2'] == 'b') & (df['var4'] == 3)"

>>> eval(conditions)
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool
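The same mask can also be built without eval, which avoids executing a generated string (eval runs whatever code the string contains, so it should only see trusted input). The following is a minimal sketch that assumes plain equality tests; the helper name mask_from_dict is hypothetical and not part of pandas.

```python
from functools import reduce
import operator

import pandas as pd

tempDF = pd.DataFrame({'var1': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51],
                       'var2': ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd'],
                       'var3': ['e', 'f', 'f', 'f', 'f', 'g', 'g', 'g', 'g', 'g'],
                       'var4': [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]})
tempDict = {'var2': 'b', 'var4': 3}


def mask_from_dict(df, criteria):
    """Hypothetical helper: AND together one equality mask per dict entry."""
    masks = (df[col] == value for col, value in criteria.items())
    # Start from an all-True mask so that an empty dict selects every row.
    all_true = pd.Series(True, index=df.index)
    return reduce(operator.and_, masks, all_true)


result = tempDF[mask_from_dict(tempDF, tempDict)]
print(result)
```

Because the column names and values never pass through a string, no quoting with repr is needed.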

Here is another example that combines an equality constraint with an inequality. Each dict value is an (operator, value) pair, and string values carry their own quotes:

d = {'var2': ('==', "'b'"),
     'var4': ('>', 3)}

>>> df[eval(" & ".join(["(df['{0}'] {1} {2})".format(col, cond[0], cond[1])
       for col, cond in d.items()]))]
   var1 var2 var3  var4
4    45    b    f     4
5    45    b    g     5
6    45    b    g     6
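The operator strings can also be mapped to functions from the standard operator module, so that no expression string has to be evaluated. This is only a sketch: the OPS table and the variable names are illustrative assumptions, and the values are stored unquoted here ('b' rather than "'b'") because nothing is formatted into a string.

```python
import operator

import pandas as pd

df = pd.DataFrame({'var1': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51],
                   'var2': ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd'],
                   'var3': ['e', 'f', 'f', 'f', 'f', 'g', 'g', 'g', 'g', 'g'],
                   'var4': [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]})

# Illustrative lookup table from operator strings to comparison functions.
OPS = {'==': operator.eq, '!=': operator.ne,
       '>': operator.gt, '>=': operator.ge,
       '<': operator.lt, '<=': operator.le}

d = {'var2': ('==', 'b'),
     'var4': ('>', 3)}

# AND the per-column masks together, starting from an all-True mask.
mask = pd.Series(True, index=df.index)
for col, (op, value) in d.items():
    mask &= OPS[op](df[col], value)

subset = df[mask]
print(subset)
```

An unknown operator string raises a KeyError from the lookup table instead of being executed.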

Another alternative is to use query:

qry = " & ".join('{0} {1} {2}'.format(k, cond[0], cond[1]) for k, cond in d.items())

>>> qry
"var2 == 'b' & var4 > 3"

>>> df.query(qry)
   var1 var2 var3  var4
4    45    b    f     4
5    45    b    g     5
6    45    b    g     6
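If the dict keys might not be valid Python identifiers (for example, column names containing spaces), query accepts backtick-quoted column names in recent pandas versions. The sketch below assumes simple equality criteria and uses repr to quote the values; the variable names are illustrative.

```python
import pandas as pd

tempDF = pd.DataFrame({'var1': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51],
                       'var2': ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd'],
                       'var3': ['e', 'f', 'f', 'f', 'f', 'g', 'g', 'g', 'g', 'g'],
                       'var4': [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]})
tempDict = {'var2': 'b', 'var4': 3}

# Wrap each column name in backticks and quote each value with repr.
qry = ' and '.join('`{0}` == {1!r}'.format(col, value)
                   for col, value in tempDict.items())

selected = tempDF.query(qry)
print(qry)
print(selected)
```

As with eval, the query string is parsed as an expression, so it is best built only from trusted keys and values.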

You could create a mask for each condition using a list comprehension, then combine them by converting to a dataframe and using all:

In [23]: pd.DataFrame([tempDF[key] == val for key, val in tempDict.items()]).T.all(axis=1)
Out[23]:
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

Then you could slice your dataframe with that mask:

mask = pd.DataFrame([tempDF[key] == val for key, val in tempDict.items()]).T.all(axis=1)

In [25]: tempDF[mask]
Out[25]:
   var1 var2 var3  var4
2    12    b    f     3
3    12    b    f     3
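A more compact variant of the same idea compares the relevant columns against a Series built from the dict, relying on pandas aligning the Series index with the dataframe's column labels. This is a sketch that assumes every dict key is an existing column.

```python
import pandas as pd

tempDF = pd.DataFrame({'var1': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51],
                       'var2': ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd'],
                       'var3': ['e', 'f', 'f', 'f', 'f', 'g', 'g', 'g', 'g', 'g'],
                       'var4': [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]})
tempDict = {'var2': 'b', 'var4': 3}

# The Series index holds the column names, the values hold the criteria.
criteria = pd.Series(tempDict)

# Column-wise comparison, then require every criterion to hold in a row.
mask = (tempDF[list(tempDict)] == criteria).all(axis=1)

selected = tempDF[mask]
print(selected)
```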

Here's one way to build up conditions from tempDict (pd.np is no longer available in current pandas, so NumPy is imported directly):

In [24]: import numpy as np

In [25]: tempDF.loc[np.all([tempDF[k] == tempDict[k] for k in tempDict], axis=0), :]
Out[25]:
   var1 var2 var3  var4
2    12    b    f     3
3    12    b    f     3

Or use query for a more readable, query-like string.

In [33]: tempDF.query(' & '.join(['{0}=={1}'.format(k, repr(v)) for k, v in tempDict.items()]))
Out[33]:
   var1 var2 var3  var4
2    12    b    f     3
3    12    b    f     3

In [34]: ' & '.join(['{0}=={1}'.format(k, repr(v)) for k, v in tempDict.items()])
Out[34]: "var2=='b' & var4==3"

Here's a function from my personal utils that accepts single values or lists to subset on:

def subsetdict(df, sdict):
    subsetter_list = [df[i].isin([j]) if not isinstance(j, list) else df[i].isin(j) for i, j in sdict.items()]
    subsetter = pd.concat(subsetter_list, axis=1).all(1)
    return df.loc[subsetter, :]
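For illustration, the function could be called as follows on the tempDF from the question. The function is repeated so the snippet runs on its own, and the criteria dicts are made up for the example.

```python
import pandas as pd


def subsetdict(df, sdict):
    # Wrap scalars in a list so that isin handles both cases uniformly.
    subsetter_list = [df[i].isin([j]) if not isinstance(j, list) else df[i].isin(j)
                      for i, j in sdict.items()]
    subsetter = pd.concat(subsetter_list, axis=1).all(axis=1)
    return df.loc[subsetter, :]


tempDF = pd.DataFrame({'var1': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51],
                       'var2': ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd'],
                       'var3': ['e', 'f', 'f', 'f', 'f', 'g', 'g', 'g', 'g', 'g'],
                       'var4': [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]})

# Single values only, matching the dict from the question.
single = subsetdict(tempDF, {'var2': 'b', 'var4': 3})

# A list means "any of these values" for that column.
mixed = subsetdict(tempDF, {'var1': [12, 51], 'var3': 'f'})

print(single)
print(mixed)
```

Note that an empty dict makes pd.concat raise a ValueError, so callers may want to handle that case separately.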
