Pandas: DataFrame filtering using groupby and a function
Using Python 3.3 and Pandas 0.10.
I have a DataFrame that is built by concatenating multiple CSV files. First, I filter out all values in the Name column that contain a certain string. The result looks something like this (shortened for brevity; there are actually more columns):
Name ID
'A' 1
'B' 2
'C' 3
'C' 3
'E' 4
'F' 4
... ...
Now my issue is that I want to remove a special case of 'duplicate' values. I want to remove all rows with a duplicated ID where the Name values mapped to that ID are not all the same. In the example above I would like to keep the rows with ID 1, 2 and 3. Where ID = 4, the Name values are unequal and I want to remove those rows.
I tried the following line of code (based on the suggestion here: Python Pandas: remove entries based on the number of occurrences).
Code:
df[df.groupby('ID').apply(lambda g: len({x for x in g['Name']})) == 1]
However, that gives me the error:
ValueError: Item wrong length 51906 instead of 109565!
Edit:
Instead of apply() I have also tried transform(); however, that gives me the error: AttributeError: 'int' object has no attribute 'ndim'. An explanation of why the error differs per function would be very much appreciated!
Also, I want to keep all rows where ID = 3 in the above example.
Thanks in advance, Matthijs
Instead of the length (len), I think you want to consider the number of unique values of Name in each group. Use nunique(), and check out this neat recipe for filtering groups.
df[df.groupby('ID').Name.transform(lambda x: x.nunique() == 1).astype('bool')]
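To see why the mask has to be built with transform, here is a minimal sketch on the six sample rows from the question. It assumes a reasonably recent pandas, where transform accepts the string name of a reduction such as "nunique"; the variable names are only illustrative. A plain per-group reduction yields one value per ID, which is shorter than the frame and is the kind of length mismatch behind the "Item wrong length" error, while transform broadcasts the per-group value back to every row.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# A per-group reduction gives one value per ID (4 entries here),
# so it cannot be used directly as a row mask for the 6-row frame.
per_group = df.groupby("ID")["Name"].nunique() == 1

# transform broadcasts the per-group count back onto every row,
# so the mask has the same length and index as df.
mask = df.groupby("ID")["Name"].transform("nunique") == 1

filtered = df[mask]
print(len(per_group), len(mask))
print(filtered)
```

Both rows with ID 3 survive, because the mask is evaluated per row of the original frame.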
If you upgrade to pandas 0.12, you can use the new filter method on groups, which makes this more succinct and straightforward.
df.groupby('ID').filter(lambda x: x.Name.nunique() == 1)
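As a self-contained sketch of the filter approach (sample data copied from the question, variable names chosen for illustration, pandas 0.12 or later assumed):

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# filter calls the function once per group and keeps every row of the
# groups for which it returns True.
kept = df.groupby("ID").filter(lambda g: g["Name"].nunique() == 1)
print(kept)
```

The rows with ID 4 are dropped because that group contains two different names; the two identical rows with ID 3 are both kept.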
A general remark: sometimes, of course, you do want to know the length of the group, but I find that size is a safer choice than len, which has been troublesome for me in some cases.
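The remark does not say which problem with len is meant, so the following is only one possible illustration, on the sample data from the question: size() on the grouped object reports rows per group, whereas len() on the same object reports the number of groups.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

grouped = df.groupby("ID")

# size() returns a Series indexed by ID with the number of rows per group.
rows_per_id = grouped.size()

# len() on the groupby object is the number of groups, not a per-group count.
n_groups = len(grouped)

print(rows_per_id)
print(n_groups)
```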
You could first drop the duplicates:
In [11]: df = df.drop_duplicates()
In [12]: df
Out[12]:
Name ID
0 A 1
1 B 2
2 C 3
4 E 4
5 F 4
Then groupby ID and only consider those with one element:
In [13]: g = df.groupby('ID')
In [14]: size = (g.size() == 1)
In [15]: size
Out[15]:
ID
1 True
2 True
3 True
4 False
dtype: bool
In [16]: size[size].index
Out[16]: Int64Index([1, 2, 3], dtype=int64)
In [17]: df['ID'].isin(size[size].index)
Out[17]:
0 True
1 True
2 True
4 False
5 False
Name: ID, dtype: bool
And boolean index by this:
In [18]: df[df['ID'].isin(size[size].index)]
Out[18]:
Name ID
0 A 1
1 B 2
2 C 3
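Note that the result above retains only one row for ID 3, because the duplicate was dropped from df itself. If every row for ID 3 should be kept, as the question asks, one possible variation is to deduplicate into a separate frame and then filter the original. This is a sketch on the sample data, assuming a recent pandas where drop_duplicates accepts subset; the names deduped, sizes and good_ids are illustrative.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# Deduplicate on (Name, ID) into a separate frame so df keeps all its rows.
deduped = df.drop_duplicates(subset=["Name", "ID"])

# After deduplication, an ID with exactly one row has a single Name.
sizes = deduped.groupby("ID").size()
good_ids = sizes[sizes == 1].index

# Filter the original frame, which keeps both rows for ID 3.
result = df[df["ID"].isin(good_ids)]
print(result)
```

Restricting drop_duplicates to the Name and ID columns matters when the real frame has more columns, since rows that differ elsewhere would otherwise not count as duplicates.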