Remove rows from pandas dataframe with condition
I have a dataframe that looks like this:
import pandas as pd
### create toy data set
data = [[1111,'10/1/2021',21,123],
[1111,'10/1/2021',-21,123],
[1111,'10/1/2021',21,123],
[2222,'10/2/2021',15,234],
[2222,'10/2/2021',15,234],
[3333,'10/3/2021',15,234],
[3333,'10/3/2021',15,234]]
df = pd.DataFrame(data,columns = ['Individual','date','number','cc'])
What I want to do is remove rows where Individual, date, and cc are the same, but number is a negative value in one case and a positive value in the other. For example, in the first three rows, I would remove rows 1 and 2 (because the 21 and -21 values are equal in absolute terms), but I don't want to remove row 3 (because I have already accounted for the negative value in row 2 by eliminating row 1). Also, I don't want to remove duplicated values if the corresponding number values are positive. I have tried a variety of duplicated() approaches, but just can't get it right.
Expected results would be:
Individual date number cc
0 1111 10/1/2021 21 123
1 2222 10/2/2021 15 234
2 2222 10/2/2021 15 234
3 3333 10/3/2021 15 234
4 3333 10/3/2021 15 234
Thus, the first two rows are removed, but not the third row, since the negative value is already accounted for.
Any assistance would be appreciated. I am trying to do this without a loop, but it may be unavoidable. It seems similar to this question, but I can't figure out how to make it work in my case, as I am trying to avoid loops.
I can't be sure since you did not post your expected output, but you could try the below. Create a separate df called n that contains the rows with a negative 'number', drop the 'number' column from it, and left-join it to the original with indicator=True.
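# keys (Individual, date, cc) of the rows whose 'number' is negative
n = df.loc[df.number.lt(0)].drop('number', axis=1)
# left-merge on the shared columns; _merge is 'both' for rows in those groups
df = pd.merge(df, n, how='left', indicator=True)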
>>> df
Individual date number cc _merge
0 1111 10/1/2021 21 123 both
1 1111 10/1/2021 -21 123 both
2 1111 10/1/2021 21 123 both
3 2222 10/2/2021 15 234 left_only
4 2222 10/2/2021 15 234 left_only
5 3333 10/3/2021 15 234 left_only
6 3333 10/3/2021 15 234 left_only
This will allow us to identify the Individual/date/cc groups that have a row with a negative 'number'.
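As a cross-check, the same groups can also be flagged without a merge. The sketch below is an alternative, not the approach used in this answer: it assumes the toy data from the question, the names keys and has_neg are made up for illustration, and only strictly negative values count as negative.

```python
import pandas as pd

# Toy data from the question
data = [[1111, '10/1/2021', 21, 123],
        [1111, '10/1/2021', -21, 123],
        [1111, '10/1/2021', 21, 123],
        [2222, '10/2/2021', 15, 234],
        [2222, '10/2/2021', 15, 234],
        [3333, '10/3/2021', 15, 234],
        [3333, '10/3/2021', 15, 234]]
df = pd.DataFrame(data, columns=['Individual', 'date', 'number', 'cc'])

# Hypothetical helper name: the columns that define a group
keys = ['Individual', 'date', 'cc']

# A group contains a negative 'number' exactly when its minimum is below zero.
# transform('min') broadcasts the group minimum back onto every row.
has_neg = df.groupby(keys)['number'].transform('min').lt(0)

# Rows belonging to groups that have at least one negative 'number'
print(df[has_neg])
```

Because nothing is merged here, rows are not multiplied when a group happens to contain more than one negative row, which can occur with the merge if n has repeated keys.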
Then you can locate the rows with 'both' in _merge, and only use those to perform a groupby.head(2), concatenating that with the rest of the df:
out = pd.concat([df.loc[df._merge.eq('both')].groupby(['Individual','date','cc']).head(2),
df.loc[df._merge.ne('both')]]).drop('_merge',axis=1)
Which prints:
Individual date number cc
0 1111 10/1/2021 21 123
1 1111 10/1/2021 -21 123
3 2222 10/2/2021 15 234
4 2222 10/2/2021 15 234
5 3333 10/3/2021 15 234
6 3333 10/3/2021 15 234
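Note that this output keeps the 21 / -21 pair and drops the third 1111 row, whereas the expected result in the question keeps a single 1111 row with 21. A minimal sketch of one loop-free way to get the question's expected result is to cancel each negative row against one positive row that has the same Individual, date, cc and absolute number. This is an alternative to the merge approach, not part of the original answer, and the helper names (keys, pair_keys, abs_number, is_neg, is_pos, rank, n_pairs) are hypothetical.

```python
import pandas as pd

# Toy data from the question
data = [[1111, '10/1/2021', 21, 123],
        [1111, '10/1/2021', -21, 123],
        [1111, '10/1/2021', 21, 123],
        [2222, '10/2/2021', 15, 234],
        [2222, '10/2/2021', 15, 234],
        [3333, '10/3/2021', 15, 234],
        [3333, '10/3/2021', 15, 234]]
df = pd.DataFrame(data, columns=['Individual', 'date', 'number', 'cc'])

keys = ['Individual', 'date', 'cc']

# Helper columns: absolute value plus 0/1 flags for the sign of 'number'
tmp = df.assign(abs_number=df['number'].abs(),
                is_neg=df['number'].lt(0).astype(int),
                is_pos=df['number'].gt(0).astype(int))

# Rows can cancel only if they share the keys and the absolute value
pair_keys = keys + ['abs_number']

# Per row: how many negatives and positives its cancelling group contains
n_neg = tmp.groupby(pair_keys)['is_neg'].transform('sum')
n_pos = tmp.groupby(pair_keys)['is_pos'].transform('sum')

# Number of (positive, negative) pairs that cancel: the smaller of the two counts
n_pairs = n_neg.where(n_neg <= n_pos, n_pos)

# Position of each row among rows with the same keys, absolute value and sign
rank = tmp.groupby(pair_keys + ['is_neg']).cumcount()

# The first n_pairs rows of each sign are cancelled; keep everything else
out = df.loc[rank.ge(n_pairs)].reset_index(drop=True)
print(out)
```

On the toy data this drops the first 21 and the -21 in the 1111 group, keeps the remaining 21, and leaves the all-positive duplicates for 2222 and 3333 untouched. Which of the matching positive rows is dropped (here, the earliest) is a design choice of the sketch.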