Remove rows from pandas dataframe with condition

I have a dataframe that looks like this:

import pandas as pd

### create toy data set
data = [[1111,'10/1/2021',21,123],
        [1111,'10/1/2021',-21,123],
        [1111,'10/1/2021',21,123],
        [2222,'10/2/2021',15,234],
        [2222,'10/2/2021',15,234],
        [3333,'10/3/2021',15,234],
        [3333,'10/3/2021',15,234]]

df = pd.DataFrame(data,columns = ['Individual','date','number','cc'])

What I want to do is remove rows where Individual, date, and cc are the same but number is negative in one row and positive in the other. For example, among the first three rows, I would remove rows 1 and 2 (because 21 and -21 are equal in absolute value), but I don't want to remove row 3 (because the negative value in row 2 has already been accounted for by eliminating row 1). Also, I don't want to remove duplicated rows if the corresponding number values are positive. I have tried a variety of duplicated() approaches, but just can't get it right.

Expected results would be:

  Individual       date  number   cc
0        1111  10/1/2021      21  123
1        2222  10/2/2021      15  234
2        2222  10/2/2021      15  234
3        3333  10/3/2021      15  234
4        3333  10/3/2021      15  234

Thus, the first two rows are removed, but not the third row, since the negative value is already accounted for.

Any assistance would be appreciated. I am trying to do this without a loop, but it may be unavoidable. It seems similar to this question, but I can't figure out how to make it work in my case, as I am trying to avoid loops.

I can't be sure since you did not post your expected output, but you could try the below. Create a separate df called n that contains the rows with a negative 'number', and join it to the original with indicator=True.

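# keys (Individual/date/cc) of the rows whose 'number' is strictly negative
n = df.loc[df.number.lt(0)].drop('number', axis=1)
# left-join on the shared key columns; _merge marks the groups that have a negative row
df = pd.merge(df, n, how='left', indicator=True)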

>>> df

   Individual       date  number   cc     _merge
0        1111  10/1/2021      21  123       both
1        1111  10/1/2021     -21  123       both
2        1111  10/1/2021      21  123       both
3        2222  10/2/2021      15  234  left_only
4        2222  10/2/2021      15  234  left_only
5        3333  10/3/2021      15  234  left_only
6        3333  10/3/2021      15  234  left_only

This will allow us to identify the Individual/date/cc groups that have a row with a negative 'number'.
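
One caveat, shown as a small sketch rather than as part of the approach above: if a group had more than one negative row, n would contain repeated keys and the left merge would repeat the matching rows of df. Dropping duplicate keys from n first avoids that. The extra negative row in this toy data is hypothetical and is not in the question's data.

```python
import pandas as pd

# toy data with a hypothetical second negative row in the 1111 group
data = [[1111, '10/1/2021', 21, 123],
        [1111, '10/1/2021', -21, 123],
        [1111, '10/1/2021', -21, 123],
        [2222, '10/2/2021', 15, 234]]
df = pd.DataFrame(data, columns=['Individual', 'date', 'number', 'cc'])

# keys of the negative rows; the 1111 key now appears twice
n = df.loc[df.number.lt(0)].drop('number', axis=1)

# merging against repeated keys repeats every matching row of df
naive = pd.merge(df, n, how='left', indicator=True)

# merging against unique keys keeps one output row per row of df
deduped = pd.merge(df, n.drop_duplicates(), how='left', indicator=True)

print(len(df), len(naive), len(deduped))
```

With unique keys in n, the merged frame has the same number of rows as df, so _merge can be read as a per-row flag.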


Then you can locate the rows with 'both' in _merge, and only use those to perform a groupby.head(2), concatenating that with the rest of the df:

out = pd.concat([df.loc[df._merge.eq('both')].groupby(['Individual','date','cc']).head(2),
                 df.loc[df._merge.ne('both')]]).drop('_merge',axis=1)

Which prints:

   Individual       date  number   cc
0        1111  10/1/2021      21  123
1        1111  10/1/2021     -21  123
3        2222  10/2/2021      15  234
4        2222  10/2/2021      15  234
5        3333  10/3/2021      15  234
6        3333  10/3/2021      15  234
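
Note that the output above still contains the 21/-21 pair, whereas the expected result in the question drops that pair and keeps only the third row of the 1111 group. A loop-free sketch that reproduces the expected result follows. It is an alternative to the merge approach, not part of it, and it assumes that the k-th positive row of a group cancels the k-th negative row with the same absolute value. The helper columns abs_number, is_neg and rank are names chosen here for illustration.

```python
import pandas as pd

data = [[1111, '10/1/2021', 21, 123],
        [1111, '10/1/2021', -21, 123],
        [1111, '10/1/2021', 21, 123],
        [2222, '10/2/2021', 15, 234],
        [2222, '10/2/2021', 15, 234],
        [3333, '10/3/2021', 15, 234],
        [3333, '10/3/2021', 15, 234]]
df = pd.DataFrame(data, columns=['Individual', 'date', 'number', 'cc'])

keys = ['Individual', 'date', 'cc']

# helper columns: absolute value and sign of 'number'
work = df.assign(abs_number=df['number'].abs(), is_neg=df['number'].lt(0))

# position of each row among rows sharing the same keys, absolute value and sign
work['rank'] = work.groupby(keys + ['abs_number', 'is_neg']).cumcount()

# a (keys, abs_number, rank) bucket holds at most one positive and one negative row;
# a bucket of size 2 is therefore a matched +x/-x pair
pair_size = work.groupby(keys + ['abs_number', 'rank'])['number'].transform('size')

# keep only the rows that have no opposite-sign partner
out = df[pair_size.ne(2)].reset_index(drop=True)
print(out)
```

On the toy data this removes the first 21 together with the -21, keeps the second 21, and leaves the all-positive duplicates in the 2222 and 3333 groups untouched.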
