
How to drop duplicates in python if consecutive values are the same in two columns?

I have a dataframe like the one below:

A   B   C
1   8   23
2   8   22
3   9   45
4   9   45
5   6   12
6   4   10
7   11  12

I want to drop duplicates, keeping the first row of each consecutive run, but only when the value in C is also the same. E.g. here the value 9 in column B is repeated, and the corresponding values in column C are also repeated (45). In this case I want to retain the first occurrence.

Expected output:

A   B   C
1   8   23
2   8   22
3   9   45
5   6   12
6   4   10
7   11  12

I tried some group-by, but did not know how to drop the rows.

Code:

df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
test=df.groupby('consecutive',as_index=False).apply(lambda x: (x['B'].head(1),x.shape[0],
                                                       x['C'].iloc[-1] - x['C'].iloc[0]))

This group-by returns a series, but I want to drop the rows.

Use DataFrame.drop_duplicates on 2 columns:

df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
df = df.drop_duplicates(['consecutive','C'])
print (df)
   A   B   C  consecutive
0  1   8  23            1
1  2   8  22            1
2  3   9  45            2
4  5   6  12            3
5  6   4  10            4
6  7  11  12            5
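
The output above still carries the helper column consecutive. A minimal sketch, assuming the sample frame from the question, that keeps the run labels in a separate variable so the result has only the original columns (the names run_id and is_repeat are chosen here for illustration and are not part of the original answer):

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7],
    "B": [8, 8, 9, 9, 6, 4, 11],
    "C": [23, 22, 45, 45, 12, 10, 12],
})

# Label each run of equal consecutive B values with an increasing integer.
run_id = (df["B"] != df["B"].shift()).cumsum()

# Flag rows whose (run_id, C) pair has already appeared, without
# modifying df itself, then keep the remaining rows.
is_repeat = df.assign(run_id=run_id).duplicated(["run_id", "C"])
result = df[~is_repeat]
print(result)
```

Because the mask is computed on a temporary copy made by assign, df keeps its original three columns.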

Or chain both conditions with | for bitwise OR:

df = df[(df['B'] != df['B'].shift()) | (df['C'] != df['C'].shift())]
print (df)
   A   B   C
0  1   8  23
1  2   8  22
2  3   9  45
4  5   6  12
5  6   4  10
6  7  11  12

A one-liner to filter out such records is:

df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]

Here we check whether the values in columns ['B', 'C'] are the same as in the previous row (the shifted frame); if they are not, we retain the row:

>>> df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
   A   B   C
0  1   8  23
1  2   8  22
2  3   9  45
4  5   6  12
5  6   4  10
6  7  11  12

This generalizes easily, since we can define a function that operates on an arbitrary number of columns:

def drop_consecutive_duplicates(df, *colnames):
    dff = df[list(colnames)]
    return df[(dff.shift() != dff).any(axis=1)]

You can then filter with:

drop_consecutive_duplicates(df, 'B', 'C')
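
As a rough check of what "consecutive" means for this helper, the sketch below runs it on a small made-up frame (not the question's data) in which the pair (9, 45) appears in two separate runs. The function body is repeated so the snippet runs on its own:

```python
import pandas as pd

def drop_consecutive_duplicates(df, *colnames):
    # Compare the selected columns with the previous row; keep a row
    # if at least one of those columns changed.
    dff = df[list(colnames)]
    return df[(dff.shift() != dff).any(axis=1)]

# Hypothetical data: (9, 45) occurs in rows 0-1 and again in rows 3-4.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [9, 9, 6, 9, 9],
    "C": [45, 45, 12, 45, 45],
})

out = drop_consecutive_duplicates(df, "B", "C")
print(out)
```

Only the second row of each run is removed, so the later (9, 45) run still contributes its first row.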

You can compute a boolean series marking the rows to drop, and then drop them:

to_drop = (df['B'] == df['B'].shift())&(df['C']==df['C'].shift())
df = df[~to_drop]

It gives the expected result:

   A   B   C
0  1   8  23
1  2   8  22
2  3   9  45
4  5   6  12
5  6   4  10
6  7  11  12

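A simple way is to check the difference between consecutive rows of B and C and drop a row when both differences are 0 (a repeated value). The code is: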

 df[ ~((df.B.diff()==0) & (df.C.diff()==0)) ]

Code (note that this drops every repeated (B, C) pair, whether or not the repeats are adjacent):

df1 = df.drop_duplicates(subset=['B', 'C'])  

Result

   A   B   C
0  1   8  23
1  2   8  22
2  3   9  45
4  5   6  12
5  6   4  10
6  7  11  12
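
On the question's sample the result happens to match the expected output, because no (B, C) pair recurs non-consecutively. A small sketch on a made-up frame (hypothetical data, with variable names chosen here for illustration) shows where drop_duplicates with subset and a consecutive-only filter diverge:

```python
import pandas as pd

# Hypothetical data: (9, 45) appears in rows 0-1 and again, after a gap, in row 3.
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [9, 9, 6, 9],
    "C": [45, 45, 12, 45],
})

# Global de-duplication: every later occurrence of a (B, C) pair is dropped.
global_dedup = df.drop_duplicates(subset=["B", "C"])

# Consecutive-only de-duplication: a row is dropped only if it equals
# the row directly above it in both B and C.
is_repeat = (df["B"] == df["B"].shift()) & (df["C"] == df["C"].shift())
consecutive_dedup = df[~is_repeat]

print(global_dedup)
print(consecutive_dedup)
```

Row 3 survives the consecutive-only filter but not the global one, which is the behavior the question asks to preserve.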

If I understand your question correctly, given the following dataframe:

import pandas as pd
df = pd.DataFrame({'B': [8, 8, 9, 9, 6, 4, 11], 'C': [22, 23, 45, 45, 12, 10, 12],})

This one-liner solves the problem with the drop_duplicates method (it removes repeated (B, C) pairs regardless of whether the rows are adjacent):

df.drop_duplicates(['B', 'C'])

It gives the expected result:

    B   C
0   8  22
1   8  23
2   9  45
4   6  12
5   4  10
6  11  12

Using diff, ne and any over axis=1:

Note: this method only works for numeric columns.

m = df[['B', 'C']].diff().ne(0).any(axis=1)
print(df[m])

Output

   A   B   C
0  1   8  23
1  2   8  22
2  3   9  45
4  5   6  12
5  6   4  10
6  7  11  12

Details

df[['B', 'C']].diff()

     B     C
0  NaN   NaN
1  0.0  -1.0
2  1.0  23.0
3  0.0   0.0
4 -3.0 -33.0
5 -2.0  -2.0
6  7.0   2.0

Then we check whether any of the values in a row is not equal (ne) to 0:

df[['B', 'C']].diff().ne(0).any(axis=1)

0     True
1     True
2     True
3    False
4     True
5     True
6     True
dtype: bool
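
Since diff is limited to numeric columns, one possible variant for string columns compares each row with its shifted predecessor using ne instead. This is a sketch on made-up data, assuming pandas' default string handling, and is not part of the original answer:

```python
import pandas as pd

# Hypothetical frame with string-valued B and C.
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": ["x", "x", "y", "y"],
    "C": ["p", "q", "r", "r"],
})

cols = df[["B", "C"]]

# True where at least one of B or C differs from the previous row.
# The first row is compared with missing values, so it is always kept.
m = cols.ne(cols.shift()).any(axis=1)
result = df[m]
print(result)
```

The mask plays the same role as the diff-based one above, but relies only on equality, so it does not need subtraction to be defined for the column type.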

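Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you reproduce them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.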
