Python Pandas Drop Duplicates: keep second to last

Question

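What is the most efficient way to select the second-to-last row of each set of duplicates in a pandas DataFrame?

For example, I basically want to do this operation (spelled take_last=True in older pandas versions):

df = df.drop_duplicates(['Person', 'Question'], keep='last')

but with something like this instead (a hypothetical parameter):

df = df.drop_duplicates(['Person','Question'],take_second_last=True)

Abstract question: how do you choose which duplicate to keep if the one you want is neither the max nor the min?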

Answer 1

With groupby.apply:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': np.arange(10), 'C': np.arange(10)})

df
Out: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  1  3  3
4  2  4  4
5  2  5  5
6  2  6  6
7  3  7  7
8  3  8  8
9  4  9  9

# Keep single-row groups unchanged; otherwise take the second-to-last row of the group.
(df.groupby('A', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out: 
   A  B  C
2  1  2  2
5  2  5  5
7  3  7  7
9  4  9  9

With a different DataFrame, where the subset consists of two columns:

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4], 
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})

df
Out: 
   A  B  C
0  1  1  0
1  1  1  1
2  1  2  2
3  1  1  3
4  2  2  4
5  2  2  5
6  2  2  6
7  3  3  7
8  3  3  8
9  4  4  9

(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out: 
   A  B  C
1  1  1  1
2  1  2  2
5  2  2  5
7  3  3  7
9  4  4  9
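The `len(x)==1` branch in the lambda above matters because a plain "second-to-last" selection has nothing to return for a group with a single row. As an illustration (not part of the original answer), the sketch below uses `groupby.nth(-2)` on a small made-up frame. The shape of the `nth` result differs between pandas versions, so only the `C` values are inspected.

```python
import pandas as pd

# Illustrative data: group A == 4 has a single row.
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4], 'C': range(10)})

# nth(-2) takes the second-to-last row of each group and
# silently skips groups that have fewer than two rows.
via_nth = df.groupby('A').nth(-2)

picked = sorted(via_nth['C'].tolist())
print(picked)  # [2, 5, 7]  (no row for A == 4)
```

If single-row groups should be kept, as in the question, `nth(-2)` alone is therefore not enough.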

Answer 2

You could use groupby/tail(2) to take the last 2 items of each group, then groupby/head(1) to take the first item from that tail:

df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

If there is only one item in the group, tail(2) returns just that one item.

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

expected = (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)
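Since the frame above is random, the two steps can be hard to follow by eye. Here is a small deterministic sketch, with made-up data, that shows the intermediate result of `tail(2)` and how a single-row group passes through both steps.

```python
import pandas as pd

# Made-up data: group (1, 1) has three rows, group (2, 2) has one.
small = pd.DataFrame({'A': [1, 1, 1, 2],
                      'B': [1, 1, 1, 2],
                      'C': [10, 11, 12, 13]})

# Step 1: last two rows of each group (only one row for the single-row group).
last_two = small.groupby(['A', 'B']).tail(2)
print(last_two.index.tolist())     # [1, 2, 3]

# Step 2: first row of each of those tails, i.e. the second-to-last overall.
second_last = last_two.groupby(['A', 'B']).head(1)
print(second_last.index.tolist())  # [1, 3]
```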

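Built-in groupby methods (such as tail and head) are often much faster than groupby/apply with a custom Python function. This is especially true if there are a lot of groups: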

In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop

In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop
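The `%timeit` lines above need IPython. A rough way to repeat the comparison in a plain script is sketched below using the standard `timeit` module. The helper names `with_builtins` and `with_apply`, the seed, and the repeat count are choices made for this sketch, not part of the original answer.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so runs are comparable
df = pd.DataFrame(rng.integers(10, size=(100, 3)), columns=list('ABC'))

def with_builtins(frame):
    # Built-in groupby methods only.
    return frame.groupby(['A', 'B']).tail(2).groupby(['A', 'B']).head(1)

def with_apply(frame):
    # Custom Python function called once per group.
    picked = frame.groupby(['A', 'B'], as_index=False).apply(
        lambda x: x if len(x) == 1 else x.iloc[[-2]])
    return picked.reset_index(level=0, drop=True)

runs = 20
for func in (with_builtins, with_apply):
    seconds = timeit.timeit(lambda: func(df), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```

Absolute numbers depend on the machine and the pandas version, and recent pandas releases may emit a warning for the apply variant.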

Alternatively, ayhan suggests a nice improvement:

alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)

In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop
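For the abstract version of the question (keeping an arbitrary n-th row from the end), one possible generalization is sketched below. It is not from either answer: the helper `keep_nth_from_last` is a hypothetical name, and it relies on `groupby.cumcount` to number rows within each group. Groups with fewer than n rows fall back to their first row, which matches the behaviour above for n=2.

```python
import numpy as np
import pandas as pd

def keep_nth_from_last(df, subset, n=2):
    """Keep the n-th row from the end of each duplicate group (hypothetical helper)."""
    groups = df.groupby(subset, sort=False)
    from_start = groups.cumcount()               # 0 for the first row of a group
    from_end = groups.cumcount(ascending=False)  # 0 for the last row of a group
    size = from_start + from_end + 1             # group size, repeated on each row
    # Position (counted from the end) of the row to keep; capped for small groups.
    target = np.minimum(n, size) - 1
    return df[from_end == target]

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4],
                   'C': np.arange(10)})

result = keep_nth_from_last(df, ['A', 'B'], n=2)
print(result.index.tolist())  # [1, 2, 5, 7, 9]
```

With n=1 this reduces to keeping the last row of each group.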

2 answers

Answer 1: 11, accepted, 2016-08-15 14:46:50

Answer 2: 3, 2016-08-16 00:45:19
