Pandas: Drop consecutive duplicates
What's the most efficient way to drop only consecutive duplicates in pandas?
drop_duplicates gives this:
In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [4]: a.drop_duplicates()
Out[4]:
1 1
2 2
4 3
dtype: int64
But I want this:
In [4]: a.something()
Out[4]:
1 1
2 2
4 3
5 2
dtype: int64
a.loc[a.shift(-1) != a]
Out[3]:
1 1
3 2
4 3
5 2
dtype: int64
The above uses a boolean criterion: we compare the Series against itself shifted by -1 rows to create the mask.
Another method is to use diff:
In [82]:
a.loc[a.diff() != 0]
Out[82]:
1 1
2 2
4 3
5 2
dtype: int64
But this is slower than the original method if you have a large number of rows.
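Speed aside, there is one practical difference that a small sketch can illustrate. diff subtracts neighbouring values, so it only makes sense for numeric data, while a shift comparison only tests equality and can also be applied to strings. The string Series `s` below is made-up sample data, not part of the question.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# On numeric data, both masks select the same rows.
by_diff = a.loc[a.diff() != 0]
by_shift = a.loc[a != a.shift()]
print(by_diff.equals(by_shift))

# The equality-based mask also works on strings (hypothetical sample data).
s = pd.Series(["a", "a", "b", "a"])
deduped = s.loc[s != s.shift()]
print(deduped.tolist())
```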
Update
Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() since the default period is 1. This returns the first value of each consecutive run:
In [87]:
a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64
Note the difference in index values, thanks @BjarkeEbert!
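To make the index difference concrete, here is a small side-by-side sketch using the Series from the question. The variable names `keep_first` and `keep_last` are only illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# Compare each value with the previous one: the first row of each run survives.
keep_first = a.loc[a.shift() != a]

# Compare each value with the next one: the last row of each run survives.
keep_last = a.loc[a.shift(-1) != a]

print(keep_first.index.tolist())  # [1, 2, 4, 5]
print(keep_last.index.tolist())   # [1, 3, 4, 5]
```

The values are the same in both cases; only the surviving index labels differ.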
Here is an update that makes it work with multiple columns. Use ".any(axis=1)" to combine the results from each column:
cols = ["col1","col2","col3"]
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
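As a rough illustration of how ".any(axis=1)" combines the per-column comparisons, here is a sketch on a made-up frame (the data and the name `changed` are assumptions for the example). A row is kept as soon as at least one of the columns differs from the row before it.

```python
import pandas as pd

# Hypothetical sample data; rows 1 and 4 repeat the row directly above them.
df = pd.DataFrame({
    "col1": [1, 1, 1, 2, 2],
    "col2": ["x", "x", "y", "y", "y"],
    "col3": [0.5, 0.5, 0.5, 0.5, 0.5],
})
cols = ["col1", "col2", "col3"]

# True where any of the columns changed relative to the previous row.
changed = (df[cols].shift() != df[cols]).any(axis=1)
de_dup = df.loc[changed]

print(de_dup.index.tolist())  # [0, 2, 3]
```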
Since we are going for the most efficient way, i.e. performance, let's use array data to leverage NumPy. We will take two slices offset by one element and compare them, similar to the shifting method discussed earlier in @EdChum's post. But with NumPy slicing we end up with an array that is one element shorter, so we need to concatenate a True element at the start to select the first element. Hence we would have an implementation like so -
import numpy as np

def drop_consecutive_duplicates(a):
    ar = a.values
    # Prepend True so that the first element is always kept.
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]
Sample run -
In [149]: a
Out[149]:
1 1
2 2
3 2
4 3
5 2
dtype: int64
In [150]: drop_consecutive_duplicates(a)
Out[150]:
1 1
2 2
4 3
5 2
dtype: int64
Timings on large arrays, comparing against @EdChum's solution -
In [142]: a = pd.Series(np.random.randint(1,5,(1000000)))
In [143]: %timeit a.loc[a.shift() != a]
100 loops, best of 3: 12.1 ms per loop
In [144]: %timeit drop_consecutive_duplicates(a)
100 loops, best of 3: 11 ms per loop
In [145]: a = pd.Series(np.random.randint(1,5,(10000000)))
In [146]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 136 ms per loop
In [147]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 114 ms per loop
So, there's some improvement!
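Before relying on timings like these, it may be worth confirming that both approaches select the same rows. The following is a minimal check on random data; the seed and array size are arbitrary choices for the sketch.

```python
import numpy as np
import pandas as pd

def drop_consecutive_duplicates(a):
    ar = a.values
    # Prepend True so that the first element is always kept.
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]

# Arbitrary seed and size, chosen only for this check.
rng = np.random.default_rng(0)
a = pd.Series(rng.integers(1, 5, size=1000))

expected = a.loc[a.shift() != a]
result = drop_consecutive_duplicates(a)
print(result.equals(expected))
```

Note that the mask always has at least one element, so an empty Series would need a separate guard.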
Get a major boost for values only!
If only the values are needed, we could get a major boost by simply indexing into the array data, like so -
def drop_consecutive_duplicates(a):
    ar = a.values
    return ar[np.concatenate(([True], ar[:-1] != ar[1:]))]
Sample run -
In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [171]: drop_consecutive_duplicates(a)
Out[171]: array([1, 2, 3, 2])
Timings -
In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))
In [174]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 137 ms per loop
In [175]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 61.3 ms per loop
For other Stack explorers, building off johnml1135's answer above: this drops consecutive duplicates judged on several columns, while keeping all of the columns in the result. When the dataframe is sorted, it keeps the first row but drops the second row if the "cols" match, even if there are more columns with non-matching information.
cols = ["col1","col2","col3"]
df = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]
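A small sketch of that behaviour, on made-up data: the "note" column is a hypothetical extra column that is not listed in cols, so it is ignored when deciding which rows repeat, but it is still present in the result.

```python
import pandas as pd

# Hypothetical sample data; "note" is not part of the comparison.
df = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": ["a", "a", "a"],
    "col3": [0.1, 0.1, 0.1],
    "note": ["first", "second", "third"],
})
cols = ["col1", "col2", "col3"]

# Row 1 matches row 0 on cols, so it is dropped even though "note" differs.
result = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]

print(result["note"].tolist())  # ['first', 'third']
```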
Here is a function that handles both pd.Series and pd.DataFrame. You can mask or drop, choose the axis, and finally choose to drop with 'any' or 'all' 'NaN'. It is not optimized in terms of computation time, but it has the advantage of being robust and pretty clear.
import numpy as np
import pandas as pd

# To mask/drop successive values in pandas
def Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                             keep_first=True,
                                             axis=0, how='all'):
    '''
    Function built with the help of:
    1) https://stackoverflow.com/questions/48428173/how-to-change-consecutive-repeating-values-in-pandas-dataframe-series-to-nan-or
    2) https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates

    Input:
    df should be a pandas.DataFrame or a pandas.Series
    Output:
    df or ts with masked or dropped values
    '''
    # Mask keeping the first occurrence
    if keep_first:
        df = df.mask(df.shift(1) == df)
    # Mask including the first occurrence
    else:
        df = df.mask((df.shift(1) == df) | (df.shift(-1) == df))

    # Drop the values (e.g. rows are deleted)
    if drop:
        return df.dropna(axis=axis, how=how)
    # Only mask the values (e.g. become 'NaN')
    else:
        return df
Here is some test code to include in the script:
if __name__ == "__main__":

    # With time series
    print("With time series:\n")
    ts = pd.Series([1,1,2,2,3,2,6,6,float('nan'), 6,6,float('nan'),float('nan')],
                   index=[0,1,2,3,4,5,6,7,8,9,10,11,12])

    print("#Original ts:")
    print(ts)

    print("\n## 1) Mask keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False,
                                                   keep_first=True))

    print("\n## 2) Mask including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False,
                                                   keep_first=False))

    print("\n## 3) Drop keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True,
                                                   keep_first=True))

    print("\n## 4) Drop including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True,
                                                   keep_first=False))

    # With dataframes
    print("With dataframe:\n")
    df = pd.DataFrame(np.random.randn(15, 3))
    df.iloc[4:9,0]=40
    df.iloc[8:15,1]=22
    df.iloc[8:12,2]=0.23

    print("#Original df:")
    print(df)

    print("\n## 5) Mask keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                                   keep_first=True))

    print("\n## 6) Mask including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                                   keep_first=False))

    print("\n## 7) Drop 'any' keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=True,
                                                   how='any'))

    print("\n## 8) Drop 'all' keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=True,
                                                   how='all'))

    print("\n## 9) Drop 'any' including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=False,
                                                   how='any'))

    print("\n## 10) Drop 'all' including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=False,
                                                   how='all'))
And here is the expected result:
With time series:
#Original ts:
0 1.0
1 1.0
2 2.0
3 2.0
4 3.0
5 2.0
6 6.0
7 6.0
8 NaN
9 6.0
10 6.0
11 NaN
12 NaN
dtype: float64
## 1) Mask keeping the first occurence:
0 1.0
1 NaN
2 2.0
3 NaN
4 3.0
5 2.0
6 6.0
7 NaN
8 NaN
9 6.0
10 NaN
11 NaN
12 NaN
dtype: float64
## 2) Mask including the first occurence:
0 NaN
1 NaN
2 NaN
3 NaN
4 3.0
5 2.0
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
dtype: float64
## 3) Drop keeping the first occurence:
0 1.0
2 2.0
4 3.0
5 2.0
6 6.0
9 6.0
dtype: float64
## 4) Drop including the first occurence:
4 3.0
5 2.0
dtype: float64
With dataframe:
#Original df:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 40.000000 -0.470958 -0.339213
6 40.000000 1.613524 0.271641
7 40.000000 -1.810958 -1.568372
8 40.000000 22.000000 0.230000
9 -0.296557 22.000000 0.230000
10 -0.921238 22.000000 0.230000
11 -0.170195 22.000000 0.230000
12 1.460457 22.000000 -0.295418
13 0.307825 22.000000 -0.759131
14 0.287392 22.000000 0.378315
## 5) Mask keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN 22.000000 0.230000
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 6) Mask including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 NaN 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN NaN NaN
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 7) Drop 'any' keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
## 8) Drop 'all' keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN 22.000000 0.230000
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 9) Drop 'any' including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
## 10) Drop 'all' including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 NaN 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
Here's a variant of EdChum's answer that treats consecutive NaNs as duplicates, too:
def remove_consecutive_duplicates_and_nans(s):
    # By default, `shift` uses NaN as a fill value, which breaks our
    # removal of consecutive NaNs. Hence we use a different sentinel
    # object instead.
    shifted = s.astype(object).shift(-1, fill_value=object())
    return s.loc[
        (shifted != s)
        & ~(shifted.isna() & s.isna())
    ]
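Because the function above compares against shift(-1), it keeps the last row of each run. If the first row of each run is wanted instead, one possible counterpart is sketched below. The function name `drop_consecutive_duplicates_nan_aware` and the sample data are made up for this example, and this is a different formulation from the sentinel-based approach above.

```python
import numpy as np
import pandas as pd

def drop_consecutive_duplicates_nan_aware(s):
    # Hypothetical helper: keeps the first row of each run, NaN runs included.
    prev = s.shift()
    # A row repeats its predecessor if the values are equal or both are NaN.
    same = (s == prev) | (s.isna() & prev.isna())
    if len(same) > 0:
        # The first row has no predecessor, so it is never a repeat.
        same.iloc[0] = False
    return s.loc[~same]

s = pd.Series([1.0, 1.0, np.nan, np.nan, 2.0, np.nan])
result = drop_consecutive_duplicates_nan_aware(s)
print(result.index.tolist())  # [0, 2, 4, 5]
```

The explicit reset of the first element matters when the Series starts with NaN, since shift() also fills the first position with NaN.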
Create a new column:
df['match'] = df.col1.eq(df.col1.shift())
Then:
df = df[df['match']==False]
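If the extra 'match' column is not wanted in the result, the same filter can be applied directly as a mask. This is a small sketch on made-up data; the "other" column is only there to show which rows survive.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"col1": [1, 1, 2, 2, 1], "other": list("abcde")})

# Keep rows whose col1 differs from the previous row, without a helper column.
result = df[df["col1"].ne(df["col1"].shift())]

print(result["other"].tolist())  # ['a', 'c', 'e']
```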