Remove duplicates from pandas series based on condition
I have a Pandas Series as:
increased 1.691759
increased 1.601759
reports 1.881759
reports 1.491759
president 1.386294
president 1.791759
exclusive 1.381759
exclusive 1.291759
bank 1.386294
bank 1.791759
........ ........
........ .......
I just want to remove the duplicate words from the series, keeping for each word the entry with the higher numeric value. So, the expected output:
increased 1.691759
reports 1.881759
president 1.791759
exclusive 1.381759
bank 1.791759
........ ........
........ .......
I have tried it by converting the series into a pandas DataFrame, and it works fine. But that would be a time-consuming process, as I have a large series. So I want to process the existing series only.
You can use drop_duplicates after you sort col2. drop_duplicates keeps the first occurrence by default, so if you sort by col2 so that the largest value comes first, it will keep the largest:
df.sort_values('col2', ascending=False).drop_duplicates('col1')
col1 col2
2 reports 1.881759
5 president 1.791759
9 bank 1.791759
0 increased 1.691759
6 exclusive 1.381759
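To make this concrete, here is a self-contained sketch that rebuilds the question's data as a DataFrame (the column names col1/col2 are the ones the answer assumes) and applies the sort-then-drop approach:

```python
import pandas as pd

# Rebuild the question's data: col1 = word, col2 = value.
df = pd.DataFrame({
    'col1': ['increased', 'increased', 'reports', 'reports',
             'president', 'president', 'exclusive', 'exclusive',
             'bank', 'bank'],
    'col2': [1.691759, 1.601759, 1.881759, 1.491759,
             1.386294, 1.791759, 1.381759, 1.291759,
             1.386294, 1.791759],
})

# Sort so the largest col2 comes first, then keep the first row per word.
result = df.sort_values('col2', ascending=False).drop_duplicates('col1')
print(result)
```

Each word now appears once, paired with its largest value.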
Alternative using groupby and tail:
Another way would be to do this:
df.sort_values('col2').groupby('col1').tail(1)
col1 col2
6 exclusive 1.381759
0 increased 1.691759
5 president 1.791759
9 bank 1.791759
2 reports 1.881759
Edit: Based on your comment, to convert back to a series for further use you can do:
df.sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']
col1
reports 1.881759
president 1.791759
bank 1.791759
increased 1.691759
exclusive 1.381759
Name: col2, dtype: float64
Or do a groupby directly on the series (but this is slower, see benchmarks):
s.sort_values().groupby(level=0).tail(1)
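For reference, a minimal self-contained run of the Series-only approach on the question's data might look like the following (grouping by index label via level=0, which groups on the words themselves and stays correct after the sort re-orders the rows):

```python
import pandas as pd

# The question's Series: duplicate index labels (words) with float values.
s = pd.Series(
    [1.691759, 1.601759, 1.881759, 1.491759, 1.386294,
     1.791759, 1.381759, 1.291759, 1.386294, 1.791759],
    index=['increased', 'increased', 'reports', 'reports', 'president',
           'president', 'exclusive', 'exclusive', 'bank', 'bank'],
)

# Sort ascending so the largest value per word comes last,
# then keep the last row of each index-label group.
deduped = s.sort_values().groupby(level=0).tail(1)
print(deduped)
```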
Benchmark
I tested this with a Series of length 1000000, and even with transforming it to a DataFrame and back to a Series, it takes less than a second. You might be able to find a faster way without transforming, but this isn't so bad IMO.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.choice(['increased', 'reports', 'president', 'exclusive', 'bank'], 1000000), 'col2': np.random.randn(1000000)})
s = pd.Series(df.set_index('col1').col2)
>>> s.head()
col1
president 0.600691
increased 1.752238
president -1.409425
bank 0.349149
reports 0.596207
Name: col2, dtype: float64
>>> len(s)
1000000
import timeit
def test(s=s):
    return s.to_frame().reset_index().sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']
>>> timeit.timeit(test, number=10) / 10
0.685569432300008
Applying groupby directly on a Series is slower:
def gb_test(s=s):
    return s.sort_values().groupby(level=0).tail(1)
>>> timeit.timeit(gb_test, number=10) / 10
0.7673859989999983
I'm not sure if this method will work on a Pandas DataFrame, but you can try using the set() function. The set() function removes all duplicates.
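As a sketch of that idea in plain Python: a bare set() only tells you which words repeat, it cannot say which duplicate's value to keep, so a dict is used here instead to remember the largest value per word (the pair list mirrors the question's data):

```python
# Keep the maximum value seen for each word using a plain dict.
pairs = [
    ('increased', 1.691759), ('increased', 1.601759),
    ('reports', 1.881759), ('reports', 1.491759),
    ('president', 1.386294), ('president', 1.791759),
    ('exclusive', 1.381759), ('exclusive', 1.291759),
    ('bank', 1.386294), ('bank', 1.791759),
]

best = {}
for word, value in pairs:
    # Keep the larger value when a word repeats.
    if word not in best or value > best[word]:
        best[word] = value

print(best)
```

This avoids pandas entirely, though for a large series the vectorized pandas approaches above will generally be faster.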