
Remove duplicates from pandas series based on condition

I have a Pandas series as:

    increased   1.691759
    increased   1.601759
    reports     1.881759
    reports     1.491759
    president   1.386294
    president   1.791759
    exclusive   1.381759
    exclusive   1.291759
    bank        1.386294
    bank        1.791759
    ........    ........
    ........    .......

I just want to remove duplicate words from the series, retaining for each word the row with the higher numeric value. So, the expected output is:

increased   1.691759
reports     1.881759
president   1.791759
exclusive   1.381759
bank        1.791759
........    ........
........    .......

I have tried it by converting the series into a pandas dataframe and it works fine. But that would be a time-consuming process since I have a large series, so I want to do the processing on the existing series only.
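
For reference, the series above can be rebuilt like this (a minimal sketch using the values from the question, with the words sitting in the index):

import pandas as pd

# Minimal reproduction of the series in question: duplicate words in the
# index, one float value per row (values copied from the question).
s = pd.Series(
    [1.691759, 1.601759, 1.881759, 1.491759, 1.386294,
     1.791759, 1.381759, 1.291759, 1.386294, 1.791759],
    index=['increased', 'increased', 'reports', 'reports', 'president',
           'president', 'exclusive', 'exclusive', 'bank', 'bank'],
)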

You can use drop_duplicates after you sort by col2. drop_duplicates keeps the first occurrence by default, so if you sort by col2 with the largest values first, it will keep the largest:

df.sort_values('col2', ascending=False).drop_duplicates('col1')

        col1      col2
2    reports  1.881759
5  president  1.791759
9       bank  1.791759
0  increased  1.691759
6  exclusive  1.381759
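
This assumes the data already sits in a dataframe with columns col1 and col2; starting from the series, that frame can be built first (the column names here are just the labels this answer uses):

# Build the answer's df from the series; 'col1' and 'col2' are arbitrary labels.
df = s.rename_axis('col1').reset_index(name='col2')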

Alternative using groupby and tail:

Another way would be to do this:

df.sort_values('col2').groupby('col1').tail(1)

        col1      col2
6  exclusive  1.381759
0  increased  1.691759
5  president  1.791759
9       bank  1.791759
2    reports  1.881759

Edit: Based on your comment, to convert back to a series for further use you can do:

df.sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']

col1
reports      1.881759
president    1.791759
bank         1.791759
increased    1.691759
exclusive    1.381759
Name: col2, dtype: float64

Or do a groupby directly on the series (but this is slower, see benchmarks):

s.sort_values().groupby(s.index).tail(1)
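
A still shorter series-only variant (an addition of mine, not from the original answer) is to group on the index labels and take the per-word maximum; note that the result comes back sorted by word rather than by value:

# Group on the index (level 0) and keep the largest value for each word.
s.groupby(level=0).max()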

Benchmark

I tested this with a Series of length 1000000, and even with transforming it to a dataframe and back to a series, it takes less than a second. You might be able to find a faster way without transforming, but this isn't so bad IMO.

import numpy as np

df = pd.DataFrame({
    'col1': np.random.choice(['increased', 'reports', 'president', 'exclusive', 'bank'], 1000000),
    'col2': np.random.randn(1000000),
})

s = pd.Series(df.set_index('col1').col2)

>>> s.head()
col1
president    0.600691
increased    1.752238
president   -1.409425
bank         0.349149
reports      0.596207
Name: col2, dtype: float64
>>> len(s)
1000000

import timeit

def test(s = s):
    return s.to_frame().reset_index().sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']

>>> timeit.timeit(test, number=10) / 10
0.685569432300008

Applying groupby directly on a Series is slower:

def gb_test(s=s):
    return s.sort_values().groupby(s.index).tail(1)

>>> timeit.timeit(gb_test, number=10) / 10
0.7673859989999983

I'm not sure if this method will work on a Pandas dataframe, but you can try using the set() function. set() removes all duplicates.
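
Note, though, that set() on its own only deduplicates the labels and discards the values, so it cannot keep the larger number by itself. A plain-Python pass with a dict (a sketch, not part of this answer) would be needed to track the per-word maximum:

# set(s.index) yields the unique words but drops the numbers. To keep the
# larger value per word in plain Python, track a running maximum instead:
best = {}
for word, value in s.items():
    if value > best.get(word, float('-inf')):
        best[word] = value
result = pd.Series(best)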
