Remove duplicates from pandas series based on condition
I have a Pandas Series as:
increased 1.691759
increased 1.601759
reports 1.881759
reports 1.491759
president 1.386294
president 1.791759
exclusive 1.381759
exclusive 1.291759
bank 1.386294
bank 1.791759
........ ........
........ .......
I just want to remove the duplicate words from the series, keeping for each word the entry with the higher numeric value. So, the expected output:
increased 1.691759
reports 1.881759
president 1.791759
exclusive 1.381759
bank 1.791759
........ ........
........ .......
I have tried it by converting the series into a pandas DataFrame, and it works fine. But that would be a time-consuming process, as I have a large series. So I want to process the existing series only.
You can use drop_duplicates after you sort col2. drop_duplicates keeps the first occurrence by default, so if you sort by col2 so that the largest value comes first, it will keep the largest:
df.sort_values('col2', ascending=False).drop_duplicates('col1')
col1 col2
2 reports 1.881759
5 president 1.791759
9 bank 1.791759
0 increased 1.691759
6 exclusive 1.381759
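To make this concrete, here is a self-contained sketch that rebuilds the question's data as a DataFrame (the column names col1/col2 are the ones the answer assumes) and applies the sort-then-drop approach:

```python
import pandas as pd

# Rebuild the question's data: col1 = word, col2 = value.
df = pd.DataFrame({
    'col1': ['increased', 'increased', 'reports', 'reports',
             'president', 'president', 'exclusive', 'exclusive',
             'bank', 'bank'],
    'col2': [1.691759, 1.601759, 1.881759, 1.491759,
             1.386294, 1.791759, 1.381759, 1.291759,
             1.386294, 1.791759],
})

# Sort so the largest col2 comes first, then keep the first row per word.
result = df.sort_values('col2', ascending=False).drop_duplicates('col1')
print(result)
```

Each word now appears once, paired with its largest value.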
Alternative using groupby and tail:
Another way would be to do this:
df.sort_values('col2').groupby('col1').tail(1)
col1 col2
6 exclusive 1.381759
0 increased 1.691759
5 president 1.791759
9 bank 1.791759
2 reports 1.881759
Edit: Based on your comment, to convert back to a series for further use you can do:
df.sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']
col1
reports 1.881759
president 1.791759
bank 1.791759
increased 1.691759
exclusive 1.381759
Name: col2, dtype: float64
Or do a groupby directly on the series (but this is slower, see benchmarks):
s.sort_values().groupby(level=0).tail(1)
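For reference, a minimal self-contained run of the Series-only approach on the question's data might look like the following (grouping by index label via level=0, which groups on the words themselves and stays correct after the sort re-orders the rows):

```python
import pandas as pd

# The question's Series: duplicate index labels (words) with float values.
s = pd.Series(
    [1.691759, 1.601759, 1.881759, 1.491759, 1.386294,
     1.791759, 1.381759, 1.291759, 1.386294, 1.791759],
    index=['increased', 'increased', 'reports', 'reports', 'president',
           'president', 'exclusive', 'exclusive', 'bank', 'bank'],
)

# Sort ascending so the largest value per word comes last,
# then keep the last row of each index-label group.
deduped = s.sort_values().groupby(level=0).tail(1)
print(deduped)
```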
Benchmark
I tested this with a Series of length 1000000, and even with transforming it to a DataFrame and back to a Series, it takes less than a second. You might be able to find a faster way without transforming, but this isn't so bad IMO.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.choice(['increased', 'reports', 'president', 'exclusive', 'bank'], 1000000), 'col2': np.random.randn(1000000)})
s = pd.Series(df.set_index('col1').col2)
>>> s.head()
col1
president 0.600691
increased 1.752238
president -1.409425
bank 0.349149
reports 0.596207
Name: col2, dtype: float64
>>> len(s)
1000000
import timeit
def test(s=s):
    return s.to_frame().reset_index().sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']
>>> timeit.timeit(test, number=10) / 10
0.685569432300008
Applying groupby directly on a Series is slower:
def gb_test(s=s):
    return s.sort_values().groupby(level=0).tail(1)
>>> timeit.timeit(gb_test, number=10) / 10
0.7673859989999983
I'm not sure if this method will work on a Pandas DataFrame, but you can try using the set() function. The set() function removes all duplicates.
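As a sketch of that idea in plain Python: a bare set() only tells you which words repeat, it cannot say which duplicate's value to keep, so a dict is used here instead to remember the largest value per word (the pair list mirrors the question's data):

```python
# Keep the maximum value seen for each word using a plain dict.
pairs = [
    ('increased', 1.691759), ('increased', 1.601759),
    ('reports', 1.881759), ('reports', 1.491759),
    ('president', 1.386294), ('president', 1.791759),
    ('exclusive', 1.381759), ('exclusive', 1.291759),
    ('bank', 1.386294), ('bank', 1.791759),
]

best = {}
for word, value in pairs:
    # Keep the larger value when a word repeats.
    if word not in best or value > best[word]:
        best[word] = value

print(best)
```

This avoids pandas entirely, though for a large series the vectorized pandas approaches above will generally be faster.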