
Faster way to remove duplicated indices from pandas dataframe

What is an efficient way to remove duplicated rows from a pandas dataframe, where I would always like to keep the first value that is not NaN?

Example:

import pandas as pd
import numpy as np
data = pd.DataFrame({'a': [np.nan,np.nan,2,2,3,3,5],
                     'b': [2,1,1,1,np.nan,2,1]},
                     index=[pd.Timestamp('2018-03-01'), pd.Timestamp('2018-03-02'),
                            pd.Timestamp('2018-03-02'), pd.Timestamp('2018-03-02'),
                            pd.Timestamp('2018-03-03'), pd.Timestamp('2018-03-03'),
                            pd.Timestamp('2018-03-04')])

print(data)
>              a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  NaN  1.0  # take 'a' from next row, 'b' from this row
> 2018-03-02  2.0  1.0
> 2018-03-02  2.0  1.0
> 2018-03-03  3.0  NaN  # take 'a' from this row but 'b' from next row
> 2018-03-03  3.0  2.0
> 2018-03-04  5.0  1.0

# Is there something faster?
x = data.groupby(data.index).first()
print(x)

Should give:

>               a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  2.0  1.0
> 2018-03-03  3.0  2.0
> 2018-03-04  5.0  1.0

data.groupby(data.index).first() does the job, but it is ridiculously slow. For a dataframe of shape (5'730'238, 7) it took 40 minutes to remove the duplicates; for another table of shape (1'191'704, 339) it took 5 hours 20 minutes (datetime index, all columns integer/float). Note that the data might contain only a few duplicated values.

In another question, they suggest using data[~data.index.duplicated(keep='first')], but this does not handle NaNs in the desired way.
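
For illustration, here is what that mask gives on the example above (reusing the data frame from the question): it keeps the physically first row per timestamp, NaNs and all, instead of filling from the later duplicates.

# keep='first' keeps the physically first row for each index value,
# even if some of its columns are NaN
y = data[~data.index.duplicated(keep='first')]
print(y)
>               a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  NaN  1.0  # 'a' stays NaN instead of 2.0
> 2018-03-03  3.0  NaN  # 'b' stays NaN instead of 2.0
> 2018-03-04  5.0  1.0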

It doesn't really matter whether I choose first, last, mean, or whatever, as long as it is fast.

Is there a faster way than groupby, or is there a problem with my data that is making it slow?

I think the problem is not the groupby algorithm but memory consumption. For a million records with seven float columns it took 300 ms, and you can roughly double the speed by using resample with your frequency. But 2 million records with 400 float columns is about 7 GB of memory, and then it becomes a hell of swapping memory to disk. On my machine with 16 GB of physical memory and an SSD, it took 3 minutes to perform the groupby on a sample of that size (with a sorted index); most of the time was spent swapping memory to disk.
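
A minimal sketch of that resample variant, assuming your timestamps can be binned at a daily frequency as in the example above (Resampler.first(), like GroupBy.first(), returns the first non-NaN value per column):

x = data.resample('D').first()
print(x)
>               a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  2.0  1.0
> 2018-03-03  3.0  2.0
> 2018-03-04  5.0  1.0
# caveat: unlike groupby, resample also emits all-NaN rows for dates
# that are missing from the index entirely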

I suggest the following: sort your data, then split it and process it in batches. If you can split it before sorting, do that first.


1. Sort

This sorts the data as in your sample:

# move the datetime index into a regular column so it can be sorted
# together with the value columns (rows ordered by timestamp, NaNs first)
df_sample.reset_index(inplace=True)
df_sample.sort_values(by=df_sample.columns.tolist(), na_position='first', inplace=True)
# restore the datetime index (here the index column is named 'date')
df_sample.set_index('date', inplace=True)

2. Work in batches

This is not a safe splitting method, but it is enough for performance testing:

step = 20000
print(df_sample.shape)
# resample each 20000-row slice on its own, then concatenate the pieces
%timeit x = pd.concat([df_sample[s:s+step].resample('min').first() for s in range(0, df_sample.shape[0], step)], axis=0)

With splitting it took two minutes instead of three.
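
Note that a timestamp's duplicates can straddle a batch boundary, so the concatenated result may still contain a few duplicated bins. One way to handle this (a sketch, not part of the timing above) is a second first() pass over the already much smaller result:

# x is already mostly deduplicated, so this second pass is cheap even though
# it uses the same groupby that was too slow on the full-size data
x = x.groupby(x.index).first()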

