
Fastest way to compare all rows of a DataFrame

I have written a program (in Python 3.6) that tries to map the columns of a user's csv/excel to a template xls I have. So far so good, but part of this process has to be processing the user's data, which are contacts. For example, I want to delete duplicates, merge data, etc. To do this I need to compare every row to all other rows, which is costly. Every user csv I read has ~2000-4000 rows, but I want it to be efficient for even more rows. I have stored the data in a pd.DataFrame.

Is there a more efficient way to do the comparisons besides brute force?

Thanks

First, what code have you tried?

But deleting duplicates is very easy in pandas. Example below:

import pandas as pd
import numpy as np
# Creating the Test DataFrame below -------------------------------
dfp = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,5,np.nan],
                    'B' : [1,0,3,5,0,0,np.nan,9,0,0],
                    'C' : ['AA1233445','A9875', 'rmacy','Idaho Rx','Ab123455','TV192837','RX','Ohio Drugs','RX12345','USA Pharma'],
                    'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.nan],
                    'E' : ['Assign','Unassign','Assign','Ugly','Appreciate','Undo','Assign','Unicycle','Assign','Unicorn']})
print(dfp)

#Output Below----------------

     A    B           C            D           E
0  NaN  1.0   AA1233445     123456.0      Assign
1  NaN  0.0       A9875     123456.0    Unassign
2  3.0  3.0       rmacy    1234567.0      Assign
3  4.0  5.0    Idaho Rx   12345678.0        Ugly
4  5.0  0.0    Ab123455      12345.0  Appreciate
5  5.0  0.0    TV192837      12345.0        Undo
6  3.0  NaN          RX   12345678.0      Assign
7  1.0  9.0  Ohio Drugs  123456789.0    Unicycle
8  5.0  0.0     RX12345    1234567.0      Assign
9  NaN  0.0  USA Pharma          NaN     Unicorn


# Select the rows whose value in column A duplicates an earlier row.
# keep='first' marks every occurrence after the first as a duplicate,
# so the mask below is True for the repeats only.

df2 = dfp[dfp.duplicated(['A'], keep='first')]
#output
     A    B           C           D         E
1  NaN  0.0       A9875    123456.0  Unassign
5  5.0  0.0    TV192837     12345.0      Undo
6  3.0  NaN          RX  12345678.0    Assign
8  5.0  0.0     RX12345   1234567.0    Assign
9  NaN  0.0  USA Pharma         NaN   Unicorn
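Note that the mask above *selects* the duplicate rows; inverting it keeps only the first occurrence of each value. A minimal sketch of both idioms, using a trimmed version of the test frame above (only columns A and B, for brevity):

```python
import pandas as pd
import numpy as np

dfp = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 5, np.nan],
                    'B': [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0]})

# Invert the duplicated mask with ~ to keep only the first
# occurrence of each value in column A (NaNs count as equal).
deduped = dfp[~dfp.duplicated(['A'], keep='first')]

# drop_duplicates is the equivalent built-in shortcut.
deduped2 = dfp.drop_duplicates(subset=['A'], keep='first')

print(deduped.index.tolist())  # rows 0, 2, 3, 4, 7 survive
```

Both calls are hash-based under the hood, so they run in roughly linear time rather than comparing every row against every other row.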

If you want a new dataframe with no dupes, checked across all columns, use the tilde. The ~ operator is pandas' element-wise logical NOT: it flips every boolean in the mask, so ~dfp.duplicated(...) is True exactly for the rows that are not duplicates. Because duplicated is implemented with a hash table, this avoids the O(n²) brute-force comparison you were worried about. Official documentation here.

df2 = dfp[~dfp.duplicated(keep='first')]
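For the "merge data" part of your question, a groupby on the key column lets you combine near-duplicate contact rows without pairwise comparisons. A sketch under assumed column names (email, name, phone are hypothetical; your real contact fields will differ), taking the first non-null value per field within each group:

```python
import pandas as pd

# Hypothetical contacts table: rows 0 and 2 share the same email,
# each holding part of the contact's information.
contacts = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com'],
    'name':  ['Ann', 'Bob', None],
    'phone': [None, '555-2', '555-1'],
})

# Group on the key column and take the first non-null value per field.
# Grouping is a sort/hash operation, not an O(n^2) row-by-row scan.
merged = (contacts.groupby('email', as_index=False)
                  .agg(lambda s: s.dropna().iloc[0] if s.notna().any() else None))

print(merged)  # one row per email, fields filled from either source row
```

The lambda here is a deliberately simple merge policy; you could substitute any per-column rule (longest string, most recent timestamp, etc.) in its place.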
