
Fastest way to compare all rows of a DataFrame

I have written a program (in Python 3.6) that tries to map the columns of a user's csv/excel to a template xls I have. So far so good, but part of this process has to be processing the user's data, which are contacts. For example, I want to delete duplicates, merge data, etc. To do this I need to compare every row to all other rows, which is costly. Every user csv I read has ~2000-4000 rows, but I want it to be efficient for even more rows. I have stored the data in a pd.DataFrame.

Is there a more efficient way to do the comparisons besides brute force?

Thanks

First, what code have you tried?

But deleting duplicates is very easy in pandas. Example below:

import pandas as pd
import numpy as np
# Creating the Test DataFrame below -------------------------------
dfp = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,5,np.nan],
                    'B' : [1,0,3,5,0,0,np.nan,9,0,0],
                    'C' : ['AA1233445','A9875', 'rmacy','Idaho Rx','Ab123455','TV192837','RX','Ohio Drugs','RX12345','USA Pharma'],
                    'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.nan],
                    'E' : ['Assign','Unassign','Assign','Ugly','Appreciate','Undo','Assign','Unicycle','Assign','Unicorn']})
print(dfp)

#Output Below----------------

     A    B           C            D           E
0  NaN  1.0   AA1233445     123456.0      Assign
1  NaN  0.0       A9875     123456.0    Unassign
2  3.0  3.0       rmacy    1234567.0      Assign
3  4.0  5.0    Idaho Rx   12345678.0        Ugly
4  5.0  0.0    Ab123455      12345.0  Appreciate
5  5.0  0.0    TV192837      12345.0        Undo
6  3.0  NaN          RX   12345678.0      Assign
7  1.0  9.0  Ohio Drugs  123456789.0    Unicycle
8  5.0  0.0     RX12345    1234567.0      Assign
9  NaN  0.0  USA Pharma          NaN     Unicorn


# Select the rows whose value in column A duplicates an earlier row.
# keep='first' marks every occurrence after the first as a duplicate,
# so the mask below is True for the repeats only.

df2 = dfp[dfp.duplicated(['A'], keep='first')]
#output
     A    B           C           D         E
1  NaN  0.0       A9875    123456.0  Unassign
5  5.0  0.0    TV192837     12345.0      Undo
6  3.0  NaN          RX  12345678.0    Assign
8  5.0  0.0     RX12345   1234567.0    Assign
9  NaN  0.0  USA Pharma         NaN   Unicorn
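Note that the mask above *selects* the duplicate rows; inverting it keeps only the first occurrence of each value. A minimal sketch of both idioms, using a trimmed version of the test frame above (only columns A and B, for brevity):

```python
import pandas as pd
import numpy as np

dfp = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 5, np.nan],
                    'B': [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0]})

# Invert the duplicated mask with ~ to keep only the first
# occurrence of each value in column A (NaNs count as equal).
deduped = dfp[~dfp.duplicated(['A'], keep='first')]

# drop_duplicates is the equivalent built-in shortcut.
deduped2 = dfp.drop_duplicates(subset=['A'], keep='first')

print(deduped.index.tolist())  # rows 0, 2, 3, 4, 7 survive
```

Both calls are hash-based under the hood, so they run in roughly linear time rather than comparing every row against every other row.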

If you want a new dataframe with no dupes, checked across all columns, use the tilde. The ~ operator is pandas' element-wise logical NOT: it flips every boolean in the mask, so ~dfp.duplicated(...) is True exactly for the rows that are not duplicates. Because duplicated is implemented with a hash table, this avoids the O(n²) brute-force comparison you were worried about. Official documentation here.

df2 = dfp[~dfp.duplicated(keep='first')]
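For the "merge data" part of your question, a groupby on the key column lets you combine near-duplicate contact rows without pairwise comparisons. A sketch under assumed column names (email, name, phone are hypothetical; your real contact fields will differ), taking the first non-null value per field within each group:

```python
import pandas as pd

# Hypothetical contacts table: rows 0 and 2 share the same email,
# each holding part of the contact's information.
contacts = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com'],
    'name':  ['Ann', 'Bob', None],
    'phone': [None, '555-2', '555-1'],
})

# Group on the key column and take the first non-null value per field.
# Grouping is a sort/hash operation, not an O(n^2) row-by-row scan.
merged = (contacts.groupby('email', as_index=False)
                  .agg(lambda s: s.dropna().iloc[0] if s.notna().any() else None))

print(merged)  # one row per email, fields filled from either source row
```

The lambda here is a deliberately simple merge policy; you could substitute any per-column rule (longest string, most recent timestamp, etc.) in its place.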
