Merging two datasets in Python efficiently

What would anyone consider the most efficient way to merge two datasets using Python?

A little background - this code will take 100K+ records in the following format:

{user: aUser, transaction: UsersTransactionNumber}, ...

and using the following data

{transaction: aTransactionNumber, activationNumber: associatedActivationNumber}, ...

to create

{user: aUser, activationNumber: associatedActivationNumber}, ...

NB These are not Python dictionaries, just the closest thing to portraying record format cleanly.

So in theory, all I am trying to do is create a view of two lists (or tables) joined on a common key - at first this points me towards sets (unions etc.), but before I start learning these in depth, are they the way to go? So far I feel this could be implemented as:

  1. Create a list of dictionaries and iterate over the list, comparing the key each time; however, worst case this could run up to len(inputDict)*len(outputDict) comparisons <- Not sure?

  2. Manipulate the data as an in-memory SQLite table? Preferably not, although since there is no strict requirement for Python 2.4, it would make life easier. (A sketch of this option follows the list.)

  3. Some kind of Set based magic?
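For reference, option 2 might look something like this with the sqlite3 module bundled from Python 2.5 onwards - a rough sketch with made-up sample data; txn is used as the column name since transaction is a reserved word in SQL:

import sqlite3

# Sketch of option 2: load both datasets into an in-memory SQLite database
# and let SQL perform the join. Sample rows are made up for illustration.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (user TEXT, txn TEXT)')
conn.execute('CREATE TABLE activations (txn TEXT, activation TEXT)')
conn.executemany('INSERT INTO users VALUES (?, ?)',
                 [('aUser', 'T1'), ('bUser', 'T2')])
conn.executemany('INSERT INTO activations VALUES (?, ?)',
                 [('T1', 'A100'), ('T2', 'A200')])
rows = conn.execute('SELECT u.user, a.activation FROM users u '
                    'JOIN activations a ON u.txn = a.txn')
pairs = list(rows)
# pairs == [('aUser', 'A100'), ('bUser', 'A200')]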

Clarification

To summarise the whole purpose of this script: the actual data sets come from two different sources. The user and transaction numbers come in the form of a CSV, output from a performance test that is testing email activation code throughput. The second dataset comes from parsing the test mailboxes, which contain the transaction id and activation code. The output of this test is then a CSV that will get pumped back into stage 2 of the performance test, activating user accounts using the activation codes that were paired up.

Apologies if my notation for the records was misleading, I have updated them accordingly.

Thanks for the replies, I am going to give two ideas a try:

  • Sorting the lists first (I don't know how expensive this is)
  • Creating a dictionary with the transactionCodes as the key, then storing the user and activation code in a list as the value

Performance isn't overly important for me; I just want to try and get into good habits with my Python programming.

Here's a radical approach.

Don't.

You have two CSV files; one (users) is clearly the driver. Leave this alone. The other -- transaction codes for a user -- can be turned into a simple dictionary.

Don't "combine" or "join" anything except when absolutely necessary. Certainly don't "merge" or "pre-join".

Write your application to simply do lookups in the other collection.

Create a list of dictionaries and iterate over the list, comparing the key each time,

Close. It looks like this. Note: No Sort.

import csv

# Build the lookup table once, keyed on the shared transaction number.
# Column names follow the record format described in the question.
with open('activations.csv', 'rb') as act_data:
    rdr = csv.DictReader(act_data)
    activations = dict((row['transaction'], row) for row in rdr)

# Drive off the users file, looking up each activation code as you go.
with open('users.csv', 'rb') as user_data:
    rdr = csv.DictReader(user_data)
    with open('users_2.csv', 'wb') as updated_data:
        wtr = csv.DictWriter(updated_data, ['user', 'activationNumber'],
                             extrasaction='ignore')
        for user in rdr:
            user['activationNumber'] = activations[user['transaction']]['activationNumber']
            wtr.writerow(user)

This is fast and simple. Save the dictionaries (use shelve or pickle).
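For the shelve/pickle step, a minimal sketch (file name illustrative) could be:

import pickle

# Save the lookup table built in stage 1 so stage 2 can reuse it
# without re-parsing the CSV.
with open('activations.pkl', 'wb') as f:
    pickle.dump(activations, f)

# Later, in stage 2:
with open('activations.pkl', 'rb') as f:
    activations = pickle.load(f)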

however, worst case this could run up to len(inputDict)*len(outputDict) comparisons <- Not sure?

False.

One list is the "driving" list. The other is the lookup list. You'll drive by iterating through users and looking up the appropriate values for each transaction. This is O(n) on the list of users. The lookup is O(1) because dictionaries are hashes.

Sort the two data sets by transaction number. That way, you always only need to keep one row of each in memory.
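A rough sketch of that streaming merge, assuming both inputs arrive as iterables of dicts in the record format above, already sorted by transaction number, and that every user transaction has a matching activation record:

def merge_sorted(users, activations):
    # Hold only the current row of each stream in memory.
    act_iter = iter(activations)
    act = next(act_iter)
    for u in users:
        # Advance the activations stream until it catches up with this user.
        while act['transaction'] < u['transaction']:
            act = next(act_iter)
        if act['transaction'] == u['transaction']:
            yield {'user': u['user'],
                   'activationNumber': act['activationNumber']}

Sorting costs O(n log n) up front, but the merge itself is a single O(n + m) pass holding one row of each stream at a time.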

This looks like a typical use for dictionaries with the transaction number as key. But you don't have to create the common structure; just build the lookup dictionaries and use them as needed.

I'd create a map myTransactionNumber -> {transaction: myTransactionNumber, activationNumber: myActivationNumber} and then iterate over the {user: myUser, transaction: myTransactionNumber} entries, searching the map for each needed myTransactionNumber. Since Python dictionaries are hash tables, each lookup is O(1) on average, so the overall complexity is O(M + N), where N is the number of activation entries (to build the map) and M is the number of user entries.
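A self-contained sketch of that approach, with made-up sample records in the question's format:

# Sample records in the question's format (made up for illustration).
users = [{'user': 'aUser', 'transaction': 'T1'},
         {'user': 'bUser', 'transaction': 'T2'}]
activations = [{'transaction': 'T1', 'activationNumber': 'A100'},
               {'transaction': 'T2', 'activationNumber': 'A200'}]

# Build the map once, then resolve each user's transaction through it.
lookup = dict((a['transaction'], a['activationNumber']) for a in activations)
result = [{'user': u['user'], 'activationNumber': lookup[u['transaction']]}
          for u in users]
# result == [{'user': 'aUser', 'activationNumber': 'A100'},
#            {'user': 'bUser', 'activationNumber': 'A200'}]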
