
How do I merge two lists of tuples based on a key?

I have two lists of tuples that I need to merge, comparable to a JOIN in database terms. The order of the tuples in each list may change, but the order of the items within a tuple will not. The number of items in A should equal the number in B, but there may be a difference.

Here are my two lists of tuples. There will be 10,000+ tuples in each list, so performance is a concern. The first element in each tuple is the key common to both lists.

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

The desired output is:

listC = [(u'123', u'a1', u'a2', 123, 789, u'f1', u'Y', u'f2', 2), (u'125', u'c1', u'c2', 156, 852, u'd1', u'N', u'd2', 1)]

Here's the code that I threw together to test the concept. It works, but performance is an issue: with real data (10K items in each list) it is unacceptable, as it could potentially take hours to complete.

Here's the code:

merged = []  # collects the joined rows
for row in listA:
    for item in listB:
        if item[0] == row[0]:  # keys match
            item = list(item)
            del item[0]  # drop the duplicate key column
            row = list(row)
            merged.append(tuple(row + item))

How can I merge / join the two lists and achieve better performance?

Inner join two lists of tuples on the first column (unique within each list) using itertools.groupby(), as suggested by @CoryKramer in the comments:

from itertools import groupby
from operator import itemgetter

def inner_join(a, b):
    L = a + b
    L.sort(key=itemgetter(0)) # sort by the first column
    for _, group in groupby(L, itemgetter(0)):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None: # join
            yield row_a + row_b[1:] # cut 1st column from 2nd row

Example:

result = list(inner_join(listA, listB))
assert result == listC

This solution has O(n*log n) time complexity; the solution in the question is O(n*n), which is much worse for n ~ 10000.
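To make the difference concrete, here is a rough sketch that times both approaches on synthetic data. The function names, the list size `n`, and the generated rows are assumptions made for illustration; actual timings will vary by machine.

```python
import random
import timeit
from itertools import groupby
from operator import itemgetter

def nested_loop_join(a, b):
    """Quadratic join in the style of the question's code."""
    merged = []
    for row in a:
        for item in b:
            if item[0] == row[0]:
                merged.append(row + item[1:])
    return merged

def sort_join(a, b):
    """Sort-and-group join; assumes keys are unique within each list."""
    rows = sorted(a + b, key=itemgetter(0))  # stable: rows from a come first
    out = []
    for _, group in groupby(rows, itemgetter(0)):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None:
            out.append(row_a + row_b[1:])
    return out

n = 1000  # hypothetical size, kept small so the quadratic version finishes quickly
keys = [str(i) for i in range(n)]
list_a = [(k, 'a' + k) for k in keys]
list_b = [(k, 'b' + k) for k in keys]
random.shuffle(list_b)  # the order of the tuples may differ between lists

slow = nested_loop_join(list_a, list_b)
fast = sort_join(list_a, list_b)

# Each call runs once; the gap widens as n grows.
print(timeit.timeit(lambda: nested_loop_join(list_a, list_b), number=1))
print(timeit.timeit(lambda: sort_join(list_a, list_b), number=1))
```

Both functions return the same set of joined rows, only in a different order.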

The extra allocation doesn't matter for an n as small as 10**4, as in the question, but in Python 3.5+ you could use heapq.merge() with its key parameter to avoid allocating a new combined list. Note that this version sorts both input lists in place:

from heapq import merge # merge has a key parameter in Python 3.5+
from itertools import groupby
from operator import itemgetter

def inner_join(a, b):
    key = itemgetter(0)
    a.sort(key=key) 
    b.sort(key=key)
    for _, group in groupby(merge(a, b, key=key), key):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None: # join
            yield row_a + row_b[1:] # cut 1st column from 2nd row
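Because this variant calls `sort()` on its arguments, the caller's lists are reordered as a side effect. The sketch below repeats the function under a hypothetical name, `inner_join_merge`, and passes copies so the originals keep their order. Copying gives up the memory saving, so this is only worth doing when the original order matters.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter

def inner_join_merge(a, b):
    """Merge-based inner join; sorts a and b in place (keys unique per list)."""
    key = itemgetter(0)
    a.sort(key=key)
    b.sort(key=key)
    # merge() is lazy and stable: for equal keys, the row from a comes first
    for _, group in groupby(merge(a, b, key=key), key):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None:
            yield row_a + row_b[1:]

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

original_b = list(listB)
# Pass copies so listA and listB keep their original order
result = list(inner_join_merge(list(listA), list(listB)))
print(result)
```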

Here's a dict-based solution. It is an O(n) algorithm, linear in both time and space:

def inner_join(a, b):
    d = {}
    for row in b:
        d[row[0]] = row
    for row_a in a:
        row_b = d.get(row_a[0])
        if row_b is not None: # join
            yield row_a + row_b[1:]
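Since the question mentions that the two lists may differ in length, it can help to report which keys found no partner. The sketch below is one way to do that alongside the dict-based join. The names `inner_join_dict` and `unmatched` are made up for illustration, and it assumes keys are unique within each list (with duplicates in `b`, the last row would win).

```python
def inner_join_dict(a, b):
    """Dict-based inner join on the first column; output follows the order of a."""
    index = {row[0]: row for row in b}  # key -> row of b
    for row_a in a:
        row_b = index.get(row_a[0])
        if row_b is not None:
            yield row_a + row_b[1:]  # drop the duplicate key column

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

joined = list(inner_join_dict(listA, listB))

# Keys present in exactly one of the two lists (symmetric difference)
unmatched = {row[0] for row in listA} ^ {row[0] for row in listB}
print(joined)
print(unmatched)
```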

Here's the collections.defaultdict-based solution mentioned by @Padraic Cunningham:

from collections import defaultdict
from itertools import chain

def inner_join(a, b):
    d = defaultdict(list)
    for row in chain(a, b):
        d[row[0]].append(row[1:])
    for key, rows in d.items(): # .items() works on both Python 2 and 3
        if len(rows) > 1:
            assert len(rows) == 2 # keys are unique within each list
            yield (key,) + rows[0] + rows[1]

Have you used pandas before? This seems to give your desired output:

In [41]:
import pandas as pd
listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

A = pd.DataFrame(listA)
B = pd.DataFrame(listB)

A.merge(B, on=0)
Out[41]:
     0 1_x 2_x  3_x  4_x 1_y 2_y 3_y  4_y
0  123  a1  a2  123  789  f1   Y  f2    2
1  125  c1  c2  156  852  d1   N  d2    1

'A' and 'B' are pandas DataFrames, which have some SQL-like functionality built in, such as merge. If you haven't used pandas, let me know if you need further explanation.

See Database-style DataFrame joining/merging.
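If you need the result back as a list of tuples, as in the question, something along these lines should work. The column names below are placeholders invented for this sketch, since the original data has no headers. `itertuples(index=False, name=None)` yields plain tuples without the index.

```python
import pandas as pd

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

# Hypothetical column names; only 'key' has to match between the two frames
A = pd.DataFrame(listA, columns=['key', 'a1', 'a2', 'a3', 'a4'])
B = pd.DataFrame(listB, columns=['key', 'b1', 'b2', 'b3', 'b4'])

# Inner join on the key column, then convert each row back to a plain tuple
merged = A.merge(B, on='key', how='inner')
listC = list(merged.itertuples(index=False, name=None))
print(listC)
```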

You can group by the first element using an OrderedDict, appending each tuple, then keep and join only the tuples where the value list has a length > 1:

from itertools import chain
from collections import OrderedDict

od = OrderedDict()

for ele in chain(listA,listB):
    od.setdefault(ele[0], []).append(ele[1:])

print([(k,) + tuple(chain.from_iterable(v)) for k, v in od.items() if len(v) > 1])

Output:

[('123', 'a1', 'a2', 123, 789, 'f1', 'Y', 'f2', 2), ('125', 'c1', 'c2', 156, 852, 'd1', 'N', 'd2', 1)]

If order does not matter, a collections.defaultdict will be faster. Either way, this will be significantly faster than your own approach.

Or store itertools.islice objects and use a flag to find matched keys:

from itertools import chain, islice
from collections import OrderedDict

od = OrderedDict()

for ele in chain(listA, listB):
    k = ele[0]
    if k in od:
        od[k]["v"].append(islice(ele, 1, None))
        od[k]["flag"] = True
    else:
        od.setdefault(k, {"flag": False, "v": []})["v"].append(islice(ele, 1, None))

print([(k,) + tuple(chain.from_iterable(v["v"])) for k, v in od.items() if v["flag"]])

Output:

[('123', 'a1', 'a2', 123, 789, 'f1', 'Y', 'f2', 2), ('125', 'c1', 'c2', 156, 852, 'd1', 'N', 'd2', 1)]
