How do I merge two lists of tuples based on a key?

I have two lists of tuples that I need to merge, comparable to a JOIN in database terms. The order of the tuples in each list may change; the order of the items within a tuple will not. The two lists should contain about the same number of tuples, but some keys may appear in only one of them.

Here are my two lists of tuples. There will be 10,000+ of these tuples in each list so performance is a concern. The first element in each tuple is the key common to each list.

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

The desired output is:

listC = [(u'123', u'a1', u'a2', 123, 789, u'f1', u'Y', u'f2', 2), (u'125', u'c1', u'c2', 156, 852, u'd1', u'N', u'd2', 1)]

Here's the code I threw together to test the concept. It works, but as you can see, performance is an issue: with real data (10K items in each list) it is unacceptably slow and could potentially take hours to complete.

Here's the code:

merged = []  # initialization was missing from the original snippet
for row in listA:
    for item in listB:
        if item[0] == row[0]:  # match on the key column
            item = list(item)
            del item[0]  # drop the duplicate key from the second row
            row = list(row)
            merged.append(tuple(row + item))

How can I merge / join the two lists and achieve better performance?

Inner join two lists of tuples on the first column (unique within each list) using itertools.groupby(), as suggested by @CoryKramer in the comments:

from itertools import groupby
from operator import itemgetter

def inner_join(a, b):
    L = a + b
    L.sort(key=itemgetter(0)) # sort by the first column
    for _, group in groupby(L, itemgetter(0)):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None: # join
            yield row_a + row_b[1:] # cut 1st column from 2nd row

Example:

result = list(inner_join(listA, listB))
assert result == listC

This solution has O(n log n) time complexity (the solution in the question is O(n²), which is much worse for n ~ 10000).

It doesn't matter for a small n such as 10**4 as in the question, but in Python 3.5+ you could use heapq.merge() with a key parameter to avoid allocating a new concatenated list, i.e., an O(1) constant extra-memory solution:

from heapq import merge # merge accepts a key parameter in Python 3.5+
from itertools import groupby
from operator import itemgetter

def inner_join(a, b):
    key = itemgetter(0)
    a.sort(key=key) # note: sorts the input lists in place
    b.sort(key=key)
    for _, group in groupby(merge(a, b, key=key), key):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None: # join
            yield row_a + row_b[1:] # cut 1st column from 2nd row
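Run against the sample data from the question, this version reproduces listC. A self-contained copy for reference (merge() is stable, so for equal keys the row from a precedes the row from b):

```python
from heapq import merge          # key= requires Python 3.5+
from itertools import groupby
from operator import itemgetter

def inner_join(a, b):
    key = itemgetter(0)
    a.sort(key=key)              # sorts the inputs in place
    b.sort(key=key)
    for _, group in groupby(merge(a, b, key=key), key):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None:    # key present in both lists: join
            yield row_a + row_b[1:]

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357),
         (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

result = list(inner_join(listA, listB))
```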

Here's a dict-based solution. It is an O(n) algorithm, linear in both time and space:

def inner_join(a, b):
    d = {}
    for row in b:
        d[row[0]] = row
    for row_a in a:
        row_b = d.get(row_a[0])
        if row_b is not None: # join
            yield row_a + row_b[1:]
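With the question's sample data, the dict-based join yields rows in listA's order, matching listC (a self-contained sketch):

```python
def inner_join(a, b):
    d = {}
    for row in b:
        d[row[0]] = row              # index list b by its key column
    for row_a in a:
        row_b = d.get(row_a[0])
        if row_b is not None:        # key present in both lists: join
            yield row_a + row_b[1:]

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357),
         (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

result = list(inner_join(listA, listB))
```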

Here's a collections.defaultdict-based solution mentioned by @Padraic Cunningham:

from collections import defaultdict
from itertools import chain

def inner_join(a, b):
    d = defaultdict(list)
    for row in chain(a, b):
        d[row[0]].append(row[1:])
    for id, rows in d.items(): # use d.iteritems() on Python 2
        if len(rows) > 1:
            assert len(rows) == 2
            yield (id,) + rows[0] + rows[1]
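On Python 3.7+, plain dicts preserve insertion order, so this grouping variant also emits rows in first-seen key order and reproduces listC (a self-contained sketch):

```python
from collections import defaultdict
from itertools import chain

def inner_join(a, b):
    d = defaultdict(list)
    for row in chain(a, b):          # group every row by its key column
        d[row[0]].append(row[1:])
    for key, rows in d.items():      # insertion-ordered on Python 3.7+
        if len(rows) > 1:            # key seen in both lists
            yield (key,) + rows[0] + rows[1]

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357),
         (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

result = list(inner_join(listA, listB))
```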

Have you used pandas before? This seems to give your desired output:

In [41]:
import pandas as pd
listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

A = pd.DataFrame(listA)
B = pd.DataFrame(listB)

A.merge(B, on=0)
Out[41]:
     0 1_x 2_x  3_x  4_x 1_y 2_y 3_y  4_y
0  123  a1  a2  123  789  f1   Y  f2    2
1  125  c1  c2  156  852  d1   N  d2    1

`A` and `B` are pandas DataFrames, which have some SQL-like functionality (such as merge) built into them. If you haven't used pandas, let me know if you need further explanation.

See Database-style DataFrame joining/merging.

You can group the tuples by their first element using an OrderedDict, appending each tuple, then keep and join only those whose value list has a length > 1:

from itertools import chain
from collections import OrderedDict

od = OrderedDict()

for ele in chain(listA,listB):
    od.setdefault(ele[0], []).append(ele[1:])

print([(k,) + tuple(chain.from_iterable(v)) for k, v in od.items() if len(v) > 1]) # od.iteritems() on Python 2

Output:

[('123', 'a1', 'a2', 123, 789, 'f1', 'Y', 'f2', 2), ('125', 'c1', 'c2', 156, 852, 'd1', 'N', 'd2', 1)]

If order does not matter, a collections.defaultdict will be faster; either way, this will be significantly faster than your own approach.

Or storing itertools.islice objects using a flag to find matched keys:

from itertools import chain, islice
from collections import OrderedDict

od = OrderedDict()

for ele in chain(listA, listB):
    k = ele[0]
    if k in od:
        od[k]["v"].append(islice(ele, 1, None))
        od[k]["flag"] = True
    else:
        od.setdefault(k, {"flag": False, "v": []})["v"].append(islice(ele, 1, None))

print([(k,) + tuple(chain.from_iterable(v["v"])) for k, v in od.items() if v["flag"]])

Output:

[('123', 'a1', 'a2', 123, 789, 'f1', 'Y', 'f2', 2), ('125', 'c1', 'c2', 156, 852, 'd1', 'N', 'd2', 1)]
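As a rough check that the fast joins agree with the question's nested-loop version, here is a sketch comparing the two on the sample data (the nested loop is O(n*m); indexing one list in a dict first brings it down to O(n + m)):

```python
def nested_loop_join(a, b):          # the question's O(n*m) approach
    merged = []
    for row in a:
        for item in b:
            if item[0] == row[0]:
                merged.append(row + item[1:])
    return merged

def dict_join(a, b):                 # O(n + m): index b by key first
    d = {row[0]: row for row in b}
    return [row + d[row[0]][1:] for row in a if row[0] in d]

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357),
         (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

assert nested_loop_join(listA, listB) == dict_join(listA, listB)
```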
