I have two lists of tuples that I need to merge; this is comparable to a JOIN in database terms. The order of the tuples within each list may change, but the order of the items within each tuple will not. The count of items in A should equal the count in B, but the counts may differ slightly (some keys may appear in only one list).
Here are my two lists of tuples. There will be 10,000+ of these tuples in each list, so performance is a concern. The first element in each tuple is the key common to both lists.
listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]
The desired output is:
listC = [(u'123', u'a1', u'a2', 123, 789, u'f1', u'Y', u'f2', 2), (u'125', u'c1', u'c2', 156, 852, u'd1', u'N', u'd2', 1)]
Here's the code that I threw together for testing the concept. It works, but as you can see, performance is an issue. With real data (10K items in each list) it could potentially take hours to complete.
Here's the code:
merged = []
for row in listA:
    for item in listB:
        if item[0] == row[0]:
            item = list(item)
            del item[0]
            row = list(row)
            merged.append(tuple(row + item))
How can I merge / join the two lists and achieve better performance?
Inner join two lists of tuples on the first (unique in each list) column using itertools.groupby(), as suggested by @CoryKramer in the comments:
from itertools import groupby
from operator import itemgetter
def inner_join(a, b):
    L = a + b
    L.sort(key=itemgetter(0))  # sort by the first column
    for _, group in groupby(L, itemgetter(0)):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None:  # join
            yield row_a + row_b[1:]  # cut the first column from the second row
Example:
result = list(inner_join(listA, listB))
assert result == listC
This solution has O(n*log n) time complexity (your solution in the question is O(n*n), which is much worse for n ~ 10000).
The extra list allocation doesn't matter for a small n such as 10**4 as in the question, but in Python 3.5+ you could use heapq.merge() with its key parameter to avoid allocating a new list, i.e., for an O(1) constant-memory solution:
from heapq import merge  # merge accepts a key parameter in Python 3.5+

def inner_join(a, b):
    key = itemgetter(0)
    a.sort(key=key)
    b.sort(key=key)
    for _, group in groupby(merge(a, b, key=key), key):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None:  # join
            yield row_a + row_b[1:]  # cut the first column from the second row
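For completeness, a runnable sketch of this version with the question's data (the definition is repeated so the snippet is self-contained). Note that it sorts the caller's lists in place, and that a key with no partner, such as u'124' here, is dropped:

```python
from heapq import merge          # merge accepts key= in Python 3.5+
from itertools import groupby
from operator import itemgetter

def inner_join(a, b):
    key = itemgetter(0)
    a.sort(key=key)              # note: sorts the input lists in place
    b.sort(key=key)
    for _, group in groupby(merge(a, b, key=key), key):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None:    # key present in both lists -> join
            yield row_a + row_b[1:]

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357),
         (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

result = list(inner_join(listA, listB))
```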
Here's a dict-based solution; it is an O(n) algorithm, linear in both time and space:
def inner_join(a, b):
    d = {}
    for row in b:
        d[row[0]] = row
    for row_a in a:
        row_b = d.get(row_a[0])
        if row_b is not None:  # join
            yield row_a + row_b[1:]
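A quick check with the question's data (repeating the definition so the snippet runs standalone). The dict index replaces the inner scan of listB with an O(1) average-case lookup, and output rows follow the order of `a`:

```python
def inner_join(a, b):
    d = {}
    for row in b:
        d[row[0]] = row          # index b by its first column
    for row_a in a:
        row_b = d.get(row_a[0])  # O(1) average lookup instead of scanning b
        if row_b is not None:
            yield row_a + row_b[1:]

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357),
         (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

result = list(inner_join(listA, listB))  # u'124' has no match, so it is skipped
```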
Here's the collections.defaultdict-based solution mentioned by @Padraic Cunningham:
from collections import defaultdict
from itertools import chain
def inner_join(a, b):
    d = defaultdict(list)
    for row in chain(a, b):
        d[row[0]].append(row[1:])
    for id, rows in d.items():  # use d.iteritems() on Python 2
        if len(rows) > 1:
            assert len(rows) == 2
            yield (id,) + rows[0] + rows[1]
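Assuming Python 3, a quick self-contained check. On CPython 3.7+ dicts preserve insertion order, so the output happens to follow the order keys are first seen; on older versions the row order is arbitrary, so it is safest to compare as a set:

```python
from collections import defaultdict
from itertools import chain

def inner_join(a, b):
    d = defaultdict(list)
    for row in chain(a, b):
        d[row[0]].append(row[1:])      # group key-stripped rows under their key
    for id, rows in d.items():
        if len(rows) > 1:              # key present in both lists
            yield (id,) + rows[0] + rows[1]

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357),
         (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

result = set(inner_join(listA, listB))  # compare as a set: row order is not guaranteed
```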
Have you used pandas before? This seems to give your desired output:
In [41]:
import pandas as pd
listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]
A = pd.DataFrame(listA)
B = pd.DataFrame(listB)
A.merge(B, on=0)
Out[41]:
0 1_x 2_x 3_x 4_x 1_y 2_y 3_y 4_y
0 123 a1 a2 123 789 f1 Y f2 2
1 125 c1 c2 156 852 d1 N d2 1
`A` and `B` are pandas DataFrames, which have some SQL-like functionality built in, such as merge. If you haven't used pandas, let me know if you need further explanation.
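If you need the merged rows back as a plain list of tuples rather than a DataFrame, one way (my addition, not part of the original answer) is DataFrame.itertuples with name=None:

```python
import pandas as pd

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357),
         (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

merged = pd.DataFrame(listA).merge(pd.DataFrame(listB), on=0)  # inner join on column 0
listC = list(merged.itertuples(index=False, name=None))        # back to plain tuples
```

Note that numeric columns come back as NumPy scalars, which still compare equal to the original Python ints.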
You can group by the first element using an OrderedDict, appending each tuple's remainder, then keep and join only the entries whose value list has length > 1:
from itertools import chain
from collections import OrderedDict
od = OrderedDict()
for ele in chain(listA, listB):
    od.setdefault(ele[0], []).append(ele[1:])

print([(k,) + tuple(chain.from_iterable(v)) for k, v in od.items() if len(v) > 1])
Output:
[('123', 'a1', 'a2', 123, 789, 'f1', 'Y', 'f2', 2), ('125', 'c1', 'c2', 156, 852, 'd1', 'N', 'd2', 1)]
If order does not matter, a collections.defaultdict will be faster; either way, this will be significantly faster than your own approach.
Or store itertools.islice objects, using a flag to find matched keys:
from itertools import chain, islice
from collections import OrderedDict
od = OrderedDict()
for ele in chain(listA, listB):
    k = ele[0]
    if k in od:
        od[k]["v"].append(islice(ele, 1, None))
        od[k]["flag"] = True
    else:
        od.setdefault(k, {"flag": False, "v": []})["v"].append(islice(ele, 1, None))

print([(k,) + tuple(chain.from_iterable(v["v"])) for k, v in od.items() if v["flag"]])
Output:
[('123', 'a1', 'a2', 123, 789, 'f1', 'Y', 'f2', 2), ('125', 'c1', 'c2', 156, 852, 'd1', 'N', 'd2', 1)]