Sort-Merge-Join Algorithm in Python

My code:

def merge_join(self, outer, outer_join_index, inner, inner_join_index):
    a=list(inner)
    b=list(outer)
    if not a or not b:
        return
    inner_copy = sorted(a,key=lambda tup: tup[inner_join_index])
    outer_copy = sorted(b,key=lambda tup: tup[outer_join_index])
    inner_counter=0
    outer_counter=0
    while inner_counter < len(inner_copy) and outer_counter < len(outer_copy):
        if outer_copy[outer_counter][outer_join_index]==inner_copy[inner_counter][inner_join_index]:
            yield outer_copy[outer_counter]+inner_copy[inner_counter]
            outer_counter+=1
        elif outer_copy[outer_counter][outer_join_index]<inner_copy[inner_counter][inner_join_index]:
            outer_counter+=1
        else:
            inner_counter+=1

Where outer and inner are generators.

I ran a given test for the algorithm, but the generator yielded 127 rows instead of the expected 214. Can anyone help me check where the bug might be in my code? Thank you!!

If you want to pick the matching outer row for each inner row (assuming there are no duplicate keys in outer, and skipping rows that have no match), then on a match you should increment inner_counter, not outer_counter as you are doing.

The reason is that otherwise if multiple inner rows have the same value you will only output the first of them.
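A minimal sketch of that one-line change, written as a standalone function rather than a method. The name merge_join_unique_outer and the sample rows are made up for illustration, and the sketch assumes join keys are unique in outer:

```python
def merge_join_unique_outer(outer, outer_join_index, inner, inner_join_index):
    """Merge join that assumes join keys are unique in `outer` (illustrative name)."""
    inner_copy = sorted(inner, key=lambda tup: tup[inner_join_index])
    outer_copy = sorted(outer, key=lambda tup: tup[outer_join_index])
    inner_counter = 0
    outer_counter = 0
    while inner_counter < len(inner_copy) and outer_counter < len(outer_copy):
        outer_key = outer_copy[outer_counter][outer_join_index]
        inner_key = inner_copy[inner_counter][inner_join_index]
        if outer_key == inner_key:
            yield outer_copy[outer_counter] + inner_copy[inner_counter]
            # Advance inner only: the next inner row may match the same outer row.
            inner_counter += 1
        elif outer_key < inner_key:
            outer_counter += 1
        else:
            inner_counter += 1


# Made-up sample data: key 1 appears twice in inner.
outer = [(1, "a"), (2, "b")]
inner = [(1, "x"), (1, "y"), (2, "z")]
rows = list(merge_join_unique_outer(outer, 0, inner, 0))
print(rows)  # [(1, 'a', 1, 'x'), (1, 'a', 1, 'y'), (2, 'b', 2, 'z')]
```

With the code from the question, the same input yields only two rows, because (1, "y") is skipped once outer_counter has moved past key 1.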

If instead you want a full join (producing the cartesian product of the inner and outer rows for each value of the join column), then this has to be coded explicitly, with something like:

while inner_counter < len(inner_copy) and outer_counter < len(outer_copy):
    # Smallest join key not yet processed on either side
    key = min(inner_copy[inner_counter][inner_join_index],
              outer_copy[outer_counter][outer_join_index])
    # Collect all inner rows with this key
    inner_group = []
    while inner_counter < len(inner_copy) and key == inner_copy[inner_counter][inner_join_index]:
        inner_group.append(inner_copy[inner_counter])
        inner_counter += 1
    # Collect all outer rows with this key
    outer_group = []
    while outer_counter < len(outer_copy) and key == outer_copy[outer_counter][outer_join_index]:
        outer_group.append(outer_copy[outer_counter])
        outer_counter += 1
    # Here you can handle left or right join by replacing an
    # empty group with a group of one empty row (None,)*len(row)
    for i in inner_group:
        for o in outer_group:
            yield o + i  # outer columns first, as in the question
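For reference, the grouping loop can be wrapped into a complete generator. This is only a sketch of an inner join: the function name merge_join_groups and the sample rows are made up, and it yields outer columns first, as the code in the question does.

```python
from operator import itemgetter


def merge_join_groups(outer, outer_join_index, inner, inner_join_index):
    """Sort-merge inner join that handles duplicate keys on both sides."""
    # sorted() consumes the generators and returns lists we can index.
    outer_copy = sorted(outer, key=itemgetter(outer_join_index))
    inner_copy = sorted(inner, key=itemgetter(inner_join_index))
    outer_counter = 0
    inner_counter = 0
    while inner_counter < len(inner_copy) and outer_counter < len(outer_copy):
        # Smallest join key not yet processed on either side.
        key = min(inner_copy[inner_counter][inner_join_index],
                  outer_copy[outer_counter][outer_join_index])
        # Collect the run of inner rows with this key (may be empty).
        inner_group = []
        while (inner_counter < len(inner_copy)
               and inner_copy[inner_counter][inner_join_index] == key):
            inner_group.append(inner_copy[inner_counter])
            inner_counter += 1
        # Collect the run of outer rows with this key (may be empty).
        outer_group = []
        while (outer_counter < len(outer_copy)
               and outer_copy[outer_counter][outer_join_index] == key):
            outer_group.append(outer_copy[outer_counter])
            outer_counter += 1
        # Cartesian product of the two runs; empty if either side has no match.
        for o in outer_group:
            for i in inner_group:
                yield o + i


# Made-up sample data with duplicate keys on both sides, passed as iterators.
outer = [(1, "a"), (1, "b"), (3, "c")]
inner = [(1, "x"), (2, "w"), (1, "y"), (3, "z")]
rows = list(merge_join_groups(iter(outer), 0, iter(inner), 0))
print(len(rows))  # 5: 2 x 2 pairs for key 1, none for key 2, 1 for key 3
```

Because each step consumes every row with the current key on both sides, both counters always move forward and the loop ends after one pass over the sorted lists.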
