Python: Efficient way of matching slices of strings between two lists

Question

Let's say I have two lists of files with similar names like so:

images = ['image_im_1', 'image_im_2']
masks = ['mask_im_1', 'mask_im_2', 'mask_im_3']

How would I be able to efficiently remove elements that aren't matching? I want to get the following:

images = ['image_im_1', 'image_im_2']
masks = ['mask_im_1', 'mask_im_2']

I've tried doing the following:

setA = set([x[-4:] for x in images])
setB = set([x[-4:] for x in masks])

matches = setA.union(setB)

elems = list(matches)

for elem in elems:
    result = [x for x in images if x.endswith(elem)]

But this is rather naïve and slow, since it rescans the whole list for every suffix and each list has ~100k elements. Any idea how I can implement this efficiently?

Answer 1

First of all, since you want the common endings, you should use intersection, not union:

matches = setA.intersection(setB)

Then matches is already a set, so instead of converting it to a list and looping over it, loop over images and masks and check each suffix for set membership, which is a constant-time lookup on average.

imgres = [x for x in images if x[-4:] in matches]
mskres = [x for x in masks if x[-4:] in matches]
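Put together, the approach might look like the sketch below. One caveat: the fixed x[-4:] slice only identifies a file while the trailing number is short. With ~100k files, 'image_im_21112'[-4:] and 'image_im_31112'[-4:] are both '1112', so different files collide. The helper key() is hypothetical (it is not in the question) and assumes every name has the form <prefix>_<rest>, where <rest> is what images and masks share.

```python
def key(name):
    # Hypothetical helper: everything after the first underscore,
    # e.g. 'image_im_12' -> 'im_12' and 'mask_im_12' -> 'im_12'.
    # Assumes every name contains at least one underscore.
    return name.split('_', 1)[1]

images = ['image_im_1', 'image_im_2', 'image_im_12']
masks = ['mask_im_1', 'mask_im_2', 'mask_im_3', 'mask_im_12']

# Keys present in both lists (set intersection).
common = {key(x) for x in images} & {key(x) for x in masks}

# One pass over each list; order of the original lists is preserved.
imgres = [x for x in images if key(x) in common]
mskres = [x for x in masks if key(x) in common]

print(imgres)
print(mskres)
```

Building the two sets and filtering the two lists are each linear in the list length, so the total work grows with len(images) + len(masks) rather than with their product.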

Answer 2

Your set-based idea is basically sound. You can improve it to a single pass over each list, though, if you store an intermediate map, image_map, from each suffix to the original image name:

# map each suffix to the original image name
image_map = {x[-4:]: x for x in images}

# store all our matches here
matches = []

# loop through the other file names
for mask in masks:
    # a suffix that is also a key in image_map means we have a match
    if mask[-4:] in image_map:
        # save the mask
        matches.append(mask)
        # save the original image name that shares the suffix
        matches.append(image_map[mask[-4:]])
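The loop above leaves masks and images alternating in one flat list. If you would rather keep each image next to its mask, a variant along these lines might be easier to work with. The name pairs is my own, and the sketch keeps the answer's x[-4:] key, so it assumes the last four characters identify a file uniquely.

```python
images = ['image_im_1', 'image_im_2']
masks = ['mask_im_1', 'mask_im_2', 'mask_im_3']

# Map each suffix to the original image name.
image_map = {x[-4:]: x for x in images}

# One (image, mask) tuple per mask whose suffix is also an image suffix.
pairs = [(image_map[m[-4:]], m) for m in masks if m[-4:] in image_map]

# Split the pairs back into the two filtered lists from the question.
imgres = [img for img, _ in pairs]
mskres = [msk for _, msk in pairs]

print(pairs)
```

Because pairs is built by walking masks, the results follow the order of masks, and an image with no mask is dropped simply because it is never looked up.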
