What is a better approach to filter these two lists into one?

Question

Background

I have two lists. The first, items, contains around 250 tuples, each with 3 elements:

(path_to_a_file, size_in_bytes, modified_time)

The second list, result, contains up to 250 elements. It is the result of a database query that looks up rows by the paths in the items list, so the number of elements in result depends on how many of those files are already in the database.

Each element in result is a row object returned from a SQLAlchemy query, with attributes for the row values (path, mtime and hash are the ones I'm interested in here).

What I'm trying to do is filter out all the elements in items that are in result with the same mtime (keeping track of how many were filtered and their total size), and build a new list of the items that either have a different mtime or don't exist in result. Items with a different mtime need to be stored as (path, size, mtime_from_result, hash_from_result), and items that weren't in the database as (path, size, mtime, None).

I hope I'm not making this too localised but I felt I needed to explain what I'm trying to accomplish to ask the question.

Problem

I want to try and make this loop as fast as possible but the most important part is making it work as expected.

Is it safe to remove items from the lists as I iterate over them? I noticed iterating forwards has a weird outcome but iterating backwards seems to be ok. Is there a better approach?

I'm removing items that I've matched up ( i.path == j[0] ) because I know the relationship is 1 to 1 and it's not going to match again, so by shrinking the lists I can iterate faster on the next pass and, more importantly, I'm left with all the unmatched items.

I can't help feeling there's a much nicer solution that I'm overlooking, perhaps with a list comprehension or generators.

send_items=[]
for i in result[::-1]:
    for j in items[::-1]:
        if i.path==j[0]:
            result.remove(i) #I think this remove is possibly pointless?
            items.remove(j)
            if i.mtime==j[2]:
                self.num_skipped+=1
                self.size_skipped+=j[1]
            else:
                send_items.append((j[0],j[1],i.mtime,i.hash))
            break
send_items.extend(((j[0],j[1],j[2],None) for j in items))

Answer 1

I'd do this as:

def get_send_items(items, results):
    send_items = []
    results_dict = {i.path:i for i in results}
    for p, s, m in items:
        result = results_dict.get(p)
        if result is None:
            send_items.append((p, s, m, None))
        elif result.mtime != m:
            send_items.append((p, s, result.mtime, result.hash))
    return send_items
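
The function above does not keep the running totals the question asks for (num_skipped and size_skipped). A minimal sketch of one way to add them, assuming the rows only need path, mtime and hash attributes. The Row namedtuple and the name filter_items are hypothetical stand-ins for illustration, not part of the original code:

```python
from collections import namedtuple

# Hypothetical stand-in for a SQLAlchemy row; only the attributes used here.
Row = namedtuple("Row", ["path", "mtime", "hash"])


def filter_items(items, results):
    """Return (send_items, num_skipped, size_skipped)."""
    by_path = {row.path: row for row in results}  # O(1) lookups by path
    send_items = []
    num_skipped = 0
    size_skipped = 0
    for path, size, mtime in items:
        row = by_path.get(path)
        if row is None:
            # Not in the database yet.
            send_items.append((path, size, mtime, None))
        elif row.mtime != mtime:
            # Known file whose mtime changed: keep the stored mtime and hash.
            send_items.append((path, size, row.mtime, row.hash))
        else:
            # Unchanged file: skip it, but count it.
            num_skipped += 1
            size_skipped += size
    return send_items, num_skipped, size_skipped


# Made-up sample data for the sketch.
items = [("a.txt", 10, 100), ("b.txt", 20, 200), ("c.txt", 30, 300)]
results = [Row("a.txt", 100, "h1"), Row("b.txt", 250, "h2")]
send_items, num_skipped, size_skipped = filter_items(items, results)
```

Returning the counters instead of mutating self keeps the function easy to test; the caller can add them to self.num_skipped and self.size_skipped.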

Here is my analysis of your solution (Assuming both result and items are of length N):

result[::-1] creates a reversed copy of result, so calling result.remove(i) doesn't affect the iteration. You only loop over result once, so removing elements from it is pointless; it only creates extra work.
If all you need is a copy to iterate over, result[:] does the same without reversing.
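Calling items.remove(j) doesn't buy much speed either. remove() takes O(N) time, and items[::-1] already copies the whole list on every pass of the outer loop. Because remove() runs at most once per pass (right before the break), the algorithm stays at O(N^2) rather than growing to O(N^3); the shrinking list only trims a constant factor.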
By using O(N) extra memory (as in my solution) you can reduce the run time to O(N), because a dictionary or a set gives O(1) lookups.
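
On the question of whether removing while iterating is safe, here is a small sketch of why iterating forwards over the list itself gives a weird outcome, while iterating over a copy such as [:] or [::-1] does not. The list names forward and copied are made up for the illustration:

```python
# Removing from the list being iterated: the iterator's index keeps advancing
# while the remaining elements shift left, so every other element is skipped.
forward = [1, 2, 3, 4, 5, 6]
for x in forward:
    forward.remove(x)

# Iterating over a copy visits every element, because the copy never changes.
copied = [1, 2, 3, 4, 5, 6]
for x in copied[:]:
    copied.remove(x)

print(forward)  # [2, 4, 6]
print(copied)   # []
```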

Answer 2

First of all, I am assuming that the file path identifies a file, i.e. that paths are unique.

We create a dictionary of the results, so we can easily check for membership, and check the values associated with it.

dict_results = {file: (size, modified_time) for file, size, modified_time in results}

We can then use a list comprehension to filter out the elements you don't want:

[(file, size, modified_time) for file, size, modified_time in items if (file not in dict_results) or (dict_results[file][1] != modified_time)]

Eg:

>>> results = [(1, 1, 1), (2, 2, 3)]
>>> items = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]
>>> dict_results = {file: (size, modified_time) for file, size, modified_time in results}
>>> [(file, size, modified_time) for file, size, modified_time in items if (file not in dict_results) or (dict_results[file][1] != modified_time)]
[(2, 2, 2), (3, 3, 3)]
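
The example above treats results as 3-tuples, whereas the question describes row objects with path, mtime and hash attributes and wants 4-tuples back. A hedged sketch of the same comprehension adapted to that shape, where the Row namedtuple is a hypothetical stand-in for the SQLAlchemy rows and the hash strings are made up:

```python
from collections import namedtuple

# Hypothetical stand-in for the SQLAlchemy rows described in the question.
Row = namedtuple("Row", ["path", "mtime", "hash"])

results = [Row(1, 1, "h1"), Row(2, 3, "h2")]
items = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]

dict_results = {row.path: row for row in results}

send_items = [
    # Known path with a different mtime: take mtime and hash from the row.
    (path, size, dict_results[path].mtime, dict_results[path].hash)
    if path in dict_results
    # Unknown path: keep the item's own mtime and no hash.
    else (path, size, mtime, None)
    for path, size, mtime in items
    if path not in dict_results or dict_results[path].mtime != mtime
]

print(send_items)  # [(2, 2, 3, 'h2'), (3, 3, 3, None)]
```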

Answer 3

How about inserting the results into a set as Marcin suggests, and using a generator expression to filter the items:

mtimes_set = set(result[2] for result in results)
send_items = (item for item in items if item[2] not in mtimes_set)

Edit: I misunderstood the path part. This can still be done by keying on the path (although it gets a bit ugly around the last set of brackets):

path_dict = dict((result[0], result) for result in results)
send_items = (item for item in items if item[0] in path_dict and path_dict[item[0]][2] != item[2])

Here I am creating a dictionary keyed by path, then a generator that yields the items whose path is in the dict and whose mtime differs. It could easily be changed to yield the path_dict entry instead of item.
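
A small sketch of why the mtime-only set gives the wrong answer when two different files share an mtime, and of how the path-keyed filter could be extended to also keep items that are missing from results. The sample data is made up, and results is assumed to hold (path, size, mtime) tuples as in the snippets above:

```python
# (path, size, mtime) tuples, the same shape the snippets above index into.
results = [("a.txt", 10, 100)]
items = [("a.txt", 10, 100), ("b.txt", 20, 100)]  # b.txt is new but shares an mtime

# mtime-only filter: b.txt is dropped because some other file has mtime 100.
mtimes_set = set(result[2] for result in results)
by_mtime = [item for item in items if item[2] not in mtimes_set]

# Path-keyed filter, extended to also keep items that are not in results.
path_dict = dict((result[0], result) for result in results)
by_path = [
    item for item in items
    if item[0] not in path_dict or path_dict[item[0]][2] != item[2]
]

print(by_mtime)  # []
print(by_path)   # [('b.txt', 20, 100)]
```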

Answer 4

First stab:

items_dict = dict((el[0], el[1:]) for el in items)  # path -> (size, mtime)
new = []
modified = []
other = []
for res in result:
    # Rows whose path has no entry in items fall back to (None, None).
    item = items_dict.get(res.path, (None, None))
    if item == (None, None):  # compare by value; "is" tests identity, which is unreliable for tuples
        put_to = new
    elif res.mtime != item[1]:
        put_to = modified
    else:
        put_to = other
    put_to.append((res.path, item))

4 answers

solution1
1 ACCEPTED 2012-06-21 14:34:05

solution2
0 2012-06-21 14:31:22

solution3
0 2012-06-21 14:32:57

solution4
0 2012-06-21 14:39:04
