
Delete elements from dicts in a list if the value of a dict key is in another list

I have code like the following:

dicts = [
        {'one': 'hello',
         'two': 'world',
         'three': ['a', 'b', 'c', 'd'],
         'four': 'foo'
        },
        {'one': 'pizza',
         'two': 'cake',
         'three': ['f', 'g', 'h', 'e'],
         'four': 'bar'
        }
       ]

letters = ['q', 'w', 'e', 'r', 't', 'y']

dedup_rows = [row for row in dicts if row['three'][3] not in letters]

The objective is that dedup_rows should contain the elements of dicts in which the fourth element of the list stored under three is not contained in the list letters. Essentially, delete row from dicts if row['three'][3] in letters. The output from the above code would be:

dedup_rows: [
             {'one': 'hello',
              'two': 'world',
              'three': ['a', 'b', 'c', 'd'],
              'four': 'foo'
             }
            ]

The code I have works, but in practice both dicts and letters contain hundreds of thousands of elements each, so execution is slow: each iteration over dicts also requires a full scan of letters.

Is there a more optimal way of doing this in Python?

Your code dedup_rows = [row for row in dicts if row['three'][3] not in letters] has quadratic complexity, since it iterates over letters for every element of dicts.
If both of your lists contain a large number of elements, you should use a data structure with constant-time lookup. For this case, Python sets are perfect; you can read more about them in the Python documentation.
All you need to do is convert letters = ['q', 'w', 'e', 'r', 't', 'y'] to a set with set(letters) and test membership with x in letters_set:

dicts = [
    {'one': 'hello',
     'two': 'world',
     'three': ['a', 'b', 'c', 'd'],
     'four': 'foo'
    },
    {'one': 'pizza',
     'two': 'cake',
     'three': ['f', 'g', 'h', 'e'],
     'four': 'bar'
    }
   ]

letters = ['q', 'w', 'e', 'r', 't', 'y']
letters_set = set(letters)

dedup_rows = [row for row in dicts if row['three'][3] not in letters_set]

This changes the algorithm from quadratic to linear time, since each set lookup is constant time on average.

If you are really dealing with hundreds of thousands of records, and lists with hundreds of thousands of values, then perhaps a pure in-memory Python approach is not the best way forward.

There are a few things you can do that will improve performance:

  • Stream in records from your source (file? database?) instead of loading them at once
  • Use a generator which reads the records one at a time and yields only those that pass the filter, so they are never all held in a list at once (see the sketch after this list)
  • Use sets for set comparisons which will be a lot faster for many values
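
A minimal sketch of the generator idea, assuming the records arrive as newline-delimited JSON in a file called records.jsonl (the file name, format, and the final print call are hypothetical; adapt them to your actual source):

import json

def filter_rows(path, letters_set):
    # Read one record per line and yield only those whose fourth
    # 'three' element is not in letters_set.
    with open(path) as fh:
        for line in fh:
            row = json.loads(line)
            if row['three'][3] not in letters_set:
                yield row

letters_set = set(['q', 'w', 'e', 'r', 't', 'y'])

for row in filter_rows('records.jsonl', letters_set):
    print(row['four'])  # replace with whatever you do with a kept row

This keeps memory usage bounded by a single record plus the set, rather than the whole input.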

In general, though, this raises the question: where are you getting these records from?

If they are stored in any kind of database, then performing a query at the source that rules out the rows you don't want, and provides a cursor to iterate through the rows you do want in a memory-efficient way, sounds like a better way to go.
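
For example, if the fourth letter were stored in its own column, the filter could run entirely in the database. This is only a sketch under that assumption; the table name rows, the column fourth_letter, and the file data.db are all hypothetical:

import sqlite3

letters = ['q', 'w', 'e', 'r', 't', 'y']

conn = sqlite3.connect('data.db')
placeholders = ', '.join('?' for _ in letters)
cursor = conn.execute(
    "SELECT one, two, three, four FROM rows "
    "WHERE fourth_letter NOT IN ({})".format(placeholders),
    letters,
)

# The cursor streams the matching rows, so nothing you don't need is loaded.
for row in cursor:
    ...  # handle each kept row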
