I have code like the following:
dicts = [
    {'one': 'hello',
     'two': 'world',
     'three': ['a', 'b', 'c', 'd'],
     'four': 'foo'
    },
    {'one': 'pizza',
     'two': 'cake',
     'three': ['f', 'g', 'h', 'e'],
     'four': 'bar'
    }
]
letters = ['q', 'w', 'e', 'r', 't', 'y']
dedup_rows = [row for row in dicts if row['three'][3] not in letters]
The objective is that dedup_rows should contain the elements from dicts in which the fourth element of the list stored under 'three' is not contained in the list letters. Essentially: delete row from dicts if row['three'][3] in letters. The output from the above code would be:
dedup_rows: [
    {'one': 'hello',
     'two': 'world',
     'three': ['a', 'b', 'c', 'd'],
     'four': 'foo'
    }
]
The code I have is working, but in practice both dicts and letters contain hundreds of thousands of elements each, so execution is slow: each iteration over dicts also requires a full iteration over letters.
Is there a more optimal way of doing this in Python?
Your code dedup_rows = [row for row in dicts if row['three'][3] not in letters] has quadratic complexity, since it iterates over letters once for each element of dicts.
If both of your lists contain a large number of elements, you should consider a data structure with O(1) lookup time. For your case, Python sets are perfect.
All you need to do is convert letters = ['q', 'w', 'e', 'r', 't', 'y'] to a set with set(letters) and test membership with x in letters_set:
dicts = [
    {'one': 'hello',
     'two': 'world',
     'three': ['a', 'b', 'c', 'd'],
     'four': 'foo'
    },
    {'one': 'pizza',
     'two': 'cake',
     'three': ['f', 'g', 'h', 'e'],
     'four': 'bar'
    }
]
letters = ['q', 'w', 'e', 'r', 't', 'y']
letters_set = set(letters)
dedup_rows = [row for row in dicts if row['three'][3] not in letters_set]
This changes the algorithm from O(n²) to O(n).
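To see the difference for yourself, here is a rough micro-benchmark (the data and sizes are invented for illustration; every lookup is a worst-case miss, which is the scenario that hurts list membership most):

```python
import timeit

# Hypothetical pool of 100,000 values, mirroring the size described
# in the question.
letters = [f"x{i}" for i in range(100_000)]
letters_set = set(letters)

# "zz" is never present, so the list scan must walk all 100,000
# elements, while the set answers from its hash table directly.
list_time = timeit.timeit(lambda: "zz" in letters, number=100)
set_time = timeit.timeit(lambda: "zz" in letters_set, number=100)

print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.6f}s")
```

The one-time cost of building the set is linear in len(letters) and is quickly repaid once you perform more than a handful of lookups.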
If you are really dealing with hundreds of thousands of records, and a letters list with hundreds of thousands of values, then perhaps a pure in-memory Python approach is not the best way forward.
There are a few things you can do that will improve performance, but in general this begs the question: where are you getting these records from?
If they are stored in any kind of database, then performing a query at source that rules out the rows you don't want, and provides a cursor to iterate through the rows you do want in a memory-efficient way, sounds like a better way to go.
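As a sketch of that idea, here is the same filter pushed into SQLite. The schema and column names are invented for illustration: since SQL columns don't hold Python lists, the relevant fourth letter is stored directly in a fourth_letter column.

```python
import sqlite3

# Hypothetical in-memory database standing in for wherever the
# records actually live.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rows (one TEXT, two TEXT, fourth_letter TEXT, four TEXT)"
)
conn.executemany(
    "INSERT INTO rows VALUES (?, ?, ?, ?)",
    [("hello", "world", "d", "foo"), ("pizza", "cake", "e", "bar")],
)
conn.execute("CREATE TABLE letters (letter TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO letters VALUES (?)",
    [(l,) for l in ["q", "w", "e", "r", "t", "y"]],
)

# The database rules out the unwanted rows at source; the cursor lets
# you iterate over the survivors without materialising them all at once.
cursor = conn.execute(
    "SELECT one, two, fourth_letter, four FROM rows "
    "WHERE fourth_letter NOT IN (SELECT letter FROM letters)"
)
for row in cursor:
    print(row)  # only ('hello', 'world', 'd', 'foo') survives
```

With an index on the filter column (the PRIMARY KEY on letters provides one here), the database does the set-style lookup for you.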