
Delete elements from dicts in list if value of dict key is in otherlist

I have code like the following:

dicts = [
        {'one': 'hello',
         'two': 'world',
         'three': ['a', 'b', 'c', 'd'],
         'four': 'foo'
        },
        {'one': 'pizza',
         'two': 'cake',
         'three': ['f', 'g', 'h', 'e'],
         'four': 'bar'
        }
       ]

letters = ['q', 'w', 'e', 'r','t','y']

dedup_rows = [row for row in dicts if row['three'][3] not in letters]

The objective is that dedup_rows should contain the elements of dicts for which the fourth element of the list stored under three is not in the list letters. Essentially: delete row from dicts if row['three'][3] in letters. The output from the above code would be:

dedup_rows: [
             {'one': 'hello',
              'two': 'world',
              'three': ['a', 'b', 'c', 'd'],
              'four': 'foo'
             }
            ]

The code works, but in practice both dicts and letters contain hundreds of thousands of elements each, so execution is slow: each iteration over dicts also requires a full scan of letters.

Is there a more efficient way of doing this in Python?

Your code dedup_rows = [row for row in dicts if row['three'][3] not in letters] has quadratic complexity, since for each element of dicts it also scans letters.
If both of your lists contain a large number of elements, you should consider a data structure whose lookup time is constant on average. For your case, Python sets are perfect. You can read more about them.
All you need to do is convert letters = ['q', 'w', 'e', 'r','t','y'] to a set with set(letters) and test membership with x in letters_set.

dicts = [
    {'one': 'hello',
     'two': 'world',
     'three': ['a', 'b', 'c', 'd'],
     'four': 'foo'
    },
    {'one': 'pizza',
     'two': 'cake',
     'three': ['f', 'g', 'h', 'e'],
     'four': 'bar'
    }
   ]

letters = ['q', 'w', 'e', 'r','t','y']
letters_set = set(letters)

dedup_rows = [row for row in dicts if row['three'][3] not in letters_set]

In this way you can change the algorithm from order n squared to order n, since each set lookup takes constant time on average.
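A rough way to see the difference is to time both membership tests. The sketch below uses made-up data: the sizes, the make_rows helper, and the generated lookup values are illustrative assumptions, not taken from the question, and the timings will vary by machine.

```python
import random
import string
import timeit


def make_rows(n, seed=0):
    """Build n hypothetical rows shaped like the ones in the question."""
    rng = random.Random(seed)
    return [
        {'one': 'x',
         'two': 'y',
         'three': [rng.choice(string.ascii_lowercase) + str(rng.randrange(n))
                   for _ in range(4)],
         'four': 'z'}
        for _ in range(n)
    ]


rows = make_rows(1000)
# Hypothetical lookup values: 2000 unique strings such as 'q0', 'w0', 'q1', ...
letters = [c + str(i) for i in range(1000) for c in 'qw']
letters_set = set(letters)  # built once, then reused for every row

# A list is scanned element by element; a set is looked up by hash.
with_list = [row for row in rows if row['three'][3] not in letters]
with_set = [row for row in rows if row['three'][3] not in letters_set]

# Both filters keep exactly the same rows; only the lookup cost differs.
assert with_list == with_set

list_time = timeit.timeit(
    lambda: [r for r in rows if r['three'][3] not in letters], number=3)
set_time = timeit.timeit(
    lambda: [r for r in rows if r['three'][3] not in letters_set], number=3)
print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.4f}s")
```

Note that the set is built once, outside the comprehension. Calling set(letters) inside the condition would rebuild it for every row and lose the benefit.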

If you are really dealing with hundreds of thousands of records, with rows holding hundreds of thousands of values each, then perhaps a pure in-memory Python approach is not the best way forward.

There are a few things you can do that will improve performance:

  • Stream in records from your source (file? database?) instead of loading them all at once
  • Use a generator which reads the records one at a time and yields only those that pass the test (never keep them all in a list)
  • Use sets for the membership comparisons, which will be a lot faster for many values
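The points above could be combined roughly as follows. This is only a sketch: it assumes the records arrive as one JSON object per line, which the question does not state, and iter_kept_rows is a hypothetical helper name. An in-memory io.StringIO stands in for a real file.

```python
import io
import json


def iter_kept_rows(lines, letters_set, key='three', index=3):
    """Yield rows whose row[key][index] is not in letters_set, one at a time.

    `lines` is any iterable of JSON strings (for example an open file),
    so the full data set never has to be held in memory.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        if row[key][index] not in letters_set:
            yield row


# Stand-in for a real source; the JSON-lines layout is an assumption.
source = io.StringIO(
    '{"one": "hello", "two": "world", "three": ["a", "b", "c", "d"], "four": "foo"}\n'
    '{"one": "pizza", "two": "cake", "three": ["f", "g", "h", "e"], "four": "bar"}\n'
)
letters_set = {'q', 'w', 'e', 'r', 't', 'y'}

# list() is used here only to show the result; real code would consume
# the generator row by row instead of materialising it.
kept = list(iter_kept_rows(source, letters_set))
print(kept)
```

Because the function is a generator, a caller can write the kept rows straight to an output file or pass them to the next processing step without building an intermediate list.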

In general though, this raises the question of where you are getting these records from.

If they are stored in any kind of database, then performing a query at the source that rules out the rows you don't want, and provides a cursor to iterate through the rows you do want in a memory-efficient way, sounds like a better way to go.
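As an illustration of filtering at the source, here is a minimal sketch using Python's built-in sqlite3 module. The schema is entirely hypothetical: it assumes a records table with the fourth element of three stored in its own column (named fourth here) and a letters table holding the values to exclude. A real database would need its own query.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # hypothetical database, for illustration only
conn.execute('CREATE TABLE records (one TEXT, two TEXT, fourth TEXT, four TEXT)')
conn.execute('CREATE TABLE letters (letter TEXT PRIMARY KEY)')
conn.executemany(
    'INSERT INTO records VALUES (?, ?, ?, ?)',
    [('hello', 'world', 'd', 'foo'), ('pizza', 'cake', 'e', 'bar')],
)
conn.executemany(
    'INSERT INTO letters VALUES (?)',
    [(c,) for c in ['q', 'w', 'e', 'r', 't', 'y']],
)

# Let the database exclude the unwanted rows. The primary key on
# letters.letter gives it an index to use for the lookup.
cursor = conn.execute(
    'SELECT one, two, fourth, four FROM records '
    'WHERE NOT EXISTS (SELECT 1 FROM letters WHERE letter = records.fourth)'
)

# Iterating the cursor fetches rows as needed; they are collected into
# a list here only so the result can be printed.
kept = [row for row in cursor]
conn.close()
print(kept)
```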
