
Removing values in a dictionary with a list of words (Python)

Let's say I have a list of words and a dictionary of lists:

 nottastyfruits = ['grape', 'orange', 'durian', 'pear']

 fruitGroup = {'001': ['grape','apple', 'jackfruit', 'orange', 'Longan'],
               '002': ['apple', 'watermelon', 'pear']}

I want to go through all the keys in the dictionary and remove every word that appears in the nottastyfruits list.

My current code is:

finalfruits = {}
for key, value in fruitGroup.items():
    fruits = []
    for fruit in value:
        if fruit not in nottastyfruits:
            fruits.append(fruit)
    finalfruits[key] = (fruits)

This takes a long time to run on large inputs, such as when preprocessing a large body of text. Is there a faster, more efficient way to do this?

Thank you for your time.

You should make a set out of your list of unwanted fruits to speed up the lookups, then use a dictionary comprehension:

nottastyfruits = {'grape', 'orange', 'durian', 'pear'}

fruitGroup = {'001': ['grape','apple', 'jackfruit', 'orange', 'Longan'],
           '002': ['apple', 'watermelon', 'pear']}

# items() and print() work on both Python 2 and Python 3
print({k: [i for i in v if i not in nottastyfruits] for k, v in fruitGroup.items()})

>>> {'001': ['apple', 'jackfruit', 'Longan'], '002': ['apple', 'watermelon']}

Flattening the nested loops into a dictionary comprehension avoids some of the overhead of the explicit for loop, such as the repeated append calls.

Making nottastyfruits a set will decrease lookup time:

nottastyfruits  = set(nottastyfruits)
finalfruits = {k: [f for f in v if f not in nottastyfruits] for k, v in fruitGroup.items()}
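A rough way to see why the set helps: a membership test on a list scans the elements one by one, while a set uses hashing. The sketch below assumes a hypothetical exclusion list of 10,000 made-up words (not from the question) and times both kinds of lookup. Exact numbers depend on the machine, so none are shown.

```python
from timeit import timeit

# Hypothetical data: 10,000 made-up words standing in for a large exclusion list.
words_list = ['word%d' % i for i in range(10000)]
words_set = set(words_list)

# Worst case for the list: the word we look for sits at the very end.
target = 'word9999'

list_time = timeit(lambda: target in words_list, number=1000)
set_time = timeit(lambda: target in words_set, number=1000)

print('list lookup: %.4f s' % list_time)
print('set lookup:  %.4f s' % set_time)
```

With a list of only four words, as in the question, the gap is much smaller, so the benefit mostly shows up when the exclusion list is long.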

One low-hanging fruit, if you will, is to make nottastyfruits a set. Also, you can use comprehensions to squeeze out some more performance.

In [1]: fruitGroup = {'001': ['grape','apple', 'jackfruit', 'orange', 'Longan'],
   ...:                '002': ['apple', 'watermelon', 'pear']
   ...:               }

In [2]: nottastyfruit = {'grape', 'orange', 'durian', 'pear'}

In [3]: finalfruits = {k:[f for f in v if f not in nottastyfruit] for k,v in fruitGroup.items()}

In [4]: finalfruits
Out[4]: {'001': ['apple', 'jackfruit', 'Longan'], '002': ['apple', 'watermelon']}

Since both nottastyfruits and the lists in the dictionary are flat lists, you can use sets to get the difference between the two.

nottastyfruits = {'orange', 'pear', 'grape', 'durian'}
fruitGroup = {'001': ['grape','apple', 'jackfruit', 'orange', 'Longan'], '002': ['apple', 'watermelon', 'pear'] }

for key, value in fruitGroup.items():
    # Reassigning existing keys while iterating is safe: the dict's size does not change
    fruitGroup[key] = list(set(value).difference(nottastyfruits))

# Element order within each list may vary, since sets are unordered
print(fruitGroup) # Prints e.g. "{'001': ['jackfruit', 'apple', 'Longan'], '002': ['watermelon', 'apple']}"
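One thing to keep in mind with the set-difference approach is that converting a list to a set discards both the original order and any repeated items. The sketch below uses a hypothetical `basket` list (not from the question) with a duplicated entry to make the difference from the comprehension visible.

```python
nottastyfruits = {'grape', 'orange', 'durian', 'pear'}

# Hypothetical basket with a repeated item, to make the difference visible.
basket = ['apple', 'grape', 'watermelon', 'apple', 'pear']

# Set difference: duplicates collapse and the order is arbitrary.
via_set = list(set(basket).difference(nottastyfruits))

# Comprehension: order and duplicates are kept.
via_comprehension = [f for f in basket if f not in nottastyfruits]

print(sorted(via_set))      # ['apple', 'watermelon']
print(via_comprehension)    # ['apple', 'watermelon', 'apple']
```

If the lists represent tokenized text, where order and repetition usually matter, the comprehension is probably the safer choice.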

Below is a benchmark of the different proposed solutions, plus a solution based on the filter() function:

from timeit import timeit


nottastyfruits = ['grape', 'orange', 'durian', 'pear']

fruitGroup = {'001': ['grape','apple', 'jackfruit', 'orange', 'Longan'],
              '002': ['apple', 'watermelon', 'pear']}


def fruit_filter_original(fruit_groups, not_tasty_fruits):
    final_fruits = {}
    for key, value in fruit_groups.items():
        fruits = []
        for fruit in value:
            if fruit not in not_tasty_fruits:
                fruits.append(fruit)
        final_fruits[key] = (fruits)
    return final_fruits


def fruit_filter_comprehension(fruit_groups, not_tasty_fruits):
    return {group: [fruit for fruit in fruits
                         if fruit not in not_tasty_fruits]
            for group, fruits in fruit_groups.items()}


def fruit_filter_set_comprehension(fruit_groups, not_tasty_fruits):
    not_tasty_fruits = set(not_tasty_fruits)
    return {group: [fruit for fruit in fruits
                         if fruit not in not_tasty_fruits]
            for group, fruits in fruit_groups.items()}


def fruit_filter_set(fruit_groups, not_tasty_fruits):
    return {group: list(set(fruits).difference(not_tasty_fruits))
            for group, fruits in fruit_groups.items()}


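def fruit_filter_filter(fruit_groups, not_tasty_fruits):
    # Note: on Python 3, filter() returns a lazy iterator rather than a list;
    # wrap it in list() there to materialize the result.
    return {group: filter(lambda fruit: fruit not in not_tasty_fruits, fruits)
            for group, fruits in fruit_groups.items()}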


print(fruit_filter_original(fruitGroup, nottastyfruits))
print(fruit_filter_comprehension(fruitGroup, nottastyfruits))
print(fruit_filter_set_comprehension(fruitGroup, nottastyfruits))
print(fruit_filter_set(fruitGroup, nottastyfruits))
print(fruit_filter_filter(fruitGroup, nottastyfruits))


print(timeit("fruit_filter_original(fruitGroup, nottastyfruits)", number=100000,
      setup="from __main__ import fruit_filter_original, fruitGroup, nottastyfruits"))
print(timeit("fruit_filter_comprehension(fruitGroup, nottastyfruits)", number=100000,
      setup="from __main__ import fruit_filter_comprehension, fruitGroup, nottastyfruits"))
print(timeit("fruit_filter_set_comprehension(fruitGroup, nottastyfruits)", number=100000,
      setup="from __main__ import fruit_filter_set_comprehension, fruitGroup, nottastyfruits"))
print(timeit("fruit_filter_set(fruitGroup, nottastyfruits)", number=100000,
      setup="from __main__ import fruit_filter_set, fruitGroup, nottastyfruits"))
print(timeit("fruit_filter_filter(fruitGroup, nottastyfruits)", number=100000,
      setup="from __main__ import fruit_filter_filter, fruitGroup, nottastyfruits"))

We can see that the solutions are not equal in terms of performance:

{'001': ['apple', 'jackfruit', 'Longan'], '002': ['apple', 'watermelon']}
{'001': ['apple', 'jackfruit', 'Longan'], '002': ['apple', 'watermelon']}
{'001': ['apple', 'jackfruit', 'Longan'], '002': ['apple', 'watermelon']}
{'001': ['jackfruit', 'apple', 'Longan'], '002': ['watermelon', 'apple']}
{'001': ['apple', 'jackfruit', 'Longan'], '002': ['apple', 'watermelon']}
2.57386991159  # fruit_filter_original
2.36822144247  # fruit_filter_comprehension
2.46125930873  # fruit_filter_set_comprehension
4.09036626702  # fruit_filter_set
3.76554637862  # fruit_filter_filter

The comprehension-based solution is the fastest, but it is not a very significant improvement over the original code (with the given data, at least). The set comprehension solution is also a small improvement over the original. The solutions based on the filter function and on set difference are noticeably slower.

Conclusion: If you are looking for performance, the solutions from Moses Koledoye and juanpa.arrivillaga seem to be better. However, the results could be different with bigger data, so it would be a good idea to run the test with real data.
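As a starting point for such a test, here is a minimal sketch that compares the list-based and set-based comprehensions on larger synthetic data. The vocabulary, the sizes, and the function names `filter_with_list` and `filter_with_set` are assumptions chosen for illustration. They should be replaced with something resembling the real workload, and no timings are quoted because they vary by machine.

```python
import random
from timeit import timeit

random.seed(0)  # fixed seed so runs are repeatable

# Hypothetical vocabulary and sizes; adjust them to resemble the real data.
vocabulary = ['word%d' % i for i in range(2000)]
not_tasty = random.sample(vocabulary, 500)
groups = {'%03d' % k: [random.choice(vocabulary) for _ in range(200)]
          for k in range(20)}


def filter_with_list(fruit_groups, not_tasty_fruits):
    # Membership is tested against the list directly (linear scan per lookup).
    return {group: [fruit for fruit in fruits if fruit not in not_tasty_fruits]
            for group, fruits in fruit_groups.items()}


def filter_with_set(fruit_groups, not_tasty_fruits):
    # The list is converted once to a set, then each lookup is hash-based.
    not_tasty_fruits = set(not_tasty_fruits)
    return {group: [fruit for fruit in fruits if fruit not in not_tasty_fruits]
            for group, fruits in fruit_groups.items()}


# Both variants must agree before their timings are worth comparing.
assert filter_with_list(groups, not_tasty) == filter_with_set(groups, not_tasty)

print('list:', timeit(lambda: filter_with_list(groups, not_tasty), number=5))
print('set: ', timeit(lambda: filter_with_set(groups, not_tasty), number=5))
```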
