The fastest way to remove items that match a substring from a list - Python

What is the fastest way to remove the items in a list that contain any of the substrings in a set?

For example,

the_list = ['Donald John Trump (born June 14, 1946) is an American businessman, television personality',
 'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
 'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
 'Trumps career',
 'branding efforts',
 'personal life',
 'and outspoken manner have made him a celebrity.',
 'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
 'While still attending college he worked for his fathers firm',
 'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
 'and in 1971 was given control, renaming the company The Trump Organization.',
 'Since then he has built hotels',
 'casinos',
 'golf courses',
 'and other properties',
 'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure']

The list is actually a lot longer than this (millions of string elements), and I'd like to remove every element that contains any of the strings in the set, for example,

{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"} 

What would be the fastest way? Is looping through the list the fastest?

Use a list comprehension if you already have your strings in memory:

new = [line for line in the_list if not any(item in line for item in set_of_words)]

If you don't have them all in memory, or you want to keep memory use down, a generator expression is the more memory-efficient approach, because it yields the surviving lines one at a time instead of building a second list:

new = (line for line in the_list if not any(item in line for item in set_of_words))
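As a rough illustration of the two snippets above, here is a minimal self-contained sketch. The helper name `keep_lines` and the shortened `sample` list are hypothetical, introduced only for this example; the filtering expression itself is the one from the answer.

```python
set_of_words = {"Donald Trump", "Trump Organization", "Donald J. Trump",
                "D.J. Trump", "dump", "dd"}

# Hypothetical shortened sample; the real list has millions of elements.
sample = [
    "He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.",
    "Trumps career",
    "branding efforts",
    "and in 1971 was given control, renaming the company The Trump Organization.",
    "casinos",
]

def keep_lines(lines, words):
    """Lazily yield the lines that contain none of the given substrings."""
    words = tuple(words)  # fix the iteration order once, outside the loop
    return (line for line in lines if not any(w in line for w in words))

# Materialise the generator only when a list is really needed.
filtered = list(keep_lines(sample, set_of_words))
print(filtered)
```

Because `keep_lines` accepts any iterable, the same call should also work on an open file object, so the lines never all have to sit in memory at once.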

The Aho-Corasick algorithm was designed for exactly this task. Its distinct advantage is a much lower time complexity: roughly O(n + m) instead of the O(n*m) of nested loops, where n measures the strings to find and m the strings to be searched (strictly speaking, their total lengths rather than their counts).

There is a good Python implementation of Aho-Corasick with an accompanying explanation. There are also a couple of implementations on the Python Package Index, but I have not looked at them.
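To make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick style matcher. It is not the implementation referred to above, and the names `build_automaton` and `contains_any` are hypothetical. It only answers "does this line contain any of the patterns?", which is all the filtering task needs.

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links and a per-state 'match ends here' flag."""
    goto = [{}]         # goto[state][char] -> next state
    fail = [0]          # failure link for each state
    terminal = [False]  # True if some pattern ends at this state
    for pattern in patterns:
        if not pattern:
            continue  # ignore empty patterns
        state = 0
        for ch in pattern:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                terminal.append(False)
            state = nxt
        terminal[state] = True
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also matches if its longest proper suffix state matches.
            terminal[nxt] = terminal[nxt] or terminal[fail[nxt]]
    return goto, fail, terminal

def contains_any(text, automaton):
    """Return True as soon as any pattern is found in text."""
    goto, fail, terminal = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if terminal[state]:
            return True
    return False

set_of_words = {"Donald Trump", "Trump Organization", "Donald J. Trump",
                "D.J. Trump", "dump", "dd"}
# Hypothetical shortened sample standing in for the_list.
sample = ["Trumps career",
          "renaming the company The Trump Organization.",
          "golf courses"]

automaton = build_automaton(set_of_words)
new = [line for line in sample if not contains_any(line, automaton)]
print(new)
```

The automaton is built once and each line is then scanned in a single pass, regardless of how many patterns there are. For a handful of short patterns, the plain `in` test may still be faster in practice because it runs in C, so it is worth timing both on real data.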

