
The fastest way to remove items that match a substring from a list - Python

What is the fastest way to remove the items in a list that contain any of the substrings in a set?

For example,

the_list = [
 'Donald John Trump (born June 14, 1946) is an American businessman, television personality',
 'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
 'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
 'Trumps career',
 'branding efforts',
 'personal life',
 'and outspoken manner have made him a celebrity.',
 'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
 'While still attending college he worked for his fathers firm',
 'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
 'and in 1971 was given control, renaming the company The Trump Organization.',
 'Since then he has built hotels',
 'casinos',
 'golf courses',
 'and other properties',
 'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure',
]

The actual list is much longer than this (millions of string elements), and I'd like to remove every element that contains any of the strings in a set, for example,

{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"} 

What is the fastest way to do this? Is looping through the list the fastest?

Use a list comprehension if you have your strings already in memory:

new = [line for line in the_list if not any(item in line for item in set_of_words)]
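As a rough illustration, the comprehension above can be run against a shortened copy of the question's data. The name `set_of_words` is assumed to hold the set from the question, and `remove_matching` is a hypothetical helper name used only for this sketch:

```python
# Shortened sample of the list from the question.
the_list = [
    'Donald John Trump (born June 14, 1946) is an American businessman, television personality',
    'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
    'Trumps career',
    'and in 1971 was given control, renaming the company The Trump Organization.',
    'casinos',
]

# Assumption: set_of_words is the set of substrings shown in the question.
set_of_words = {"Donald Trump", "Trump Organization", "Donald J. Trump",
                "D.J. Trump", "dump", "dd"}


def remove_matching(lines, substrings):
    """Return the lines that contain none of the given substrings."""
    # any() stops at the first substring found, so a matching line
    # is rejected without testing the remaining substrings.
    return [line for line in lines if not any(s in line for s in substrings)]


new = remove_matching(the_list, set_of_words)
print(new)
```

The two lines containing "Trump Organization" are dropped; the first line is kept because "Donald John Trump" does not contain the substring "Donald Trump".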

If the strings are not already in memory, a generator expression is more memory-efficient, because it yields the surviving lines one at a time instead of building a second list:

new = (line for line in the_list if not any(item in line for item in set_of_words))
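A minimal sketch of how the generator version might be used when the lines come from a file rather than a list. Here `io.StringIO` stands in for an open file, and `filter_lines` is a hypothetical helper name, not something from the original answer:

```python
import io


def filter_lines(lines, substrings):
    """Lazily yield the lines that contain none of the substrings."""
    substrings = tuple(substrings)  # fixed snapshot, cheap to iterate repeatedly
    return (line for line in lines if not any(s in line for s in substrings))


# Assumption: in real use this would be open("some_file.txt"), which also
# yields one line at a time; StringIO keeps the example self-contained.
source = io.StringIO("Trumps career\nThe Trump Organization\ncasinos\n")

kept = []
for line in filter_lines(source, {"Trump Organization", "dump"}):
    # Only one line is held in memory at a time until we choose to store it.
    kept.append(line.rstrip("\n"))

print(kept)
```

Note that a generator can be consumed only once; if the filtered result is needed several times, the list comprehension is the simpler choice.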

The Aho-Corasick algorithm was designed for exactly this task. Its main advantage is a much lower time complexity: O(n+m), compared with roughly O(n*m) for the nested loops, where n is the total length of the strings to find and m is the total length of the text to be searched.

There is a good Python implementation of Aho-Corasick with accompanying explanation. There are also a couple of implementations at the Python Package Index but I've not looked at them.
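To make the idea concrete, here is a compact pure-Python sketch of an Aho-Corasick automaton reduced to a yes/no "does this line contain any pattern" test. It is not the implementation referred to above; the function names `build_automaton` and `contains_any` are made up for this illustration:

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie transitions, failure links and terminal flags."""
    goto = [{}]         # goto[state][char] -> next state
    fail = [0]          # failure link of each state
    terminal = [False]  # True if some pattern ends at this state

    # Phase 1: insert every pattern into a trie.
    for pattern in patterns:
        if not pattern:
            continue  # an empty pattern would match every line
        state = 0
        for ch in pattern:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                terminal.append(False)
            state = nxt
        terminal[state] = True

    # Phase 2: breadth-first pass to compute the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state is also terminal if a shorter pattern ends at its suffix.
            terminal[nxt] = terminal[nxt] or terminal[fail[nxt]]
    return goto, fail, terminal


def contains_any(text, automaton):
    """Return True if text contains at least one of the patterns."""
    goto, fail, terminal = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if terminal[state]:
            return True  # stop at the first match; we only need yes/no
    return False


# Small demo using patterns from the question.
set_of_words = {"Donald Trump", "Trump Organization", "dump", "dd"}
the_list = ['Trumps career', 'renaming the company The Trump Organization.', 'casinos']

automaton = build_automaton(set_of_words)
new = [line for line in the_list if not contains_any(line, automaton)]
print(new)
```

Each line is scanned once, character by character, regardless of how many patterns there are. In pure Python the per-character overhead is high, so for a handful of patterns the `any(...)` version, whose `in` test runs in C, may well be faster; it is worth timing both on real data.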

