
The fastest way to remove items that match a substring from a list - Python

What is the fastest way to remove the items in a list that match any substring from a set?

For example,

the_list = [
 'Donald John Trump (born June 14, 1946) is an American businessman, television personality',
 'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
 'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
 'Trumps career',
 'branding efforts',
 'personal life',
 'and outspoken manner have made him a celebrity.',
 'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
 'While still attending college he worked for his fathers firm',
 'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
 'and in 1971 was given control, renaming the company The Trump Organization.',
 'Since then he has built hotels',
 'casinos',
 'golf courses',
 'and other properties',
 'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure']

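The real list is much longer than this (millions of string elements), and I want to remove every element that contains any of the strings in a set such as: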

{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"} 

What is the fastest way to do this? Is a plain loop the fastest?

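If your strings are already in memory, use a list comprehension (here set_of_words is the set of substrings from the question):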

new = [line for line in the_list if not any(item in line for item in set_of_words)]
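
A minimal runnable sketch of the comprehension above, using a shortened sample of the question's data. The helper name remove_matching is hypothetical and only wraps the one-liner for reuse; it is not part of the original answer.

```python
def remove_matching(lines, substrings):
    """Return the lines that contain none of the given substrings."""
    # Materialize once so any iterable (set, list, generator) can be passed in.
    substrings = tuple(substrings)
    return [line for line in lines if not any(s in line for s in substrings)]


# Shortened sample of the question's list.
the_list = [
    'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
    'Trumps career',
    'branding efforts',
    'and in 1971 was given control, renaming the company The Trump Organization.',
    'casinos',
]
set_of_words = {"Donald Trump", "Trump Organization", "Donald J. Trump",
                "D.J. Trump", "dump", "dd"}

new = remove_matching(the_list, set_of_words)
print(new)  # ['Trumps career', 'branding efforts', 'casinos']
```

Because any() stops at the first substring that matches, a line is rejected as soon as one hit is found. A line that matches nothing is still checked against every substring.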

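If they are not already in memory, a generator expression is the more memory-efficient option, because it yields the surviving lines one at a time instead of building a second list: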

new = (line for line in the_list if not any(item in line for item in set_of_words))
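
As a rough illustration of the generator variant, the sketch below filters lines lazily as they are read. The io.StringIO object stands in for a real file, and the name iter_kept is hypothetical; both are assumptions made only for this example.

```python
import io


def iter_kept(lines, substrings):
    """Lazily yield the lines that contain none of the given substrings."""
    substrings = tuple(substrings)
    return (line for line in lines if not any(s in line for s in substrings))


# Stand-in for a large file opened with open(...); lines are read on demand.
source = io.StringIO("Trumps career\nThe Trump Organization\ngolf courses\n")
stripped = (line.rstrip("\n") for line in source)

kept = iter_kept(stripped, {"Trump Organization", "dump"})
first = next(kept)   # only as much input as needed has been read so far
rest = list(kept)    # consumes the remainder; the generator is now exhausted
print(first, rest)
```

A generator can be iterated only once, so wrap it in list(...) if the result is needed more than once.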

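The Aho-Corasick algorithm was designed specifically for this task. Its clear advantage is a much lower time complexity than the nested loops: O(n + m) instead of O(n * m), where n is the number of strings to look for and m is the number of strings to search. (More precisely, Aho-Corasick runs in time linear in the total length of the patterns plus the total length of the text searched, since every character of the text is examined once regardless of how many patterns there are.)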

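There is a good Python implementation of Aho-Corasick with an accompanying explanation. The Python Package Index also has a few implementations, but I have not looked at them.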
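
To make the idea concrete, here is a small hand-rolled pure-Python sketch of an Aho-Corasick automaton. It is not the implementation the answer refers to, and the names build_automaton and contains_any are hypothetical. It only reports whether a line contains any pattern, which is all the filtering task needs.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and terminal flags for the patterns."""
    goto = [{}]          # goto[state] maps a character to the next state
    fail = [0]           # failure link of each state; state 0 is the root
    terminal = [False]   # True if some pattern ends at (a suffix of) the state
    for pattern in patterns:
        if not pattern:
            continue     # an empty pattern would match every line; skip it
        state = 0
        for ch in pattern:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                terminal.append(False)
            state = nxt
        terminal[state] = True

    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            terminal[nxt] = terminal[nxt] or terminal[fail[nxt]]
            queue.append(nxt)
    return goto, fail, terminal


def contains_any(line, automaton):
    """Return True if the line contains at least one of the patterns."""
    goto, fail, terminal = automaton
    state = 0
    for ch in line:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if terminal[state]:
            return True
    return False


patterns = {"Donald Trump", "Trump Organization", "Donald J. Trump",
            "D.J. Trump", "dump", "dd"}
lines = ['Trumps career',
         'renaming the company The Trump Organization.',
         'golf courses']

automaton = build_automaton(patterns)   # built once, reused for every line
kept = [line for line in lines if not contains_any(line, automaton)]
print(kept)  # ['Trumps career', 'golf courses']
```

Each line is scanned once no matter how many patterns there are. In CPython, though, a per-character Python loop like this can be slower than the built-in `in` operator when the pattern set is small, so it is worth timing both approaches on real data.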

