Remove a string that is a substring of another string in the list WITHOUT changing the original order of the list?
I have a list.
the_list = ['Donald Trump has', 'Donald Trump has small fingers', 'What is going on?']
I'd like to remove "Donald Trump has" from the_list because it's a substring of another list element.
Here is the important part: I want to do this without distorting the order of the original list.
The function I have (below) distorts the order of the original list, because it first sorts the list items by length.
def substr_sieve(list_of_strings):
    dups_removed = list_of_strings[:]
    for i in xrange(len(list_of_strings)):
        list_of_strings.sort(key = lambda s: len(s))
        j=0
        j=i+1
        while j <= len(list_of_strings)-1:
            if list_of_strings[i] in list_of_strings[j]:
                try:
                    dups_removed.remove(list_of_strings[i])
                except:
                    pass
            j+=1
    return dups_removed
A simple solution.
But first, let's also add 'Donald Trump', 'Donald' and 'Trump' at the end to make it a better test case.
>>> forbidden_text = "\nX08y6\n" # choose a text that will hardly appear in any sensible string
>>> the_list = ['Donald Trump has', 'Donald Trump has small fingers', 'What is going on?',
...             'Donald Trump', 'Donald', 'Trump']
>>> new_list = [item for item in the_list if forbidden_text.join(the_list).count(item) == 1]
>>> new_list
['Donald Trump has small fingers', 'What is going on?']
Logic:
forbidden_text.join(the_list).count(item) == 1
The list is joined into one long string, and an item is kept only if it occurs exactly once in that string, that is, only as itself.
str.count(sub[, start[, end]])
Return the number of non-overlapping occurrences of substring sub in the range [start, end]. Optional arguments start and end are interpreted as in slice notation.
forbidden_text is used instead of "" (an empty string) to handle a case like this:
>>> the_list = ['DonaldTrump', 'Donald', 'Trump']
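To make the separator's role concrete, here is a small sketch of what happens with that list. The names joined_blank and joined_safe are only illustrative, and the sketch assumes that forbidden_text does not occur inside any list item.

```python
the_list = ['DonaldTrump', 'Donald', 'Trump']

# With an empty separator the items run together, so 'Donald' + 'Trump'
# forms a second, spurious copy of 'DonaldTrump'.
joined_blank = "".join(the_list)
print(joined_blank)                       # DonaldTrumpDonaldTrump
print(joined_blank.count('DonaldTrump'))  # 2, so the item would be dropped

# With the separator in between, no match can span two items.
forbidden_text = "\nX08y6\n"
joined_safe = forbidden_text.join(the_list)
print(joined_safe.count('DonaldTrump'))   # 1, so the item is kept
```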
As Nishant correctly pointed out, the list comprehension above fails for the_list = ['Donald', 'Donald']: each copy is counted twice in the joined string, so both are dropped.
Using set(the_list) instead of the_list solves the problem.
>>> new_list = [item for item in the_list if forbidden_text.join(set(the_list)).count(item) == 1]
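The one-liner rebuilds the joined string for every item. A possible way to package the same idea, joining only once, is sketched below. The function name remove_substrings and its separator parameter are made up for this illustration. It assumes the separator does not appear in any item, that no item is itself a piece of the separator (for example 'X'), and that there are no empty strings in the list.

```python
def remove_substrings(strings, separator="\nX08y6\n"):
    """Keep the items that are not substrings of any other distinct item."""
    # dict.fromkeys drops duplicates (like set) but keeps first-seen order;
    # the join is done once instead of once per item.
    joined = separator.join(dict.fromkeys(strings))
    # An item that occurs exactly once appears only as itself.
    return [s for s in strings if joined.count(s) == 1]


the_list = ['Donald Trump has', 'Donald Trump has small fingers',
            'What is going on?', 'Donald Trump', 'Donald', 'Trump']
print(remove_substrings(the_list))
# ['Donald Trump has small fingers', 'What is going on?']
print(remove_substrings(['Donald', 'Donald']))
# ['Donald', 'Donald']
```

Exact duplicates are both kept here, which matches the behaviour of the set(the_list) version.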
You can do this without sorting:
the_list = ['Donald Trump has', "I've heard Donald Trump has small fingers",
            'What is going on?']

def winnow(a_list):
    keep = set()
    for item in a_list:
        if not any(item in other for other in a_list if item != other):
            keep.add(item)
    return [item for item in a_list if item in keep]

winnow(the_list)
Sorting may allow fewer comparisons overall, but that seems highly data-dependent and could be a premature optimization.
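For comparison, this is roughly what a sorted variant could look like. It is only a sketch: winnow_sorted is a hypothetical name, the sort is done on a separate copy so the original order is untouched, and the worst case is still quadratic. Each string is compared only against strings that are at least as long, since a string cannot be contained in a shorter one.

```python
def winnow_sorted(a_list):
    # Distinct strings, longest first. The input list itself is not reordered.
    by_length = sorted(set(a_list), key=len, reverse=True)
    keep = set()
    for i, item in enumerate(by_length):
        # Strings before position i are at least as long as item; distinct
        # strings of equal length can never contain each other.
        if not any(item in longer for longer in by_length[:i]):
            keep.add(item)
    # Filter the original list so the original order is preserved.
    return [item for item in a_list if item in keep]


the_list = ['Donald Trump has', "I've heard Donald Trump has small fingers",
            'What is going on?']
print(winnow_sorted(the_list))
# ["I've heard Donald Trump has small fingers", 'What is going on?']
```

Whether this actually saves time depends on the data, as noted above.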
You can just recursively reduce the items.
Algorithm:
Loop over each item by popping it and decide whether it needs to be kept. Call the same function recursively with the reduced list. The base condition is that the list has at least one item (or two?).
Efficiency: It might not be the most efficient. I think some divide-and-conquer method would be more apt?
the_list = ['Donald Trump has', 'Donald Trump has small fingers',
            'What is going on?']
final_list = []

def remove_or_append(input):
    # Note: pops from the list passed in, so the_list is empty afterwards.
    if len(input):
        first_value = input.pop(0)
        found = False
        # Is it a substring of any item still waiting to be processed?
        for each in input:
            if first_value in each:
                found = True
                break
            else:
                continue
        # Is it a substring of any item already kept?
        for each in final_list:
            if first_value in each:
                found = True
                break
            else:
                continue
        if not found:
            final_list.append(first_value)
        remove_or_append(input)

remove_or_append(the_list)
print(final_list)
A slightly different version is:
# item is an (index, string) pair; the index lets the string skip itself.
def substring_of_anything_else(item, list):
    for idx, each in enumerate(list):
        if idx == item[0]:
            continue
        else:
            if item[1] in each:
                return True
    return False

# If remove_or_append above was run first, the_list is now empty; redefine it.
final_list = [item for idx, item in enumerate(the_list)
              if not substring_of_anything_else((idx, item), the_list)]
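The recursive remove_or_append makes one call per list item, so a list longer than Python's recursion limit (about a thousand frames by default in CPython) would raise RecursionError, and it also empties the list it is given. A loop-based sketch of the same pop-and-check idea, under the hypothetical name remove_substrings_iterative, might look like this:

```python
def remove_substrings_iterative(strings):
    kept = []
    for idx, candidate in enumerate(strings):
        # Same two checks as the recursive version: the items that come
        # later in the list, and the items that have already been kept.
        in_later = any(candidate in other for other in strings[idx + 1:])
        in_kept = any(candidate in other for other in kept)
        if not (in_later or in_kept):
            kept.append(candidate)
    return kept


the_list = ['Donald Trump has', 'Donald Trump has small fingers',
            'What is going on?']
print(remove_substrings_iterative(the_list))
# ['Donald Trump has small fingers', 'What is going on?']
print(the_list)  # unchanged: the input is not mutated
```

Like the recursive version, this keeps a single copy of an exact duplicate (the last one), whereas the index-based version drops both copies.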