How can I remove strings that are substrings of other strings in a list without changing the list's original order?

Question

I have a list.

the_list = ['Donald Trump has', 'Donald Trump has small fingers', 'What is going on?']

I'd like to remove "Donald Trump has" from the_list because it is a substring of another list element.

Here is the important part: I want to do this without distorting the order of the original list.

The function I have (below) distorts the order of the original list, because it first sorts the list items by length.

def substr_sieve(list_of_strings):
    dups_removed = list_of_strings[:]
    for i in range(len(list_of_strings)):
        # This in-place sort reorders the list that was passed in.
        list_of_strings.sort(key=len)
        j = i + 1
        while j <= len(list_of_strings) - 1:
            if list_of_strings[i] in list_of_strings[j]:
                try:
                    dups_removed.remove(list_of_strings[i])
                except ValueError:
                    # The item was already removed on an earlier pass.
                    pass
            j += 1
    return dups_removed

Answer 1

A simple solution.

But first, let's also add 'Donald Trump', 'Donald' and 'Trump' at the end to make it a better test case.

>>> forbidden_text = "\nX08y6\n" # choose a text that will hardly appear in any sensible string
>>> the_list = ['Donald Trump has', 'Donald Trump has small fingers', 'What is going on?',
...             'Donald Trump', 'Donald', 'Trump']
>>> new_list = [item for item in the_list if forbidden_text.join(the_list).count(item) == 1]
>>> new_list
['Donald Trump has small fingers', 'What is going on?']

Logic:

Concatenate all list elements into a single string: forbidden_text.join(the_list).
Check whether each item occurs exactly once in that string: count(item) == 1. If it occurs more than once, it is a substring of another element.

str.count(sub[, start[, end]])

Return the number of non-overlapping occurrences of substring sub in the range [start, end]. Optional arguments start and end are interpreted as in slice notation.
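A quick way to see what "non-overlapping" and the optional start and end arguments mean in practice. The sample strings are made up for illustration and do not come from the question:

```python
# Overlapping matches are not counted twice: "aaaa" holds two separate "aa" runs.
overlap_count = "aaaa".count("aa")

# "ana" fits at index 1 and again at index 3, but those overlap, so only one counts.
banana_count = "banana".count("ana")

# start and end restrict the search the same way a slice would.
count_from_2 = "banana".count("a", 2)      # searches "nana"
count_in_0_3 = "banana".count("a", 0, 3)   # searches "ban"

print(overlap_count, banana_count, count_from_2, count_in_0_3)  # prints: 2 1 2 1
```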

forbidden_text is used instead of "" (the empty string) to handle a case like this:

>>> the_list = ['DonaldTrump', 'Donald', 'Trump']
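To make the problem concrete, here is a small sketch comparing an empty separator with forbidden_text on that list. The variable names naive and safe are only for illustration:

```python
forbidden_text = "\nX08y6\n"
the_list = ['DonaldTrump', 'Donald', 'Trump']

# With "" the joined text is 'DonaldTrumpDonaldTrump', so 'Donald' + 'Trump'
# accidentally recreates 'DonaldTrump' and every item is counted twice.
naive = [item for item in the_list if "".join(the_list).count(item) == 1]

# With a separator, neighbouring items can no longer merge into a false match.
safe = [item for item in the_list if forbidden_text.join(the_list).count(item) == 1]

print(naive)  # prints: []
print(safe)   # prints: ['DonaldTrump']
```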

As Nishant correctly pointed out, the count(item) == 1 check fails for the_list = ['Donald', 'Donald']: each copy is counted as a substring of the other, so both are removed.

Joining set(the_list) instead of the_list solves the problem, because each distinct string then appears only once in the joined text.
>>> new_list = [item for item in the_list if forbidden_text.join(set(the_list)).count(item) == 1]
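Putting the pieces together, a minimal sketch of this approach as a reusable function. The name drop_substrings and the separator parameter are assumptions for illustration, and the function still relies on the separator never appearing in, or combining with, the input strings:

```python
def drop_substrings(strings, separator="\nX08y6\n"):
    """Keep only strings that are not substrings of another, different string."""
    # Join each distinct string once, so exact duplicates are not counted twice.
    # Building the joined text once also avoids re-joining for every item.
    joined = separator.join(set(strings))
    # Iterate over the original list so the original order is preserved.
    return [s for s in strings if joined.count(s) == 1]


the_list = ['Donald Trump has', 'Donald Trump has small fingers', 'What is going on?',
            'Donald Trump', 'Donald', 'Trump']
print(drop_substrings(the_list))
# prints: ['Donald Trump has small fingers', 'What is going on?']
```

Note that with this version exact duplicates are all kept, since they are not substrings of any different string.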

Answer 2

You can do this without sorting:

the_list = ['Donald Trump has', "I've heard Donald Trump has small fingers",
            'What is going on?']

def winnow(a_list):
    keep = set()
    for item in a_list:
        if not any(item in other for other in a_list if item != other):
            keep.add(item)
    return [ item for item in a_list if item in keep ]

winnow(the_list)

Sorting may allow fewer comparisons overall, but that seems highly data-dependent and could be a premature optimization.
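For comparison, here is one way the sorting idea could look while still returning items in their original order. This is only a sketch: the name winnow_by_length is hypothetical, and whether it actually saves work depends on the data, as noted above.

```python
def winnow_by_length(strings):
    """Visit distinct strings longest-first, then rebuild in original order."""
    kept = []
    # A string can only be contained in a different string that is longer,
    # so checking against the already-kept (longer or equal) strings is enough.
    for s in sorted(set(strings), key=len, reverse=True):
        if not any(s in longer for longer in kept):
            kept.append(s)
    kept_set = set(kept)
    # The sort only affects the working copy; the result follows the input order.
    return [s for s in strings if s in kept_set]


the_list = ['Donald Trump has', "I've heard Donald Trump has small fingers",
            'What is going on?']
print(winnow_by_length(the_list))
# prints: ["I've heard Donald Trump has small fingers", 'What is going on?']
```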

Answer 3

You can just recursively reduce the items.

Algorithm:

Loop over the items by popping each one off the list and deciding whether it needs to be kept. Then call the same function recursively with the reduced list. The recursion continues while the list has at least one item (or two?).

Efficiency: it might not be the most efficient. I think some divide-and-conquer method might be more apt?

the_list = ['Donald Trump has', 'Donald Trump has small fingers',
            'What is going on?']

final_list = []

def remove_or_append(items):
    # Note: this empties `items` and collects the survivors in the global final_list.
    if len(items):
        first_value = items.pop(0)
        found = False
        # Is it a substring of any item still waiting to be processed?
        for each in items:
            if first_value in each:
                found = True
                break
        # Or of any item that has already been kept?
        for each in final_list:
            if first_value in each:
                found = True
                break
        if not found:
            final_list.append(first_value)
        remove_or_append(items)

remove_or_append(the_list)

print(final_list)

A slightly different version is:

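def substring_of_anything_else(item, items):
    # item is an (index, string) pair; the index lets us skip comparing it with itself.
    for idx, each in enumerate(items):
        if idx == item[0]:
            continue
        if item[1] in each:
            return True
    # Only report False after every other element has been checked.
    return False

# Assumes the_list still holds the original strings (remove_or_append above empties its argument).
final_list = [item for idx, item in enumerate(the_list)
              if not substring_of_anything_else((idx, item), the_list)]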
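The same index-based check can also be written more compactly with any(). This is a sketch under the assumption that comparing by position is what you want; the name remove_substrings is hypothetical. Because items are skipped by index rather than by value, exact duplicates count as substrings of each other and are all removed:

```python
def remove_substrings(strings):
    """Drop every string that occurs inside the string at any other position."""
    return [s for i, s in enumerate(strings)
            if not any(s in other for j, other in enumerate(strings) if i != j)]


the_list = ['Donald Trump has', 'Donald Trump has small fingers',
            'What is going on?']
print(remove_substrings(the_list))
# prints: ['Donald Trump has small fingers', 'What is going on?']
print(remove_substrings(['Donald', 'Donald']))
# prints: []
```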

3 answers

Answer 1
3 2016-10-08 11:11:48

Answer 2
1 accepted 2016-10-08 09:45:22

Answer 3
0 2016-10-08 09:03:33
