Python: Remove Strings in a List that are contained by at least one other String in the same List

I would like to filter my list of strings as follows: a string should be excluded if at least one other string in the same list is "in" it, that is, occurs in it as a substring. Put differently: a string should be kept only if no other string in the same list is contained in it. The comparison should be case-sensitive, if possible.

To make this clearer, here is an example.

My "first" list, which contains every string:

elements =["tree","TREE","treeforest","water","waterfall"]

After applying the solution, I would like to get this list:

elements = ["tree","TREE","water"]

For example, tree is in treeforest, so treeforest is excluded from my list. The same applies to water and waterfall. However, tree, TREE and water should be kept, because no other string in the list is "in" them.
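The rule above matches the behaviour of Python's in operator on strings, which is a case-sensitive substring test. A minimal illustration (the variable names are made up for this sketch):

```python
# "in" on two strings checks whether the left one occurs inside the right one
is_inside = "tree" in "treeforest"
# the test is case-sensitive, so "tree" does not count as being inside "TREE"
differs_by_case = "tree" in "TREE"

print(is_inside, differs_by_case)  # True False
```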

As I'd like to apply this to a larger list of strings, more efficient solutions are preferred.

Hope this is understandable. Thanks a lot in advance! Any help is highly appreciated.

A fairly optimized function with two nested loops that skips many loop iterations:

def filterlist(l):
    # keep track of elements, which will be deleted
    deletelist = [False for _ in l]

    for i, el in enumerate(l):
        # already in deletelist, jump right to the next el
        if deletelist[i]:
            continue

        for j, el2 in enumerate(l):
            # comparing item to itself or el2 already in deletelist?
            # jump to next el2
            if i == j or deletelist[j]:
                continue

            # the comparison everyone expects
            if el in el2:
                deletelist[j] = True

            # also, check the other way around
            # will save loop iterations later
            elif el2 in el:
                deletelist[i] = True
                break # causes jump to next el

    # create new list, keep elements that are not in deletelist
    return [el for i, el in enumerate(l) if not deletelist[i]]

Built-in functions are usually faster, so let's compare it to Ed Ward's solution:

# result of Ed Ward's solution using timeit:
100000 loops, best of 10: 5.38 usec per loop

# filterlist function with loops using timeit:
100000 loops, best of 10: 4.42 usec per loop

Interesting, but to get a truly representative result, you should run timeit with a larger list of elements.
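One way such a larger comparison could be set up is sketched below. It is only a sketch: make_words, the word count of 300 and the tiny alphabet are assumptions chosen for illustration, filter_with_loops is a condensed copy of filterlist, and filter_with_any wraps the filter/any one-liner that appears in this thread. The words are made unique because the two approaches treat exact duplicates differently.

```python
import random
import timeit

def filter_with_any(elements):
    # filter/any one-liner from this thread, wrapped in a function
    return list(filter(
        lambda item: not any(elem in item for elem in elements if elem != item),
        elements))

def filter_with_loops(l):
    # condensed copy of filterlist, repeated so this block runs on its own
    deletelist = [False] * len(l)
    for i, el in enumerate(l):
        if deletelist[i]:
            continue
        for j, el2 in enumerate(l):
            if i == j or deletelist[j]:
                continue
            if el in el2:
                deletelist[j] = True
            elif el2 in el:
                deletelist[i] = True
                break
    return [el for i, el in enumerate(l) if not deletelist[i]]

def make_words(n, seed=0):
    # hypothetical test data: unique random words over a tiny alphabet,
    # so that substring matches actually occur
    rng = random.Random(seed)
    words = ("".join(rng.choices("abcd", k=rng.randint(2, 8))) for _ in range(n))
    return list(dict.fromkeys(words))

words = make_words(300)
for func in (filter_with_any, filter_with_loops):
    seconds = timeit.timeit(lambda: func(words), number=5)
    print(f"{func.__name__}: {seconds / 5:.6f} s per call")
```

The absolute numbers depend on the machine and on how many words contain each other, so it is worth trying several list sizes before drawing conclusions.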

from copy import deepcopy

def remove_composite_words(e,elements):
  temp = [x for x in elements if e in x]
  temp = set(temp)
  elements = list(set(elements).difference(temp))
  return e,sorted(elements, key=len)

def keep_shortest_root(elements):
  elements = deepcopy(elements)
  elements = list(set(elements))
  elements = sorted(elements, key=len)
  if len(elements[0]) ==0:
    elements = elements[1:]

  results = []
  e = elements[0]
  while elements:
    e,elements = remove_composite_words(e,elements)
    results.append(e)
    if elements:
      e = elements[0]

  return results
  
elements =["tree","TREE","treeforest","water","waterfall",'forestTREE','tree']

keep_shortest_root(elements)  

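This should return the following (the order of equal-length strings such as tree and TREE may vary, because the list passes through a set):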

['tree', 'TREE', 'water']

How it works:

The function remove_composite_words() finds every element of the list that contains the given element e (this includes e itself) and then removes those matches from the list.

So if you have the element 'a' and the list ['a','aa','b','c'], the function will return 'a' and the list ['b','c'].

keep_shortest_root() applies remove_composite_words() to the initial list and then to the transformed list (the output of remove_composite_words()) until no words are left.

Note that keep_shortest_root() first reduces the input list to its unique words and then sorts them by length. Combined with the fact that remove_composite_words() removes the matched words from the list, this makes the algorithm run faster, since the number of comparisons drops with each iteration.
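The same idea (deduplicate, sort by length, then compare each word only against the roots kept so far) can be sketched more compactly. This is not the code above but a variant under stated assumptions: keep_roots is a hypothetical name, empty strings are skipped as in keep_shortest_root(), and dict.fromkeys is used instead of a set so that the result order is reproducible.

```python
def keep_roots(words):
    # dict.fromkeys drops duplicates while keeping first-seen order;
    # empty strings are skipped, mirroring keep_shortest_root()
    unique = [w for w in dict.fromkeys(words) if w]
    roots = []
    # sorted() is stable, so equal-length words keep their input order
    for word in sorted(unique, key=len):
        # any string contained in word is shorter, so it was processed earlier;
        # checking the kept roots is enough to detect a composite word
        if not any(root in word for root in roots):
            roots.append(word)
    return roots

elements = ["tree", "TREE", "treeforest", "water", "waterfall", "forestTREE", "tree"]
print(keep_roots(elements))  # ['tree', 'TREE', 'water']
```

Only the kept roots are compared against each new word, so the number of substring checks shrinks when many words share a root.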

I found a somewhat simpler solution than the ones already provided and thought I'd chip in:

def Remove_Subset(List):
    # Build a new list rather than removing items from the list being iterated over
    Result = []
    for Element2 in List:
        # Drop Element2 if any other, different string occurs inside it
        if any((Element1 in Element2) and (Element1 != Element2) for Element1 in List):
            continue
        # Keep each surviving string only once
        if Element2 not in Result:
            Result.append(Element2)
    return Result
elements = ["treeforest","tree","TREE","treeforest","water","waterfall","tree"]
print(Remove_Subset(elements))


>>> ['tree', 'TREE', 'water']

This is an explanation of the answer I gave in my comment.

I used this code:

new_elements = list(filter(lambda item: not any(elem in item for elem in elements if elem != item), elements))

which yields:

['tree', 'TREE', 'water']

I don't know how familiar you are with Python generator expressions and filter, so I'll explain anyway.

filter is a Python built-in function that takes a function and applies it to each item of the supplied iterable (e.g. a list), keeping only the items for which the function returns a true value. In our case, the function is this:

lambda item: not any(elem in item for elem in elements if elem != item)

This function takes an item from the list (item), iterates over every element of the list (for elem in elements), and checks for each element (elem) whether it occurs in our string (item). Note that the condition if elem != item skips any element equal to item, because we don't want to compare a string with itself.

The function any keeps iterating until either one of the generated values is True or it reaches the end. If there was a match, any returns True; but to tell filter to drop this item we need to return False, so we invert the output of any with not.

We also pass our list (elements) to filter and convert the result of filter into another list.
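For readers who find the lambda hard to parse, the same logic can be spelled out with a named helper and a list comprehension. This is just an equivalent sketch; has_other_substring is a hypothetical name introduced here.

```python
def has_other_substring(item, elements):
    # True if some other (different) string from elements occurs inside item
    return any(elem in item for elem in elements if elem != item)

elements = ["tree", "TREE", "treeforest", "water", "waterfall"]
# keep only the items that contain no other string from the list
new_elements = [item for item in elements if not has_other_substring(item, elements)]
print(new_elements)  # ['tree', 'TREE', 'water']
```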


Note: the bonus of using any instead of iterating over every item for every other item is that in the case of finding a match, we don't have to iterate over the entire list: any returns at that point.注意:使用any而不是迭代每个其他项目的每个项目的好处是,在找到匹配项的情况下,我们不必迭代整个列表:此时any返回。 In theory, this could be faster than two nested for-loops without a break statement.理论上,这可能比没有break语句的两个嵌套 for 循环更快。
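The short-circuiting can be made visible with a small experiment. In this sketch, contains and checked are hypothetical helpers that merely record which comparisons are actually carried out.

```python
checked = []

def contains(elem, item):
    # record every comparison that is actually carried out
    checked.append(elem)
    return elem in item

elements = ["tree", "TREE", "treeforest", "water", "waterfall"]
item = "treeforest"
found = any(contains(elem, item) for elem in elements if elem != item)
print(found, checked)  # True ['tree']
```

Only the first element is compared: any stops as soon as the generator yields a true value, so the remaining elements are never checked.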


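Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.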
