Python: Remove strings from a list that contain at least one other string from the same list
I would like to filter my list of strings in the following way: a string should be excluded if at least one other string in the same list is contained in it. Put differently: a string should be kept if no other string from the same list is contained in it. The comparison should be case-sensitive, if possible.
To make this clearer, here is an example.
My initial list, which contains every string:
elements =["tree","TREE","treeforest","water","waterfall"]
After applying the solution, I would like to get this list:
elements = ["tree","TREE","water"]
For example, tree is in treeforest, so treeforest is excluded from the list. The same applies to water and waterfall. However, tree, TREE and water should be kept, because no other string in the list is contained in them.
Since I would like to apply this to a larger list of strings, more efficient solutions are preferred.
I hope this is understandable. Thanks a lot in advance! Any help is highly appreciated.
A reasonably optimized function with two nested loops that skips many loop iterations:
def filterlist(l):
    # keep track of elements, which will be deleted
    deletelist = [False for _ in l]
    for i, el in enumerate(l):
        # already in deletelist, jump right to the next el
        if deletelist[i]:
            continue
        for j, el2 in enumerate(l):
            # comparing item to itself or el2 already in deletelist?
            # jump to next el2
            if i == j or deletelist[j]:
                continue
            # the comparison everyone expects
            if el in el2:
                deletelist[j] = True
            # also, check the other way around
            # will save loop iterations later
            elif el2 in el:
                deletelist[i] = True
                break  # causes jump to next el
    # create new list, keep elements that are not in deletelist
    return [el for i, el in enumerate(l) if not deletelist[i]]
Built-in functions are usually faster, so let's compare it to Ed Ward's solution:
# result of Ed Ward's solution using timeit:
100000 loops, best of 10: 5.38 usec per loop
# filterlist function with loops using timeit:
100000 loops, best of 10: 4.42 usec per loop
Interesting, but to get a representative result, you should run timeit with a larger list of elements.
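A rough sketch of such a measurement on a larger list might look like the following. The helper names make_words and filter_with_any, the word lengths and the list size of 500 are all assumptions chosen for illustration; the function under test is the filter/any one-liner from this thread wrapped in a function.

```python
import random
import string
import timeit

def make_words(n, seed=0):
    """Build n random lowercase words of length 3 to 8 (hypothetical test data)."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 8)))
        for _ in range(n)
    ]

def filter_with_any(elements):
    """Keep the strings that contain no other string of the same list."""
    return list(filter(
        lambda item: not any(elem in item for elem in elements if elem != item),
        elements,
    ))

words = make_words(500)
runs = 5  # a small number keeps the run short; raise it for more stable timings
seconds = timeit.timeit(lambda: filter_with_any(words), number=runs)
print(f"filter_with_any on {len(words)} words: {seconds / runs:.4f} s per call")
```

To time another implementation the same way, pass it in place of filter_with_any. Random words rarely contain each other, so real data with many shared roots may behave differently.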
def remove_composite_words(e, elements):
    # collect every string that contains e (this includes e itself)
    temp = set(x for x in elements if e in x)
    # drop those strings and return the rest, shortest first
    elements = list(set(elements).difference(temp))
    return e, sorted(elements, key=len)

def keep_shortest_root(elements):
    # deduplicate, then sort by length so that roots come before composites;
    # set() builds a new object, so the caller's list is not modified
    elements = sorted(set(elements), key=len)
    # skip the empty string, which is contained in every string
    if elements and len(elements[0]) == 0:
        elements = elements[1:]
    results = []
    while elements:
        e, elements = remove_composite_words(elements[0], elements)
        results.append(e)
    return results

elements = ["tree","TREE","treeforest","water","waterfall",'forestTREE','tree']
keep_shortest_root(elements)
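This should return (tree and TREE may appear in either order, because strings of equal length come out of the set in arbitrary order):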
['tree', 'TREE', 'water']
How it works:
The function remove_composite_words() tests whether an element is contained in each element of the list and collects those that match. It then removes the matching elements from the initial list.
So if you have the element 'a' and the list ['a','aa','b','c'], the function returns 'a' and the list ['b','c'].
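The single step described above can be tried in isolation. This is a minimal sketch that repeats the helper so it runs on its own; the variable names root and rest are only for illustration. Because the helper goes through a set, the order of the two remaining strings of equal length is not guaranteed, so the check compares them sorted.

```python
def remove_composite_words(e, elements):
    # every string that contains e, including e itself
    temp = set(x for x in elements if e in x)
    # what is left after removing them, shortest first
    elements = list(set(elements).difference(temp))
    return e, sorted(elements, key=len)

root, rest = remove_composite_words('a', ['a', 'aa', 'b', 'c'])
print(root)          # a
print(sorted(rest))  # ['b', 'c']
```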
keep_shortest_root() applies remove_composite_words() to the initial list and then to the transformed list (the output of remove_composite_words()) until there are no words left.
Note that keep_shortest_root() first gets the unique words from the input list and then sorts them by length. Combined with the fact that remove_composite_words() removes the matched words from the list, this makes the algorithm faster, because the number of comparisons drops with each iteration.
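The same shrinking-list idea can be sketched without sets, which keeps the result order deterministic. This is a variant under stated assumptions, not the answer's own code: the name keep_shortest_root_ordered is hypothetical, and dict.fromkeys is used for deduplication because it preserves first-seen order (Python 3.7+).

```python
def keep_shortest_root_ordered(elements):
    # drop empty strings and duplicates, keeping first-seen order,
    # then sort shortest first (sorted is stable, so ties keep their order)
    remaining = sorted(dict.fromkeys(e for e in elements if e), key=len)
    results = []
    while remaining:
        root = remaining[0]
        results.append(root)
        # remove the root itself and every string that contains it,
        # so each pass has fewer strings left to compare
        remaining = [x for x in remaining if root not in x]
    return results

elements = ["tree", "TREE", "treeforest", "water", "waterfall", "forestTREE", "tree"]
print(keep_shortest_root_ordered(elements))  # ['tree', 'TREE', 'water']
```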
I found a somewhat simpler solution than the ones already provided, so I thought I would chip in:
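def Remove_Subset(List):
    # list(List) makes a real copy; "ListCopy = List" would only add a second
    # name for the same list, and removing from it would disturb the loops
    ListCopy = list(List)
    for Element1 in List:
        for Element2 in List:
            # the membership check avoids a ValueError if Element2 is already gone
            if (Element1 in Element2) and (Element1 != Element2) and (Element2 in ListCopy):
                ListCopy.remove(Element2)
    return ListCopy

elements = ["treeforest","tree","TREE","treeforest","water","waterfall","tree"]
# duplicates such as the second "tree" are kept, because equal strings are not compared
print(Remove_Subset(elements))
>>> ['tree', 'TREE', 'water', 'tree']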
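In Python, assigning a list to a second name does not copy it, which matters whenever a function removes items from what it treats as a copy. A minimal sketch of the difference (the variable names are hypothetical):

```python
original = ["tree", "treeforest"]
alias = original                # a second name for the same list object
shallow_copy = list(original)   # a new list holding the same items

alias.remove("treeforest")
print(original)       # ['tree']
print(shallow_copy)   # ['tree', 'treeforest']
print(alias is original)  # True
```

Removing through alias also changes original, while shallow_copy is unaffected. Removing items from a list while a for loop is iterating over that same list can additionally cause the loop to skip elements.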
This is an explanation of the answer I gave in my comment.
I used this code:
new_elements = list(filter(lambda item: not any(elem in item for elem in elements if elem != item), elements))
which yields:
['tree', 'TREE', 'water']
I don't know how much you know about Python generator expressions and filter, so I'll try to explain anyway.
filter is a Python built-in function that takes a function and applies it to each item of the supplied iterable (e.g. a list), keeping only the items for which the function returns a true value. In our case, the function is this:
lambda item: not any(elem in item for elem in elements if elem != item)
This function takes an item from the list (item), iterates over every element in the list (for elem in elements), and checks for each element (elem) whether it is contained in our string (item). Note that the condition if elem != item skips any element equal to item, because we don't want to compare a string with itself.
The function any keeps iterating until either one of the values it receives is True or it reaches the end. If there were any matches, any returns True, but to tell filter to drop this item we need to return False, so we invert the output of any with not.
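The early exit of any can be made visible by recording which comparisons actually run. This is only an illustrative sketch; the helper contains and the list checked are hypothetical names added for the demonstration.

```python
checked = []

def contains(elem, item):
    checked.append(elem)  # record each comparison that actually runs
    return elem in item

elements = ["tree", "TREE", "treeforest", "water", "waterfall"]
item = "treeforest"

found = any(contains(elem, item) for elem in elements if elem != item)
print(found)        # True
print(checked)      # ['tree']
print(not found)    # False
```

Only the first comparison runs, because "tree" is already a match. The inverted value False is what tells filter to drop "treeforest".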
We also pass our list (elements) to filter, and convert the result of filter to another list.
Note: the bonus of using any instead of comparing every item against every other item is that once a match is found, we don't have to iterate over the rest of the list: any returns at that point. In theory, this could be faster than two nested for loops without a break statement.
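For comparison, the same logic can be written as two explicit for loops, where break plays the role of the early return of any. This is a sketch of an equivalent formulation, not a measured alternative; the function name filter_with_loops is hypothetical.

```python
def filter_with_loops(elements):
    kept = []
    for item in elements:
        contains_other = False
        for elem in elements:
            # skip the string itself, then test for containment
            if elem != item and elem in item:
                contains_other = True
                break  # stop at the first match, like any does
        if not contains_other:
            kept.append(item)
    return kept

elements = ["tree", "TREE", "treeforest", "water", "waterfall"]
print(filter_with_loops(elements))  # ['tree', 'TREE', 'water']
```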