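Remove string element in a list of strings if the first characters match with another string element in the list
I want to efficiently find and compare the string elements in a list, and then remove every element that is a leading part of another string element in the list (that is, the two share the same starting point).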
list1 = [ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' , ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]
I intend to get a list like the following:
list2 = [ 'green apples are worse' , ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]
In other words, among the elements that begin with the same leading characters, I want to keep only the longest string.
Here is one way you can do it:
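list1 = [ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' , ' this is another sentence ' , 'a boy ran towards the mill and fell']
list2 = []
for i in list1:
    keep = True  # renamed from `bool`, which shadows the built-in type
    for j in list1:
        # Compare by value rather than by id(): object identity of equal
        # strings is an implementation detail. Exact duplicates are kept.
        if i != j and j.startswith(i):
            keep = False
    if keep:
        list2.append(i)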
>>> list2
['green apples are worse', ' this is another sentence ', 'a boy ran towards the mill and fell']
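The same nested-loop idea can be packaged as a reusable function. This is only a minimal sketch: the name `drop_prefixes` is hypothetical, it compares strings by value, and it therefore keeps exact duplicates.

```python
def drop_prefixes(strings):
    """Return the strings that are not a proper prefix of another element.

    Hypothetical helper name. Performs O(n^2) startswith checks and keeps
    the original order of the surviving elements.
    """
    return [
        s for s in strings
        if not any(t != s and t.startswith(s) for t in strings)
    ]


list1 = ['a boy ran', 'green apples are worse',
         'a boy ran towards the mill', ' this is another sentence ',
         'a boy ran towards the mill and fell']
print(drop_prefixes(list1))
```

Using `any()` stops scanning as soon as one longer match is found, which the explicit flag version does not do.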
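As John Coleman suggested in the comments, you can sort the sentences first and then compare consecutive sentences. If a sentence is a prefix of any other sentence, it is also a prefix of the sentence that immediately follows it in the sorted list, so we only need to compare consecutive sentences. To preserve the original order, you can use a set to quickly look up the elements that survived the filtering.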
list1 = ['a boy ran', 'green apples are worse',
         'a boy ran towards the mill', ' this is another sentence ',
         'a boy ran towards the mill and fell']
srtd = sorted(list1)
filtered = set(list1)
for a, b in zip(srtd, srtd[1:]):
    if b.startswith(a):
        # discard() instead of remove(): with duplicate sentences the same
        # element can be hit twice, and remove() would raise KeyError.
        filtered.discard(a)
list2 = [x for x in list1 if x in filtered]
Afterwards, list2 is as follows:
['green apples are worse',
' this is another sentence ',
'a boy ran towards the mill and fell']
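This needs O(n log n) comparisons, which is much faster than comparing all pairs of sentences in O(n²), but if the list is not too long, Vicrobot's simpler solution works just as well.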
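To check that looking only at sorted neighbours really gives the same answer as comparing all pairs, one can run both versions on random input. The sketch below uses hypothetical function names and assumes the input contains no duplicates, since the two approaches treat duplicates differently.

```python
import random


def drop_prefixes_quadratic(strings):
    # Reference version: compare every pair of elements.
    return [s for s in strings
            if not any(t != s and t.startswith(s) for t in strings)]


def drop_prefixes_sorted(strings):
    # Sort once, then check each element only against its successor.
    # Assumes `strings` contains no duplicates.
    srtd = sorted(strings)
    dropped = {a for a, b in zip(srtd, srtd[1:]) if b.startswith(a)}
    return [s for s in strings if s not in dropped]


random.seed(0)
for _ in range(200):
    # Unique random strings over a tiny alphabet, so prefixes are common.
    sample = list({''.join(random.choices('ab', k=random.randint(1, 5)))
                   for _ in range(8)})
    assert drop_prefixes_sorted(sample) == drop_prefixes_quadratic(sample)
print('both versions agree')
```

A randomized check like this is not a proof, but it is a cheap way to catch a mistake in the neighbour-only argument.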
The question is somewhat ambiguous about how ['a','ab','ac','add'] should be handled. I am assuming you want ['ab','ac','add'].
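The ambiguity can be made concrete by coding both readings of the question. Both function names below are hypothetical and only illustrate the two interpretations; neither is taken from the original post.

```python
def drop_proper_prefixes(strings):
    # Reading 1: remove only strings that are a proper prefix of another.
    return [s for s in strings
            if not any(t != s and t.startswith(s) for t in strings)]


def longest_per_first_char(strings):
    # Reading 2 (literal): one survivor per initial character, the longest.
    # On ties the earliest string wins; empty strings are ignored.
    best = {}
    for s in strings:
        if s and len(s) > len(best.get(s[0], '')):
            best[s[0]] = s
    return list(best.values())


sample = ['a', 'ab', 'ac', 'add']
print(drop_proper_prefixes(sample))    # ['ab', 'ac', 'add']
print(longest_per_first_char(sample))  # ['add']
```

The answers in this thread all follow the first reading.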
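The code below additionally assumes that you have no empty strings. That is not a good assumption.
Basically, we build a tree from the input values and keep only the leaf nodes. This is probably the most complicated way to do it. I think it has the potential to be the most efficient, but I am not sure. It is not what you asked for.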
from collections import defaultdict
from typing import Collection, Dict, Generator, Iterable, List, Union

# Exploded is a recursive data type representing a culled list of strings
# as a tree of character-by-character common prefixes.
# The leaves are the non-common suffixes.
Exploded = Dict[str, Union["Exploded", str]]


def explode(subject: Iterable[str]) -> Exploded:
    # Group the non-empty strings by first character, keeping the remainders.
    heads_to_tails = defaultdict(list)
    for s in subject:
        if s:
            heads_to_tails[s[0]].append(s[1:])
    return {
        head: prune_or_follow(tails)
        for (head, tails)
        in heads_to_tails.items()
    }


def prune_or_follow(tails: List[str]) -> Union[Exploded, str]:
    if 1 < len(tails):
        return explode(tails)
    else:  # we just assume it's not empty.
        return tails[0]


def implode(tree: Exploded, prefix: Iterable[str] = ()) -> Generator[str, None, None]:
    # Walk the tree and rebuild each surviving string from its path.
    for (head, continued) in tree.items():
        if isinstance(continued, str):
            yield ''.join((*prefix, head, continued))
        else:
            yield from implode(continued, (*prefix, head))


def cull(subject: Iterable[str]) -> Collection[str]:
    return list(implode(explode(subject)))


print(cull(['a','ab','ac','add']))
print(cull([ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' , ' this is another sentence ' , 'a boy ran towards the mill and fell']))
print(cull(['a', 'ab', 'ac', 'b', 'add']))
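Edit:
I flattened some of the calls, and I hope it is easier to read and reason about this way. It bothers me that I cannot figure out the runtime complexity of this process. I think it is O(nm), where m is the length of the overlapping prefixes, compared with O(nm log(n)) for the string comparisons...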
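For comparison, here is a sketch of a conventional nested-dict trie. It is a different construction from the explode/implode code above, and the name `cull_with_trie` is hypothetical. Insertion touches each input character once and the traversal visits each node once, so, assuming average constant-time dict operations, the work is proportional to the total number of characters in the input. Unlike the versions above, it does not preserve input order and it collapses exact duplicates.

```python
def cull_with_trie(strings):
    """Keep the strings that are not a proper prefix of another (sketch).

    Hypothetical helper: a plain nested-dict trie. END marks the node at
    which an input string stops and stores that string.
    """
    END = object()  # sentinel key that cannot collide with a character
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = s

    result = []
    stack = [root]  # explicit stack, so long strings do not hit the recursion limit
    while stack:
        node = stack.pop()
        children = [child for key, child in node.items() if key is not END]
        # A string survives only if nothing longer continues below it.
        if END in node and not children:
            result.append(node[END])
        stack.extend(children)
    return result


print(sorted(cull_with_trie(['a', 'ab', 'ac', 'b', 'add'])))
```

Sorting the result here is only for a stable printout; the traversal order itself depends on the stack.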
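Edit:
I started another question on Code Review, hoping someone can help me figure out the complexity. Someone there pointed out that the code as written did not actually work: groupby is garbage under any reasonable interpretation of its name (it only groups consecutive items that share a key). I have swapped out the code above, and it is easier to read this way too.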
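Edit:
OK, I have incorporated some excellent suggestions from CR. At this point I am fairly sure that my runtime complexity is better than that of the sort-based options.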