
Remove string element in a list of strings if the first characters match with another string element in the list

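I want to efficiently find and compare the string elements in a list, and then remove those that are part of another string element in the list (starting from the same point).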

list1 = [ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' ,  ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]

I intend to get a list that looks like this:

list2 = [  'green apples are worse' , ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]

In other words, among the elements that start with the same first characters, I want to keep the longest string element.

Here is one way you can do it:

list1 = ['a boy ran', 'green apples are worse', 'a boy ran towards the mill', ' this is another sentence ', 'a boy ran towards the mill and fell']
list2 = []
for i in list1:
    # Keep i only if no different string in the list starts with it.
    keep = True
    for j in list1:
        if j != i and j.startswith(i):
            keep = False
            break
    if keep:
        list2.append(i)

print(list2)
# ['green apples are worse', ' this is another sentence ', 'a boy ran towards the mill and fell']

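As John Coleman suggested in the comments, you can sort the sentences first and then compare consecutive sentences. If one sentence is a prefix of another, it appears directly before a sentence that extends it in the sorted list, so we only need to compare consecutive sentences. To preserve the original order, you can use a set for fast lookup of the filtered elements.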

list1 = ['a boy ran', 'green apples are worse',
         'a boy ran towards the mill', ' this is another sentence ',
         'a boy ran towards the mill and fell']

filtered = set(list1)
# Sort the distinct strings so a duplicate can never be removed twice.
srtd = sorted(filtered)
for a, b in zip(srtd, srtd[1:]):
    if b.startswith(a):
        filtered.remove(a)

# Rebuild the result in the original order.
list2 = [x for x in list1 if x in filtered]

Afterwards, list2 is as follows:

['green apples are worse',
 ' this is another sentence ',
 'a boy ran towards the mill and fell']

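This takes O(n log n) comparisons, which is much faster than comparing all pairs of sentences in O(n²), but if the list is not too long, Vicrobot's simpler solution will work just fine.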
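One way to gain confidence that the two approaches agree is to wrap each in a function and compare them on random inputs. The sketch below does that. The names `drop_prefixes_quadratic` and `drop_prefixes_sorted` are made up for this illustration, and both versions assume that equal strings do not count as prefixes of each other, so duplicates are kept.

```python
import random

def drop_prefixes_quadratic(strings):
    """O(n^2) pairwise check: keep s unless a different string starts with s."""
    return [s for s in strings
            if not any(t != s and t.startswith(s) for t in strings)]

def drop_prefixes_sorted(strings):
    """Sort-based check: a prefix sorts directly before a string that extends it."""
    srtd = sorted(set(strings))  # distinct strings in lexicographic order
    dropped = {a for a, b in zip(srtd, srtd[1:]) if b.startswith(a)}
    return [s for s in strings if s not in dropped]

# Compare both functions on small random lists over a two-letter alphabet,
# which makes prefix relationships (and empty strings) very likely.
random.seed(0)
for _ in range(200):
    sample = ["".join(random.choices("ab", k=random.randint(0, 4)))
              for _ in range(random.randint(0, 8))]
    assert drop_prefixes_quadratic(sample) == drop_prefixes_sorted(sample)
print("both approaches agree on 200 random samples")
```

The random strings are deliberately short so that most samples contain at least one prefix pair; longer strings over a larger alphabet would rarely exercise the interesting case.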

Your question is somewhat ambiguous about how you want to handle ['a','ab','ac','add']. I'm assuming you want ['ab','ac','add'].
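To make the ambiguity concrete, here is a small sketch of the two readings side by side. Both function names are invented for this example: `drop_prefixes` removes a string only when a different string starts with it, while `longest_per_first_char` takes "same first characters" literally as "same first character" and keeps one string per starting character.

```python
def drop_prefixes(strings):
    """Keep each string unless some different string in the list starts with it."""
    return [s for s in strings
            if not any(t != s and t.startswith(s) for t in strings)]

def longest_per_first_char(strings):
    """Alternative reading: keep only the longest string for each first character."""
    best = {}
    for s in strings:
        # Empty strings have no first character and are skipped here.
        if s and len(s) > len(best.get(s[0], "")):
            best[s[0]] = s
    return list(best.values())

words = ['a', 'ab', 'ac', 'add']
print(drop_prefixes(words))           # ['ab', 'ac', 'add']
print(longest_per_first_char(words))  # ['add']
```

The rest of this answer follows the first reading.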

The code below additionally assumes that you don't have any empty strings. That's not a great assumption.

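Basically, we build a tree from the input values and keep only the leaf nodes. This is probably the most complex approach. I think it may be the most efficient, but I'm not sure, and it's not what you asked for.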

from collections import defaultdict
from typing import Collection, Dict, Generator, Iterable, List, Union

# Exploded is a recursive data type representing a culled list of strings as a
# tree of character-by-character common prefixes. The leaves are the
# non-common suffixes.
Exploded = Dict[str, Union["Exploded", str]]

def explode(subject: Iterable[str]) -> Exploded:
    # Group the non-empty strings by first character, keeping the remainders.
    heads_to_tails = defaultdict(list)
    for s in subject:
        if s:
            heads_to_tails[s[0]].append(s[1:])
    return {
        head: prune_or_follow(tails)
        for (head, tails)
        in heads_to_tails.items()
    }

def prune_or_follow(tails: List[str]) -> Union[Exploded, str]:
    if 1 < len(tails):
        return explode(tails)
    else:  # We just assume it's not empty.
        return tails[0]

def implode(tree: Exploded, prefix: Iterable[str] = ()) -> Generator[str, None, None]:
    for (head, continued) in tree.items():
        if isinstance(continued, str):
            yield ''.join((*prefix, head, continued))
        else:
            yield from implode(continued, (*prefix, head))

def cull(subject: Iterable[str]) -> Collection[str]:
    # dict.fromkeys drops duplicates (keeping order); two equal strings would
    # otherwise explode into an empty subtree and vanish from the result.
    return list(implode(explode(dict.fromkeys(subject))))

print(cull(['a','ab','ac','add']))
print(cull([ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' ,  ' this is another sentence ' , 'a boy ran towards the mill and fell']))
print(cull(['a', 'ab', 'ac', 'b', 'add']))
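For comparison, the same "keep only the leaves" idea can be sketched with a plain nested-dict trie. This is not the code from the answer above: `keep_leaves` and the `END` marker are names invented here. Under those assumptions the sketch tolerates empty strings and duplicates, and it returns the survivors in their original order.

```python
def keep_leaves(strings):
    """Return the strings that are not a proper prefix of another input string."""
    END = object()  # marker key: some input string ends at this node
    root = {}

    # Pass 1: insert every string into the trie, one character per level.
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True

    # Pass 2: walk each string again and keep it if its node has no children.
    kept = []
    for s in strings:
        node = root
        for ch in s:
            node = node[ch]
        if len(node) == 1:  # only the END marker, so nothing extends s
            kept.append(s)
    return kept

print(keep_leaves(['a', 'ab', 'ac', 'b', 'add']))
# ['ab', 'ac', 'b', 'add']
```

Each pass touches every character once, so the work is proportional to the total length of the input strings.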

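Edit:
I flattened some of the calls, which I hope makes this easier to read and reason about. It bugs me that I can't figure out the runtime complexity of this process. I think it's O(nm), where m is the length of the overlapping prefixes, compared with O(nm log(n)) for the string comparisons...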

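Edit:
I started another question on Code Review, hoping someone can help me figure out the complexity. Someone there pointed out that the code as written didn't actually work: groupby is garbage under any reasonable interpretation of its name. I've swapped out the code above, and it's also easier to read this way.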
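The complaint about groupby presumably refers to the fact that `itertools.groupby` only groups consecutive items with equal keys, so unsorted input can yield the same key more than once. A small illustration with a made-up word list, next to the `defaultdict` grouping that the code above uses:

```python
from collections import defaultdict
from itertools import groupby

words = ['ab', 'b', 'ac']

# groupby starts a new group whenever the key changes, so 'a' shows up twice.
runs = [(key, list(group)) for key, group in groupby(words, key=lambda w: w[0])]
print(runs)           # [('a', ['ab']), ('b', ['b']), ('a', ['ac'])]

# A defaultdict collects every item with the same key regardless of position.
by_head = defaultdict(list)
for w in words:
    by_head[w[0]].append(w)
print(dict(by_head))  # {'a': ['ab', 'ac'], 'b': ['b']}
```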

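Edit:
OK, I've incorporated some good suggestions from CR. At this point, I'm fairly sure my runtime complexity is better than that of the sort-based option.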


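Notice: Technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.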

 