Removing multiple duplicate values from a list in Python

Question

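I'm working with large (~5000 line) text files generated by reporting software. These files have several header lines per page and many blank lines throughout. I've found a way to filter out the unwanted data, but I'd like to know whether it is the best way to do this. I have this function for filtering the list; it basically iterates over the list and shrinks it by removing one kind of filter line on each pass.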

def process_block(b):
    b1 = [line for line in b if not line.startswith('100   V')]
    b2 = [line for line in b1 if not line.startswith('300   V')]
    b3 = [line for line in b2 if not line.startswith('400   V')]
    b4 = [line for line in b3 if not line.startswith('AR00000')]
    b5 = [line for line in b4 if not line.startswith('734 - C')]
    b6 = [line for line in b5 if not line.lstrip().startswith('TXN DAT')]
    b7 = [line for line in b6 if not line.startswith('   ACCO')]
    b8 = [line for line in b7 if not line.rstrip() == '']
    return b8

I feel like I'm making more passes than necessary. Is there a better way to do this filtering?

Answer 1

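The str.startswith() method accepts a tuple of prefixes. So instead of multiple loops, you can use a single list comprehension and pass all the patterns to one startswith() call.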
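As a quick illustration of that point, here is a minimal sketch. The sample lines are made up for the demonstration; only the prefixes come from the question. It also shows that the argument has to be a tuple, since a list is rejected with a TypeError.

```python
# Prefixes taken from the question; the sample lines are hypothetical.
prefixes = ('100   V', '300   V', '400   V')

lines = ['100   V header', '300   V header', 'real data 1', '400   V header']

# A single call checks the line against every prefix in the tuple.
kept = [line for line in lines if not line.startswith(prefixes)]
print(kept)  # ['real data 1']

# startswith() wants a str or a tuple of str; a list raises TypeError.
try:
    'real data 1'.startswith(list(prefixes))
except TypeError:
    list_rejected = True
else:
    list_rejected = False
print(list_rejected)  # True
```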

Also, as a more Pythonic approach, you can use the following generator function to return an iterator over the filtered lines of the file:

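def filter_lines(file_name):
    # Named filter_lines rather than filter to avoid shadowing the built-in.
    prefixes = ("100   V", "300   V", "400   V")  # ... plus the remaining prefixes
    with open(file_name) as f:
        for line in f:
            # Leading whitespace is stripped before the prefix check.
            if not line.lstrip().startswith(prefixes):
                yield line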

If memory usage is not a concern, you can filter the file object with a list comprehension, which is a faster approach.

filtered_obj = [line for line in file_object if not line.lstrip().startswith(prefixes)]

Answer 2

You can definitely do this in a single pass.


def process_block(b):
    return [line for line in b if
        not line.startswith(
                ('100   V', '300   V', '400   V', 'AR00000', '734 - C', '   ACCO')
            )
        and not line.lstrip().startswith('TXN DAT')
        and not line.rstrip() == '']
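To check that the single-pass version really filters the same lines as the question's multi-pass version, a small comparison like the following may help. This is only a sketch: the names process_block_multi and process_block_single and the sample lines are hypothetical, chosen so that each filter rule is hit at least once.

```python
def process_block_multi(b):
    # The question's approach: one pass over the list per filter rule.
    b1 = [line for line in b if not line.startswith('100   V')]
    b2 = [line for line in b1 if not line.startswith('300   V')]
    b3 = [line for line in b2 if not line.startswith('400   V')]
    b4 = [line for line in b3 if not line.startswith('AR00000')]
    b5 = [line for line in b4 if not line.startswith('734 - C')]
    b6 = [line for line in b5 if not line.lstrip().startswith('TXN DAT')]
    b7 = [line for line in b6 if not line.startswith('   ACCO')]
    b8 = [line for line in b7 if not line.rstrip() == '']
    return b8


def process_block_single(b):
    # The single-pass approach: all rules combined in one comprehension.
    prefixes = ('100   V', '300   V', '400   V', 'AR00000', '734 - C', '   ACCO')
    return [line for line in b
            if not line.startswith(prefixes)
            and not line.lstrip().startswith('TXN DAT')
            and line.rstrip() != '']


# Hypothetical report lines, not taken from a real file.
sample = [
    '100   V REPORT HEADER\n',
    '   TXN DATE   AMOUNT\n',
    '   ACCOUNT    NAME\n',
    '\n',
    '   \n',
    'AR00000 something\n',
    '734 - COMPANY\n',
    '2016-02-18  42.00\n',
    '300   V x\n',
    '400   V y\n',
    '  indented data row\n',
]

result = process_block_single(sample)
print(result == process_block_multi(sample))  # True
print(result)  # ['2016-02-18  42.00\n', '  indented data row\n']
```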

Answer 3

You may find the following approach useful.

Given:

a = ['test', 'test_1', 'test_2', 'test_3', 'test']

b = ['test']

we can subtract b from a as follows:

c = list(set(a) - set(b))

print(c)

This yields (set ordering is arbitrary, so the order may vary):

['test_3', 'test_2', 'test_1']

Or we can remove duplicates as follows:

c = list(dict(zip(a, [None]*len(a))).keys())

print(c)

This yields:

['test_3', 'test_2', 'test', 'test_1']

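Note that with the latter approach the order is lost on older Python versions (a plain dict only preserves insertion order from Python 3.7 onward). If you want to keep the order regardless of version, use collections.OrderedDict from the Python standard library.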
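A minimal sketch of the order-preserving variants, reusing the a and b lists from above. The variable names deduped, exclude and remaining are only illustrative.

```python
from collections import OrderedDict

a = ['test', 'test_1', 'test_2', 'test_3', 'test']
b = ['test']

# Remove duplicates while keeping the first-seen order of each item.
deduped = list(OrderedDict.fromkeys(a))
print(deduped)  # ['test', 'test_1', 'test_2', 'test_3']

# "Subtract" b from a without losing order; the set is only for fast lookup.
exclude = set(b)
remaining = [item for item in a if item not in exclude]
print(remaining)  # ['test_1', 'test_2', 'test_3']
```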

Now you just need to split the strings and operate on them.

Answer 4

Put your patterns in a list; you can then reject any given line with:

patterns = ['aaa' , 'bbb']
any(line.startswith(p) for p in patterns)

To process the whole file, use filter to build an iterator:

for line in filter(lambda l: not any(l.startswith(p) for p in patterns), fp):
    print(line)
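Here is a self-contained sketch of that idea. It assumes an io.StringIO object with made-up contents as a stand-in for the open file fp, and it also shows that the any(...) test gives the same result as passing tuple(patterns) to startswith().

```python
import io

patterns = ['aaa', 'bbb']

# io.StringIO stands in for an open file object (hypothetical contents).
fp = io.StringIO('aaa header\nkeep me\nbbb header\nkeep me too\n')

# filter() is lazy; list() consumes it so the result can be inspected.
kept = list(filter(lambda l: not any(l.startswith(p) for p in patterns), fp))
print(kept)  # ['keep me\n', 'keep me too\n']

# The same check written with a tuple of prefixes.
fp.seek(0)
kept_tuple = [l for l in fp if not l.startswith(tuple(patterns))]
print(kept_tuple == kept)  # True
```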

Answer 5

str.startswith can accept a tuple instead of a single string:

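def process_block(b):
    # The prefixes must be passed as one tuple argument, not as separate arguments.
    return [line for line in b if not line.startswith(
        ('100   V', '300   V', '400   V', 'AR00000', '734 - C', '   ACCO')
        ) and not line.lstrip().startswith('TXN DATE') and line.rstrip() != '']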

5 solutions

Solution 1
0 2016-02-18 18:44:23

Solution 2
0 accepted 2016-02-18 18:54:56

Solution 3
0 2016-02-18 18:55:27

Solution 4
0 2016-02-18 18:56:53

Solution 5
-1 2016-02-18 18:54:13
