
Python 3 remove None values from end of sublist or exclude if sublist is entirely None values

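I have a list of lists. Some of the sublists consist entirely of None, and some have None values at the end and between strings. I need to do three things:

  1. Remove None values from the end of each sublist, if there are any, and replace the ones in the middle (between strings) with empty strings.

  2. Exclude sublists that are entirely None values.

  3. Create a new list of lists with the results.

My attempt produces the desired result, but I would like to know whether there is a faster way to do this: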

from itertools import islice

rows = [["row 1 index 0",None,"row 1 index 2",None,None],
        [None,"row 2 index 1",None,None,None],
        [None,None,None,None,None]]
data = []

for r in rows:
    for i,c in enumerate(reversed(r)):
        if c is not None:
            data.append(["" if x is None else
                         str(x) for x in islice(r,0,len(r)-i)])
            break
print(data)

Desired result/output:

[['row 1 index 0', '', 'row 1 index 2'], ['', 'row 2 index 1']]

Benchmark (as best I know how):

from itertools import islice
import time

q = ["string",None,"string",None,"string"] + [None] * 95
rows = [q.copy() for i in range(500000)]

for z in range(1,6):
    st = time.time()
    data = []
    for r in rows:
        for i,c in enumerate(reversed(r)):
            if c is not None:
                data.append(["" if x is None else
                             str(x) for x in islice(r,0,len(r)-i)])
                break
    end = time.time()
    print("Run: " + str(z) + "| time: " + str(end-st))

Results (i5 Ivy Bridge, Windows 10):

Run: 1| time: 5.787390232086182
Run: 2| time: 5.802111387252808
Run: 3| time: 5.697156190872192
Run: 4| time: 5.38789963722229
Run: 5| time: 5.739344596862793

You can eliminate the None values step by step with list comprehensions:

from itertools import dropwhile

rows = [["row 1 index 0",None,"row 1 index 2",None,None],
        [None,"row 2 index 1",None,None,None],
        [None,None,None,None,None]]

# remove lists with all Nones
rows1 = [row for row in rows if set(row) != {None}]     
# remove trailing Nones
rows2 = [dropwhile(lambda x: x is None, reversed(row)) for row in rows1]
# replace None with ''
rows3 = [list(reversed([x if x is not None else '' for x in row])) for row in rows2]
print(rows3)

Output:

[['row 1 index 0', '', 'row 1 index 2'], ['', 'row 2 index 1']]
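To see why reversing twice works, it may help to trace a single row through the steps. The snippet below is only an illustrative sketch: the `row` value comes from the example above, while the intermediate names `trimmed_reversed` and `cleaned` are introduced here for illustration.

```python
from itertools import dropwhile

row = ["row 1 index 0", None, "row 1 index 2", None, None]

# dropwhile only strips from the front, so reverse the row first:
# the trailing Nones become leading Nones and are dropped.
trimmed_reversed = list(dropwhile(lambda x: x is None, reversed(row)))
print(trimmed_reversed)  # ['row 1 index 2', None, 'row 1 index 0']

# Replace the remaining (interior) Nones, then restore the original order.
cleaned = ['' if x is None else x for x in trimmed_reversed][::-1]
print(cleaned)  # ['row 1 index 0', '', 'row 1 index 2']
```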

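TL;DR

Performance depends on the nature of the data! See the timings below and lean toward whichever you think will do better on the data sets you expect to encounter. I propose a partially in-place solution that currently seems to perform best in my tests, but hopefully it is clear that you should flesh out the benchmarks to get a true sense of your trade-offs.

First, I built a test set.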

In [60]: rows = [["row 1 index 0",None,"row 1 index 2",None,None],
    ...:         [None,"row 2 index 1",None,None,None],
    ...:         [None,None,None,None,None]]

In [61]: rowsbig = [r*1000 for r in rows]

In [62]: rowsbig = [list(r) for _ in range(1000) for r in rowsbig]

In [63]: sum(len(r) for r in rowsbig)
Out[63]: 15000000

Now, a little helper to keep things clean (each call returns fresh copies of the rows):

In [65]: def test_set(source=rowsbig):
     ...:     return [list(r) for r in source]
     ...:

So, let's wrap the three proposed approaches in functions:

In [86]: def new_to_coding(rows):
    ...:     data = []
    ...:     for r in rows:
    ...:         for i,c in enumerate(reversed(r)):
    ...:             if c is not None:
    ...:                 data.append(["" if x is None else
    ...:                          str(x) for x in islice(r,0,len(r)-i)])
    ...:                 break
    ...:     return data
    ...:

In [87]: def Bit(rows):
    ...:     data = [list(map(lambda x: '' if x is None else x, row)) for row in rows]
    ...:     data = [row[:max(i for i, e in enumerate(row, 1) if e != '')] for row in data if set(row) != {''}]
    ...:     return data
    ...:

In [88]: def taras(rows):
    ...:     # remove lists with all Nones
    ...:     rows1 = [row for row in rows if set(row) != {None}]
    ...:     # remove trailing Nones
    ...:     rows2 = [dropwhile(lambda x: x is None, reversed(row)) for row in rows1]
    ...:     # replace None with ''
    ...:     rows3 = [list(reversed([x if x is not None else '' for x in row])) for row in rows2]
    ...:     return rows3
    ...:

A quick sanity check:

In [89]: taras(test_set()) == new_to_coding(test_set())
Out[89]: True

In [90]: Bit(test_set()) == new_to_coding(test_set())
Out[90]: True

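Now, some timing setup. Note to @new_to_coding: always use the timeit module for benchmarks. The naive time.time() approach ignores a lot of subtleties, and timeit is more convenient besides!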

In [91]: from timeit import timeit

In [92]: setup = "from __main__ import new_to_coding, Bit, taras, test_set; testrows = test_set()"
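As a rough sketch of why `timeit` is the better tool: it temporarily disables garbage collection during the timed run and makes repeating a measurement straightforward. The example below uses `timeit.repeat` with the `globals` parameter (available since Python 3.5) rather than a `from __main__ import ...` setup string. The small `clean` function and the `sample` data are placeholders for illustration, not part of the benchmark in this answer.

```python
from timeit import repeat

def clean(rows):
    """Placeholder workload: replace None with '' in every row."""
    return [['' if x is None else x for x in row] for row in rows]

sample = [["a", None, "b", None]] * 1000

# repeat=5 gives five independent measurements of number=100 calls each.
timings = repeat('clean(sample)',
                 globals={'clean': clean, 'sample': sample},
                 repeat=5, number=100)

# The minimum is usually the most representative figure: higher values
# tend to reflect interference from other processes, not the code itself.
print(min(timings))
```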

Now, the results:

In [93]: # using OP's method
    ...: timeit('new_to_coding(testrows)', setup, number=5)
Out[93]: 5.416837869910523

In [94]: # using `Bit`
    ...: timeit('Bit(testrows)', setup, number=5)
Out[94]: 14.52187539380975

In [95]: # using `taras`
    ...: timeit('taras(testrows)', setup, number=5)
Out[95]: 3.7361009169835597

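So, it seems the step-by-step approach wins! Of course, the exact nature of the data may change these relative timings. I suspect the proportion of "all None" rows will affect the relative performance of these approaches. Warning! This turned out to be very true! See the edit.

I went ahead and micro-optimized @taras's approach: making sure all names are local to the function so that no global lookups are needed, replacing list(reversed(alist)) with alist[::-1], and turning the intermediate transformations into generator expressions so that only one list is materialized: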

In [111]: def is_None(x): return x is None
     ...:
     ...: def taras_micro_op(rows, dropwhile=dropwhile, reversed=reversed, set=set, is_None=is_None):
     ...:     # remove lists with all Nones
     ...:     rows1 = (row for row in rows if set(row) != {None})
     ...:     # remove trailing Nones
     ...:     rows2 = (dropwhile(is_None, reversed(row)) for row in rows1)
     ...:     # replace None with ''
     ...:     rows3 = [[x if x is not None else '' for x in row][::-1] for row in rows2]
     ...:     return rows3
     ...:

In [112]: taras_micro_op(test_set()) == taras(test_set())
Out[112]: True

In [113]: setup = "from __main__ import taras, taras_micro_op, test_set; testrows = test_set()"

In [114]: # using `taras`
     ...: timeit('taras(testrows)', setup, number=50)
Out[114]: 35.11660181987099

In [115]: # using `taras_micro_op`
     ...: timeit('taras_micro_op(testrows)', setup, number=50)
Out[115]: 33.70030225184746

In [116]: 33.70030225184746 / 35.11660181987099
Out[116]: 0.9596686611281929

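Less than a 5% improvement. Indeed, I would drop the "inlining via default arguments" trick and keep only the intermediate generator expressions, if only for the sake of memory efficiency.

In other words, I suggest using the following: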

In [117]: def taras_memory_op(rows):
     ...:     # remove lists with all Nones
     ...:     rows1 = (row for row in rows if set(row) != {None})
     ...:     # remove trailing Nones
     ...:     rows2 = (dropwhile(lambda x: x is None, reversed(row)) for row in rows1)
     ...:     # replace None with ''
     ...:     rows3 = [[x if x is not None else '' for x in row][::-1] for row in rows2]
     ...:     return rows3
     ...:

In [118]: setup = "from __main__ import taras, taras_memory_op, test_set; testrows = test_set()"

In [119]: # using `taras`
     ...: timeit('taras(testrows)', setup, number=50)
Out[119]: 35.10479677491821

In [120]: # using `taras_memory_op`
     ...: timeit('taras_memory_op(testrows)', setup, number=50)
Out[120]: 34.00812040804885

In [121]: 34.00812040804885/35.10479677491821
Out[121]: 0.9687599283396816

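That is because most of the already small improvement actually comes from using generator expressions!

Edit

So, I tried it with the test set provided by the OP: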

In [3]: q = ["string",None,"string",None,"string"] + [None] * 95
   ...: rows = [q.copy() for i in range(500000)]
   ...:

In [4]: sum(len(r) for r in rows)
Out[4]: 50000000

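Note that in my original test set about 33% of the rows were "all None" rows. In the set above, however, no row is all None. It turns out that this definitely affects performance.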

In [7]: def test_set(source=rows):
   ...:     return [list(r) for r in source]
   ...:

In [8]: setup = "from __main__ import new_to_coding, taras_memory_op, test_set; testrows = test_set()"

In [9]: # using OP's method
   ...: timeit('new_to_coding(testrows)', setup, number=5)
Out[9]: 14.014577565016225

In [10]: # using `taras_memory_op`
    ...: timeit('taras_memory_op(testrows)', setup, number=5)
Out[10]: 33.28037207596935
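To explore how the share of "all None" rows shifts these timings, one option is to generate test sets with a configurable proportion of such rows. The helper below is a hypothetical sketch: `make_rows` and its parameters are invented for illustration, and the row template only loosely mirrors the OP's data.

```python
import random

def make_rows(n_rows, width, none_fraction, seed=0):
    """Hypothetical helper: build n_rows rows of length `width` (>= 3),
    of which roughly `none_fraction` are entirely None."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    template = ["string", None, "string"] + [None] * (width - 3)
    rows = []
    for _ in range(n_rows):
        if rng.random() < none_fraction:
            rows.append([None] * width)
        else:
            rows.append(list(template))
    return rows

rows_a = make_rows(1000, 10, 0.0)  # no all-None rows, like the OP's data
rows_b = make_rows(1000, 10, 1.0)  # every row is all-None
```

Timing each candidate function over a few values of `none_fraction` should show where the crossover lies for a given row width.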

So, I propose another solution. Warning! The following solution mutates the inner lists in place:

In [14]: def sanitize(rows):
    ...:     result = []
    ...:     for row in rows:
    ...:         tail = True
    ...:         maxidx = len(row) - 1
    ...:         for i, item in enumerate(reversed(row)):
    ...:             if item is None:
    ...:                 if tail:
    ...:                     row.pop()
    ...:                 else:
    ...:                     row[maxidx - i] = ''
    ...:             else:
    ...:                 tail = False
    ...:         if row:
    ...:             result.append(row)
    ...:     return result
    ...:

In [15]: setup = "from __main__ import new_to_coding, taras_memory_op, sanitize, test_set; testrows = test_set()"

In [16]: # using `sanitize`
    ...: timeit('sanitize(testrows)', setup, number=5)
Out[16]: 8.261458976892754
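One caveat when timing an in-place function this way: the setup string runs once per `timeit` call, so with `number=5` only the first call sees untrimmed rows. The remaining calls operate on lists that have already been shortened, which likely flatters the figure. A possible way around this, sketched below, is `Timer.repeat` with `number=1`, since the setup is re-executed (untimed) before each repeat. The small `rows` list and the compact `sanitize_inplace` stand-in are assumptions made for illustration.

```python
from timeit import Timer

def sanitize_inplace(rows):
    """Stand-in for an in-place cleaner: trims trailing Nones by popping."""
    result = []
    for row in rows:
        while row and row[-1] is None:
            row.pop()
        for i, item in enumerate(row):
            if item is None:
                row[i] = ''
        if row:
            result.append(row)
    return result

rows = [["a", None, "b"] + [None] * 97 for _ in range(1000)]

# The setup builds fresh copies; it runs before every repeat and is not timed.
timer = Timer('sanitize_inplace(testrows)',
              setup='testrows = [list(r) for r in rows]',
              globals={'sanitize_inplace': sanitize_inplace, 'rows': rows})
timings = timer.repeat(repeat=5, number=1)
print(min(timings))
```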

In [17]: sanitize(test_set()) == new_to_coding(test_set())
Out[17]: True
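If mutating the caller's lists is not acceptable, a non-mutating variant along the same lines is possible: find the index of the last non-None item, then build a new list from the slice. The following is only a sketch, not one of the functions benchmarked here, and it is untimed; the name `sanitize_copy` is hypothetical.

```python
def sanitize_copy(rows):
    """Return cleaned copies of the rows, leaving the input untouched."""
    result = []
    for row in rows:
        # Walk back from the end to find the last non-None position.
        end = len(row)
        while end and row[end - 1] is None:
            end -= 1
        if end:  # skip rows that are empty or entirely None
            result.append(['' if x is None else x for x in row[:end]])
    return result

rows = [["row 1 index 0", None, "row 1 index 2", None, None],
        [None, "row 2 index 1", None, None, None],
        [None, None, None, None, None]]
print(sanitize_copy(rows))
# [['row 1 index 0', '', 'row 1 index 2'], ['', 'row 2 index 1']]
```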

So, using the test set I originally made:

In [18]: rows = [["row 1 index 0",None,"row 1 index 2",None,None],
    ...:         [None,"row 2 index 1",None,None,None],
    ...:         [None,None,None,None,None]]

In [19]: rows = [r*1000 for r in rows]

In [20]: rowsbig = [list(r) for _ in range(1000) for r in rows]

In [21]: rows = rowsbig

In [22]: del rowsbig

In [23]: def test_set(source=rows):
    ...:     return [list(r) for r in source]
    ...:

In [24]: setup = "from __main__ import new_to_coding, taras_memory_op, sanitize, test_set; testrows = test_set()"

In [25]: # using `taras_memory_op`
    ...: timeit('taras_memory_op(testrows)', setup, number=10)
Out[25]: 6.563127358909696

In [26]: # using OP's method
    ...: timeit('new_to_coding(testrows)', setup, number=10)
Out[26]: 10.173962660133839

In [27]: # using `sanitize`
    ...: timeit('sanitize(testrows)', setup, number=10)
Out[27]: 6.3629974271170795

I'm sure there is a more compact way, but here is my take using list comprehensions:

data = [list(map(lambda x: '' if x is None else x, row)) for row in rows]
data = [row[:max(i for i, e in enumerate(row, 1) if e != '')] for row in data if set(row) != {''}]
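The slice bound in the second line may be easier to follow on a single row. `enumerate(row, 1)` numbers the items from 1, so the largest number attached to a non-empty item is exactly the slice end that keeps everything up to and including that item. A small sketch using a row from the question (the name `positions` is introduced here for illustration):

```python
row = ['', 'row 2 index 1', '', '', '']  # Nones already replaced by ''

# 1-based positions of the non-empty items.
positions = [i for i, e in enumerate(row, 1) if e != '']
print(positions)  # [2]

# The largest position is the exclusive end of the slice to keep.
print(row[:max(positions)])  # ['', 'row 2 index 1']
```

Note that `max` raises `ValueError` on an empty sequence, which is why rows consisting only of empty strings have to be filtered out in the same comprehension.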

