Most optimal way to exhaust a list of lists
I have a list of lists, built from a 2D array:
I need to pop one value from each list in turn until all of the lists are exhausted. For example:
4,3,4,9,2,82,5,4,23,3,56,7 for the lists above
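Because this list can get very large, I want to skip empty queues. For example:
My solution is to also keep a deque of the indexes of the queues that still have values, and loop like this: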
Loop continuously until there are no more indexes left in the deque of indexes to pop from. When it is done, every deque in the list is empty. The alternative is to simply iterate through the list of deques and remove[index_of_empty_queue] whenever one runs out. Deleting items from a very large list is extremely slow, especially towards the start of the list.
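The loop described above can be sketched as follows. This is only a minimal illustration, not the code from the question: the function name `drain_round_robin` and the sample input are hypothetical, and the sketch assumes that inner lists which start out empty should simply be skipped.

```python
from collections import deque

def drain_round_robin(lists):
    # One deque per inner list, so popping from the front is O(1).
    queues = [deque(lst) for lst in lists]
    # Indexes of the queues that still hold values (assumption: empty inputs are skipped).
    live = deque(i for i, q in enumerate(queues) if q)
    out = []
    while live:
        i = live.popleft()
        out.append(queues[i].popleft())
        # Re-queue the index only if that queue still has values left.
        if queues[i]:
            live.append(i)
    return out

print(drain_round_robin([[1, 2, 3], [], [4], [5, 6]]))  # [1, 4, 5, 2, 6, 3]
```

Each value is moved exactly once and each index is re-queued at most once per value, so the total work is proportional to the number of values.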
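Any thoughts on my approach, and is there a better way? I know that popleft and append on a deque are O(1). I just still don't know the overall performance implications of using this approach to essentially iterate over a list while also being able to remove items from the "list" (the deque) in O(1).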
Benchmarks with some solutions:
Best three from ten runs:
145 ms 147 ms 148 ms columns
202 ms 203 ms 204 ms chained_removers
219 ms 220 ms 221 ms like_interleave_longest
302 ms 303 ms 304 ms with_roundrobin
313 ms 314 ms 314 ms iterators
330 ms 331 ms 333 ms iterators3
336 ms 338 ms 338 ms iterators2
366 ms 368 ms 369 ms queues_optimized
471 ms 474 ms 475 ms queues_clean
737 ms 741 ms 746 ms queues
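The input is 1000 lists with random lengths from 1000 to 2000.

queues is your original solution (edit: which you had in the question but have now deleted).
queues_clean is the same, but without the indexes, and with normal truth-value tests instead of length checks.
queues_optimized is an optimized version of queues_clean.
iterators is like queues_optimized but uses iterators instead of queues.
iterators2 and iterators3 are other versions with iterators that I tried, replacing the outer deque with something else.
columns is a different approach. Think of the input data as rows. What you want is the concatenated columns. So prepare a list for each needed column, then distribute each input row across the columns. Finish by collecting the columns one after another.
chained_removers mainly zips all the lists. But it chains small remover iterators behind them, which remove their exhausted iterator and yield a marker that is then removed as well (from the values of the current "column"). It also uses an OrderedDict as its doubly linked list, allowing O(1) time deletion and subsequent O(length) time iteration.
with_roundrobin uses the roundrobin itertools recipe. Not sure whether it counts, because it "skips" exhausted iterators at a potentially very high cost, see below.
like_interleave_longest is like more_itertools.interleave_longest, but optimized for producing a list. It does not skip exhausted inner lists, but I included it in the benchmark out of curiosity.

I originally dismissed the roundrobin solution because your question made it look like you had many very short (even empty) inner lists. It is terrible for that case, for example for 10000 lists with random lengths from 1 to 5: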
3 ms 3 ms 3 ms like_interleave_longest
5 ms 6 ms 6 ms columns
8 ms 8 ms 8 ms iterators
8 ms 8 ms 8 ms iterators2
8 ms 8 ms 8 ms iterators3
9 ms 9 ms 10 ms queues_optimized
12 ms 12 ms 13 ms queues_clean
18 ms 18 ms 19 ms queues
26 ms 27 ms 29 ms chained_removers
3642 ms 3750 ms 3812 ms with_roundrobin
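Of the approaches benchmarked above, columns is probably the least obvious, so here is a minimal sketch of the idea on a tiny input. The function name `interleave_by_columns` and the sample rows are hypothetical, and it uses a plain loop rather than the faster `deque(map(...), 0)` trick from the benchmarked version.

```python
def interleave_by_columns(rows):
    # One bucket per column position; the longest row decides how many are needed.
    width = max(map(len, rows), default=0)
    columns = [[] for _ in range(width)]
    for row in rows:
        # zip stops at the end of the row, so short rows only touch the first buckets.
        for column, value in zip(columns, row):
            column.append(value)
    # Concatenating the buckets in order gives the round-robin order.
    result = []
    for column in columns:
        result.extend(column)
    return result

rows = [[1, 2, 3], [4], [5, 6]]  # hypothetical input
print(interleave_by_columns(rows))  # [1, 4, 5, 2, 6, 3]
```

Exhausted rows never need to be removed or skipped here, because a short row simply contributes nothing to the later columns.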
Full code:
# Queue-based variants
def queues(data):
    data_q = [deque(i) for i in data]
    data_i = deque([i for i in range(len(data_q))])
    return_list = []
    while len(data_i) > 0:
        index = data_i.popleft()
        return_list.append(data_q[index].popleft())
        if len(data_q[index]) != 0:
            data_i.append(index)
    return return_list

def queues_clean(data):
    queues = deque(map(deque, data))
    result = []
    while queues:
        queue = queues.popleft()
        result.append(queue.popleft())
        if queue:
            queues.append(queue)
    return result

def queues_optimized(data):
    queues = deque(map(deque, data))
    # Bind the hot methods to local names to avoid repeated attribute lookups.
    queues_pop = queues.popleft
    queues_push = queues.append
    result = []
    result_append = result.append
    while queues:
        queue = queues_pop()
        result_append(queue.popleft())
        if queue:
            queues_push(queue)
    return result
# Iterator-based variants
def iterators(data):
    iterators = deque(map(iter, data))
    iterators_pop = iterators.popleft
    iterators_push = iterators.append
    result = []
    result_append = result.append
    next_value = next
    while iterators:
        iterator = iterators_pop()
        try:
            result_append(next_value(iterator))
            iterators_push(iterator)
        except StopIteration:
            pass
    return result

def iterators2(data):
    iterators = list(map(iter, data))
    result = []
    result_append = result.append
    next_value = next
    while iterators:
        # Collect the iterators that survive this round in a fresh list.
        alive = []
        keep = alive.append
        for iterator in iterators:
            try:
                result_append(next_value(iterator))
                keep(iterator)
            except StopIteration:
                pass
        iterators = alive
    return result

def iterators3(data):
    iterators = list(map(iter, data))
    result = []
    result_append = result.append
    next_value = next
    while iterators:
        # Compact the surviving iterators to the front of the same list.
        keep = 0
        for iterator in iterators:
            try:
                result_append(next_value(iterator))
                iterators[keep] = iterator
                keep += 1
            except StopIteration:
                pass
        del iterators[keep:]
    return result
# Column-based and zip-based variants
def columns(data):
    columns = [[] for _ in range(max(map(len, data)))]
    for row in data:
        # deque(..., 0) only consumes the map, appending each row value to its column.
        deque(map(list.append, columns, row), 0)
    result = []
    for column in columns:
        result += column
    return result

def chained_removers(data):
    marker = object()
    def remover(i):
        # Runs when iterator i is exhausted: drop it and signal with the marker.
        del iterators[i]
        yield marker
    iterators = OrderedDict()
    for i, d in enumerate(data):
        iterators[i] = chain(d, remover(i))
    result = []
    while alive := len(iterators):
        for values in zip(*iterators.values()):
            if len(iterators) < alive:
                # At least one iterator ended in this round: filter out the markers.
                result += compress(values, map(is_not, values, repeat(marker)))
                break
            result += values
    return result
# roundrobin recipe and zip_longest variant
def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

def with_roundrobin(data):
    return list(roundrobin(*data))

def like_interleave_longest(data):
    marker = object()
    return [x
            for xs in zip_longest(*data, fillvalue=marker)
            for x in xs
            if x is not marker]
# Benchmark harness
funcs = [
    queues,
    queues_clean,
    queues_optimized,
    iterators,
    iterators2,
    iterators3,
    columns,
    chained_removers,
    with_roundrobin,
    like_interleave_longest,
]

from timeit import default_timer as time
from random import randint, shuffle
from bisect import insort
from collections import deque, OrderedDict
from itertools import cycle, islice, chain, compress, repeat, zip_longest
from operator import is_not

# Test case with 10000 very short lists. It is overridden by the next
# assignment; comment that one out to benchmark this case instead.
data = [list(range(1000, 1000 + randint(1, 5)))
        for _ in range(10000)]
# Test case with 1000 lists of random length from 1000 to 2000.
data = [list(range(1000, 1000 + randint(1000, 2000)))
        for _ in range(1000)]

# Check that every function agrees with the first one.
expect = funcs[0](data)
for func in funcs:
    assert func(data) == expect

# Time each function ten times in shuffled order, keeping its times sorted.
times = {func: [] for func in funcs}
for _ in range(10):
    shuffle(funcs)
    for func in funcs:
        t0 = time()
        func(data)
        t1 = time()
        insort(times[func], t1 - t0)

# Report the best three times per function, fastest function first.
for func in sorted(funcs, key=times.get):
    print(*('%4d ms ' % (t * 1e3) for t in times[func][:3]), func.__name__)