简体   繁体   中英

Split long list of tuples into nested lists

I have a long list of tuple that I need to turn into a nested list structure.

The long list of tuples is a list of data structured like this:

[(0.8, A), (-0.4, B), (1.0, C), (0.5, D), (-0.7, E)]

And I have a list of lengths like this:

[2, 2, 1]

My goal is to end with a nested list like this:

[[(0.8, A), (-0.4, B)], [(1.0, C), (0.5, D)], [(-0.7, E)]]

Essentially the length list tells how many tuples from the list of tuples to put in each nested list but I can't figure out how to do this.

Not so Pythonic but a for loop will do:

x = [(0.8, 'A'), (-0.4, 'B'), (1.0, 'C'), (0.5, 'D'), (-0.7, 'E')]
lns = [2, 2, 1]

res = []
start = 0
for ln in lns:
    res.append(x[start:start+ln])
    start += ln
    
print(res)

Output:

[[(0.8, 'A'), (-0.4, 'B')], [(1.0, 'C'), (0.5, 'D')], [(-0.7, 'E')]]

I tried to compare performance of the two approaches. One with iterators (using next):

import timeit

tuples = [(0.8, "a"), (-0.4, "b"), (1.0, "c"), (0.5, "d"), (-0.7, "e")]
nums = [2, 2, 1]

def collect(tuples, nums):
    tuples_iter = iter(tuples)
    res = []
    for num in nums:
         batch = [next(tuples_iter) for i in range(num)]
         res.append(batch)
    return res

times = 1000000
res = timeit.timeit('collect(tuples, nums)', globals=globals(), number=times)
print(res, res/times)

As a result it shows time:

$ py3 test.py
1.6941756070009433 1.6941756070009433e-06

Another one, with direct access to dict values:

import timeit

tuples = [(0.8, "a"), (-0.4, "b"), (1.0, "c"), (0.5, "d"), (-0.7, "e")]
nums = [2, 2, 1]

def collect(x, lns):
    res = []
    start = 0
    for ln in lns:
        res.append(x[start:start+ln])
        start += ln
    return res

times = 1000000
res = timeit.timeit('collect(tuples, nums)', globals=globals(), number=times)
print(res, res/times)

then I've got the runtime:

$ py3 test2.py
0.6738436879822984 6.738436879822985e-07

We see an apparent difference in the performance even with a small tuple list. Let's generate a long one by replacing initial conditions with:

tuples = [(0.8, "a") for i in range(10000)]
nums = [2 for i in range(5000)]

(it doesn't matter whether it is same tuple or different ones, and I generated same size pattern for simplicity)

also, I reduced number of "runs" to 10000, otherwise it would take a while to wait. so here is the result:

$ py3 test.py
45.54160467500333 0.0045541604675003325
$ py3 test2.py
18.69937555701472 0.001869937555701472

Looks like the solution with the "direct access" is 3 times faster.

Update

Well, to test "pop" method we have to reinstantiate state each time we run the test. For consistency I implemented all 3 approaches within the same framework:

from timeit import default_timer as timer


def prepare():
    tuples = [(0.8, "a") for i in range(10000)]
    nums = [2 for i in range(5000)]
    return tuples, nums


def collect_direct(x, lns):
    res = []
    start = 0
    for ln in lns:
        res.append(x[start:start+ln])
        start += ln
    return res


def collect_iter(tuples, nums):
    tuples_iter = iter(tuples)
    res = []
    for num in nums:
        batch = [next(tuples_iter) for i in range(num)]
        res.append(batch)
    return res


def collect_pop(source, pattern):
    res = []
    for size in pattern:
        # pop(0) takes the first element out of the list
        res.append([source.pop(0) for x in range(size)])
    return res


def test(func, times):
    total = 0
    for i in range(times):
        tuples, nums = prepare()
        start = timer()
        func(tuples, nums)
        total += timer() - start
    return total


times = 1000

print("Times", times)

print("Iter")
res = test(collect_iter, times)
print(res, res/times)

print("Direct")
res = test(collect_direct, times)
print(res, res/times)

print("Pop")
res = test(collect_pop, times)
print(res, res/times)

And here is the output:

$ py3 test4.py
Times 1000
Iter
2.9237239362555556 0.0029237239362555558
Direct
1.5656779609271325 0.0015656779609271325
Pop
22.700828287401237 0.022700828287401238

Because pop not only expected to be similar to "iteration" approach by complexity (we access each element of the initial list), but also has to remove the first element from the list.

As we have learned from the experiments, python lists are not really lists and have a high performance on the direct access. Thus, I would expect popping the elements from such a list might corrupt "index" that has to be rebalanced.

Conclusion: direct access is the winner. Moreover, it will be much greater winner for patterns containing larger chunks. For instance pattern [3,3,3...] I would expect to run 3 times faster on "direct access" in comparison to "iter", and [5, 5, 5, ...] 5 times faster correspondingly (though I didn't check this).

Iterate over your 'pattern' and consume your 'long list':

source = [(0.8, "A"), (-0.4, "B"), (1.0, "C"), (0.5, "D"), (-0.7, "E")]
pattern = [2, 2, 1]
res = []

for size in pattern:
    # pop(0) takes the first element out of the list
    res.append([source.pop(0) for x in range(size)])

print(res)

One-Liner:

res = [[source.pop(0) for x in range(size)] for size in pattern]

Out:

[[(0.8, 'A'), (-0.4, 'B')], [(1.0, 'C'), (0.5, 'D')], [(-0.7, 'E')]]

I would just a list comprehension:

source =[(0.8, 'A'), (-0.4, 'B'), (1.0, 'C'), (0.5, 'D'), (-0.7, 'E')]
lns = [2,2,1]

res = [[source.pop(0) for _ in range(ln)] for ln in lns]
print(res)

Output:

[[(0.8, 'A'), (-0.4, 'B')], [(1.0, 'C'), (0.5, 'D')], [(-0.7, 'E')]]

Warning: This solution consumes source , meaning that source will be an empty list after creating res . If you do not want that to happen, you have to create a copy of the list first.

Create an iterator over the list and simply pull the indicated amount each time:

l = [(0.8, 'A'), (-0.4, 'B'), (1.0, 'C'), (0.5, 'D'), (-0.7, 'E')]
sizes = [2, 2, 1]

it = iter(l)
res = [[next(it) for _ in range(size)] for size in sizes]
print(res)

Will give:

[[(0.8, 'A'), (-0.4, 'B')], 
 [(1.0, 'C'), (0.5, 'D')], 
 [(-0.7, 'E')]]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM