
Itertools groupby with lambda function, group sublists of a list together if they have matching values at indices 0 and 1

I have a list of lists like this:

data = [['a', 'b', 2000, 100], ['a', 'b', 4000, 500], ['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000], ['a', 'd', 2000, 100], ['a', 'd', 1000, 100]]

and I want to group them together if they have the same first two values. Output would be:

data = [(['a', 'b', 2000, 100], ['a', 'b', 4000, 500]), (['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000]), (['a', 'd', 2000, 100], ['a', 'd', 1000, 100])]

Sublists with the same first two values are always adjacent in the list, but the number of sublists in each group varies.

I tried this:

from itertools import groupby
data = [['a', 'b', 2000, 100], ['a', 'b', 4000, 500], ['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000], ['a', 'd', 2000, 100], ['a', 'd', 1000, 100]]
output = [list(group) for key, group in groupby(data, lambda x:x[0])]

new_data = []
for l in output:
    new_output = [tuple(group) for key, group in groupby(l, lambda x:x[1])]
    for grouped_sub in new_output:
        new_data.append(grouped_sub)

print(new_data)

and got the output:

[(['a', 'b', 2000, 100], ['a', 'b', 4000, 500]), (['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000]), (['a', 'd', 2000, 100], ['a', 'd', 1000, 100])]

This is exactly what I was looking for. However, my real list has len(data) = 1000000, and I suspect this would be much more efficient if I could skip the for loops entirely and have the groupby lambda consider both x[0] and x[1] when grouping. I do not yet understand very well how lambda functions work in groupby.

Modify the key lambda to return a tuple containing both elements:

groupby(data, lambda x: tuple(x[0:2]))

That is, it can be done in a single list comprehension:

>>> [tuple(group) for key, group in groupby(data, lambda x: tuple(x[0:2]))]
[(['a', 'b', 2000, 100], ['a', 'b', 4000, 500]), 
 (['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000]), 
 (['a', 'd', 2000, 100], ['a', 'd', 1000, 100])]
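
To see what the key function is doing, it can help to look at the key next to the size of each group. This is only a minimal sketch using the same data as the question; the names key_func and pairs are made up here for illustration.

```python
from itertools import groupby

data = [['a', 'b', 2000, 100], ['a', 'b', 4000, 500],
        ['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000],
        ['a', 'd', 2000, 100], ['a', 'd', 1000, 100]]

def key_func(row):
    # Hypothetical helper: the grouping key is the first two values as a tuple.
    return tuple(row[0:2])

# groupby calls key_func on every row and starts a new group
# whenever the returned key differs from the previous row's key.
pairs = [(key, len(list(group))) for key, group in groupby(data, key_func)]
print(pairs)
# [(('a', 'b'), 2), (('c', 'd'), 3), (('a', 'd'), 2)]
```

The lambda in the answer is just an inline version of key_func: whatever it returns is compared between neighbouring rows, so returning both values makes both of them count.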

Why not just group by the first two items directly:

from itertools import groupby

data = [['a', 'b', 2000, 100], ['a', 'b', 4000, 500], ['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000], ['a', 'd', 2000, 100], ['a', 'd', 1000, 100]]
res = [tuple(g) for k, g in groupby(data, key=lambda x: x[:2])]
print(res)

The output:

[(['a', 'b', 2000, 100], ['a', 'b', 4000, 500]), (['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000]), (['a', 'd', 2000, 100], ['a', 'd', 1000, 100])]
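
As a possible alternative to the lambda, operator.itemgetter(0, 1) from the standard library returns the same two values as a tuple. Keep in mind that groupby only merges consecutive rows, so all of this relies on the question's guarantee that matching rows are adjacent. The shuffled list below is invented purely to show what happens when they are not; sorting by the same key first is one way around that, at the cost of the sort.

```python
from itertools import groupby
from operator import itemgetter

data = [['a', 'b', 2000, 100], ['a', 'b', 4000, 500],
        ['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000],
        ['a', 'd', 2000, 100], ['a', 'd', 1000, 100]]

# itemgetter(0, 1)(row) returns the tuple (row[0], row[1]).
key = itemgetter(0, 1)
res = [tuple(g) for _, g in groupby(data, key=key)]
print(len(res))  # 3 groups, same as with the lambda

# Made-up example: the two ('a', 'b') rows are NOT adjacent here.
shuffled = [['a', 'b', 1], ['c', 'd', 2], ['a', 'b', 3]]
runs = [k for k, _ in groupby(shuffled, key=key)]
print(runs)  # [('a', 'b'), ('c', 'd'), ('a', 'b')]

# Sorting by the same key first brings matching rows together.
sorted_runs = [k for k, _ in groupby(sorted(shuffled, key=key), key=key)]
print(sorted_runs)  # [('a', 'b'), ('c', 'd')]
```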


 