
Python - group by multiple columns

I have a list of lists - representing a table with 4 columns and many rows (10000+).

Each sub-list contains 4 variables.

Here is a small part of my table:

['1810569', 'a', 5, '1241.52']
['1437437', 'a', 5, '1123.90']
['1437437', 'b', 5, '1232.43']
['1810569', 'b', 5, '1321.31']
['1810569', 'a', 5, '1993.52']

The first column is the household ID, and the second is the member ID within that household.

The fourth column holds the weights that I want to sum separately for each member.

For the example above I want the output to be:

['1810569', 'a', 5, '3235.04']
['1437437', 'a', 5, '1123.90']
['1437437', 'b', 5, '1232.43']
['1810569', 'b', 5, '1321.31']

In other words, sum the weights in lines 1 and 5, since they belong to the same user, while all the other users are distinct.

I saw something about group by in pandas - but I didn't understand how exactly to use it for my problem.

Assuming your list looks like the following, this approach would work:

In [192]:
l=[['1810569', 'a', 5, '1241.52'],
['1437437', 'a', 5, '1123.90'],
['1437437', 'b', 5, '1232.43'],
['1810569', 'b', 5, '1321.31'],
['1810569', 'a', 5, '1993.52']]
l

Out[192]:
[['1810569', 'a', 5, '1241.52'],
 ['1437437', 'a', 5, '1123.90'],
 ['1437437', 'b', 5, '1232.43'],
 ['1810569', 'b', 5, '1321.31'],
 ['1810569', 'a', 5, '1993.52']]

In [201]:
# construct the df and convert the last column to float
import pandas as pd

df = pd.DataFrame(l, columns=['household ID', 'Member ID', 'some col', 'weights'])
df['weights'] = df['weights'].astype(float)
df

Out[201]:
  household ID Member ID  some col  weights
0      1810569         a         5  1241.52
1      1437437         a         5  1123.90
2      1437437         b         5  1232.43
3      1810569         b         5  1321.31
4      1810569         a         5  1993.52

We can now group by the household ID and member ID and call sum on the 'weights' column (note that 'some col' is not part of the result):

In [200]:
df.groupby(['household ID', 'Member ID'])['weights'].sum().reset_index()

Out[200]:
  household ID Member ID  weights
0      1437437         a  1123.90
1      1437437         b  1232.43
2      1810569         a  3235.04
3      1810569         b  1321.31
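
If you also need the third column and the original row order, as in the desired output from the question, one option is to group on all three key columns and pass sort=False. The following is only a sketch under those assumptions; the names rows, summed and result are illustrative, and the weights are formatted back to two-decimal strings to match the question:

```python
import pandas as pd

rows = [
    ['1810569', 'a', 5, '1241.52'],
    ['1437437', 'a', 5, '1123.90'],
    ['1437437', 'b', 5, '1232.43'],
    ['1810569', 'b', 5, '1321.31'],
    ['1810569', 'a', 5, '1993.52'],
]

df = pd.DataFrame(rows, columns=['household ID', 'Member ID', 'some col', 'weights'])
df['weights'] = df['weights'].astype(float)

# Group on all three key columns so 'some col' is kept in the result.
# sort=False keeps the groups in the order they first appear in the data;
# as_index=False returns the keys as ordinary columns.
summed = df.groupby(['household ID', 'Member ID', 'some col'],
                    as_index=False, sort=False)['weights'].sum()

# Convert back to a list of lists, formatting each sum as a two-decimal string.
result = [[h, m, c, f'{w:.2f}']
          for h, m, c, w in summed.itertuples(index=False, name=None)]
print(result)
```

Whether you want the sums as strings or floats depends on what you do with them next; drop the f-string formatting to keep floats.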

You could do it with a dict, using the first three elements as keys to group the data by:

d = {}
for k, b, c, w in l:
    if (k, b, c) in d:
        d[k, b, c][-1] += float(w)
    else:
        d[k, b, c] = [k, b, c, float(w)]

from pprint import pprint as pp

pp(list(d.values()))

Output (the order is arbitrary here, since dicts in older Python versions are unordered):

[['1810569', 'b', 5, 1321.31],
 ['1437437', 'b', 5, 1232.43],
 ['1437437', 'a', 5, 1123.9],
 ['1810569', 'a', 5, 3235.04]]

To keep first-seen order, use an OrderedDict (from Python 3.7 on, a plain dict preserves insertion order as well):

from collections import OrderedDict
d = OrderedDict()
for k, b, c, w in l:
    if (k, b, c) in d:
        d[k, b, c][-1] += float(w)
    else:
        d[k, b, c] = [k, b, c, float(w)]

from pprint import pprint as pp

pp(list(d.values()))

Output:

[['1810569', 'a', 5, 3235.04],
 ['1437437', 'a', 5, 1123.9],
 ['1437437', 'b', 5, 1232.43],
 ['1810569', 'b', 5, 1321.31]]
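
The same dict-based grouping can be written a little more compactly with collections.defaultdict, which removes the need for the if/else branch. This is just a sketch, assuming Python 3.7+ (so that insertion order is preserved) and assuming you want the sums back as two-decimal strings as in the question; rows, totals and result are illustrative names:

```python
from collections import defaultdict

rows = [
    ['1810569', 'a', 5, '1241.52'],
    ['1437437', 'a', 5, '1123.90'],
    ['1437437', 'b', 5, '1232.43'],
    ['1810569', 'b', 5, '1321.31'],
    ['1810569', 'a', 5, '1993.52'],
]

# Missing keys start at 0.0, so every row can simply be added to its group.
totals = defaultdict(float)
for household, member, col, weight in rows:
    totals[household, member, col] += float(weight)

# Rebuild the rows in first-seen order, formatting each sum as a string.
result = [[h, m, c, f'{total:.2f}'] for (h, m, c), total in totals.items()]
print(result)
```

If exact decimal arithmetic matters for the weights, decimal.Decimal could be used in place of float, at some cost in speed.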

