Group list-of-tuples by second element, take average of first element

I have a list of tuples (x,y) like:

l = [(2,1), (4,6), (3,1), (2,7), (7,10)]

Now I want to make a new list:

l = [(2.5,1), (4,6), (2,7), (7,10)]

where tuples that share the same second value (y) are merged into a single tuple whose first value is the average of their first values (x).

Here, (2,1) and (3,1) share y=1, so the new list contains the average of x=2 and x=3 for that y. No other y value occurs more than once, so the remaining tuples are unchanged.

Since you tagged pandas:

import pandas as pd

l = [(2,1), (4,6), (3,1), (2,7), (7,10)]
df = pd.DataFrame(l)

Then df is a DataFrame with two columns, labeled 0 and 1:

    0   1
0   2   1
1   4   6
2   3   1
3   2   7
4   7   10

Now compute the average of the numbers in column 0 for each distinct value in column 1:

(df.groupby(1).mean()     # compute mean on each group
   .reset_index()[[0,1]]  # restore the column order
   .values                # return the underlying numpy array
 )

Output:

array([[ 2.5,  1. ],
       [ 4. ,  6. ],
       [ 2. ,  7. ],
       [ 7. , 10. ]])
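
If a list of tuples is wanted rather than a NumPy array, one possible variation is sketched below. The column names "x" and "y" are assumptions introduced here for readability; they are not part of the answer above.

```python
import pandas as pd

l = [(2, 1), (4, 6), (3, 1), (2, 7), (7, 10)]

# "x" and "y" are hypothetical column names chosen for this sketch
df = pd.DataFrame(l, columns=["x", "y"])

# Mean of x within each y group; the result is a Series indexed by y
means = df.groupby("y")["x"].mean()

# Series.items() yields (index label, value) pairs, i.e. (y, mean of x)
result = [(x, y) for y, x in means.items()]
print(result)
```

Note that groupby sorts the group keys by default, so the tuples come out ordered by y rather than by first appearance.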

First build a dict that maps each second element to the list of first elements that occur with it. Then a list comprehension over the dict's items produces the desired output.

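from collections import defaultdict

# Map each second element (y) to the list of first elements (x) seen with it
out = defaultdict(list)
for x, y in l:
    out[y].append(x)

# Average each list; dicts preserve insertion order, so the y values keep their first-seen order
out = [(sum(v) / len(v), k) for k, v in out.items()]
print(out)
# prints [(2.5, 1), (4.0, 6), (2.0, 7), (7.0, 10)]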
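
As a hedged variation on the same idea, the sketch below keeps only a running sum and count per key instead of storing every value, which may matter for long inputs. The name totals is introduced here for illustration.

```python
from collections import defaultdict

l = [(2, 1), (4, 6), (3, 1), (2, 7), (7, 10)]

# totals maps y -> [sum of x, number of x]; the name is illustrative
totals = defaultdict(lambda: [0, 0])
for x, y in l:
    totals[y][0] += x
    totals[y][1] += 1

# Divide each sum by its count to get the per-key mean
out = [(s / n, y) for y, (s, n) in totals.items()]
print(out)
```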

Another way, using itertools.groupby (which only groups consecutive items, so the list must be sorted by the key first):

from itertools import groupby

# Sort list by the second element
sorted_list = sorted(l,key=lambda x:x[1])

# Group by second element
grouped_list = groupby(sorted_list, key=lambda x:x[1])

result = []
for y, group in grouped_list:
    # Collect the first elements of every tuple in this group
    xs = [x for x, _ in group]
    # Take the mean of the first elements
    result.append((sum(xs) / len(xs), y))

You get:

[(2.5, 1), (4.0, 6), (2.0, 7), (7.0, 10)]
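
The sort step matters because itertools.groupby starts a new group every time the key changes. A small sketch of the difference, using the same list (the variable names are illustrative):

```python
from itertools import groupby

l = [(2, 1), (4, 6), (3, 1), (2, 7), (7, 10)]

def second(t):
    # Grouping key: the second element of the tuple
    return t[1]

# Without sorting, y=1 appears in two separate groups
unsorted_keys = [k for k, _ in groupby(l, key=second)]

# After sorting by the key, each y value forms exactly one group
sorted_keys = [k for k, _ in groupby(sorted(l, key=second), key=second)]

print(unsorted_keys)  # [1, 6, 1, 7, 10]
print(sorted_keys)    # [1, 6, 7, 10]
```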

Here is a method using numpy.bincount. It relies on the labels (the second elements) being nonnegative integers. If that is not the case, one can first map them to integer codes with np.unique(i, return_inverse=True).

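import numpy as np

w, i = zip(*l)                             # w: first elements, i: integer labels
n, d = np.bincount(i, w), np.bincount(i)   # per-label sums and per-label counts
v, = np.where(d)                           # labels that actually occur
list(zip((n[v] / d[v]).tolist(), v.tolist()))
# [(2.5, 1), (4.0, 6), (2.0, 7), (7.0, 10)]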
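
For labels that are not nonnegative integers, the np.unique(..., return_inverse=True) step mentioned above might look like the following sketch. The input with string labels is made up for illustration and is not from the question.

```python
import numpy as np

# Hypothetical input: the labels are strings, so bincount cannot use them directly
l = [(2, "a"), (4, "b"), (3, "a"), (2, "c"), (7, "b")]

w, labels = zip(*l)

# uniq holds the sorted distinct labels; inv gives each label's index into uniq
uniq, inv = np.unique(labels, return_inverse=True)
inv = inv.ravel()  # make sure the codes are 1-D

sums = np.bincount(inv, weights=w)  # sum of first elements per label
counts = np.bincount(inv)           # number of tuples per label

result = list(zip((sums / counts).tolist(), uniq.tolist()))
print(result)  # [(2.5, 'a'), (5.5, 'b'), (2.0, 'c')]
```

Because the integer codes are dense, every count is at least one and no np.where filtering is needed.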
