简体   繁体   中英

pythonic way to aggregate arrays (numpy or not)

I would like to make a nice function to aggregate data among an array (it's a numpy record array, but it does not change anything)

you have an array of data that you want to aggregate among one axis: for example an array of dtype=[(name, (np.str_,8), (job, (np.str_,8), (income, np.uint32)] and you want to have the mean income per job

I did this function, and in the example it should be called as aggregate(data,'job','income',mean)


def aggregate(data, key, value, func):

    data_per_key = {}

    for k,v in zip(data[key], data[value]):

        if k not in data_per_key.keys():

            data_per_key[k]=[]

        data_per_key[k].append(v)

    return [(k,func(data_per_key[k])) for k in data_per_key.keys()]

the problem is that I find it not very nice I would like to have it in one line: do you have any ideas?

Thanks for your answer Louis

PS: I would like to keep the func in the call so that you can also ask for median, minimum...

Your if k not in data_per_key.keys() could be rewritten as if k not in data_per_key , but you can do even better with defaultdict . Here's a version that uses defaultdict to get rid of the existence check:

import collections

def aggregate(data, key, value, func):
    data_per_key = collections.defaultdict(list)
    for k,v in zip(data[key], data[value]):
        data_per_key[k].append(v)

    return [(k,func(data_per_key[k])) for k in data_per_key.keys()]

Perhaps the function you are seeking is matplotlib.mlab.rec_groupby :

import matplotlib.mlab

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

result=matplotlib.mlab.rec_groupby(data, ('job',), (('income',np.mean,'avg_income'),))

yields

('Digger', 4.0)
('Planter', 2.5)
('Waterer', 3.0)

matplotlib.mlab.rec_groupby returns a recarray:

print(result.dtype)
# [('job', '|S7'), ('avg_income', '<f8')]

You may also be interested in checking out pandas , which has even more versatile facilities for handling group-by operations .

Update 2022:

There is a package which emulates the functionality of matlabs accumarray quite well. You can install it via pip install numpy_groupies or find it here:

https://github.com/ml31415/numpy-groupies

Best flexibility and readability is get using pandas :

import pandas

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

df = pandas.DataFrame(data)
result = df.groupby('job').mean()

Yields to :

         income
job
Digger      4.0
Planter     2.5
Waterer     3.0

Pandas DataFrame is a great class to work with, but you can get back your results as you need:

result.to_records()
result.to_dict()
result.to_csv()

And so on...

Best performance is achieved using ndimage.mean from scipy . This will be twice faster than accepted answer for this small dataset, and about 3.5 times faster for larger inputs:

from scipy import ndimage

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

unique = np.unique(data['job'])
result=np.dstack([unique, ndimage.mean(data['income'], data['job'], unique)])

Will yield to:

array([[['Digger', '4.0'],
        ['Planter', '2.5'],
        ['Waterer', '3.0']]],
      dtype='|S32')

EDIT: with bincount (faster!)

This is about 5x faster than accepted answer for the small example input, if you repeat the data 100000 times it will be around 8.5x faster:

unique, uniqueInd, uniqueCount = np.unique(data['job'], return_inverse=True, return_counts=True)
means = np.bincount(uniqueInd, data['income'])/uniqueCount
return np.dstack([unique, means])

http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#dictionary-get-method

should help to make it a little prettier, more pythonic, more efficient possibly. I'll come back later to check on your progress. Maybe you can edit the function with this in mind? Also see the next couple of sections.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM