Consolidate data in numpy arrays

I'm currently working with data in numpy arrays, and was wondering if there was a way to group the data by a certain column and have the underlying data combined as individual arrays nested under the grouped item. That probably sounds confusing, so hopefully this example makes a bit more sense:

    Array_1: 
    [[x, y, z, 1],
     [x, b, z, 2],
     [a, b, z, 3],
     [a, c, z, 4]]

I'd like it to come out like this:

    Array_New:
    [
      [x,
        [
          [y], [
            [z],[[1],[2]]
          ]
        ],
        [
          [b], [
            [z],[[2],[3]]
          ]
        ]
    ...]

Essentially, the hierarchy I'm trying to get is this: if the values in the first column match, combine everything below them under one value as a series of subarrays (not one entry, like I see with append()). Where the second column also matches, combine everything under one value for the second column.

So from my example using Array_1 we would have [x], then under [x], [y] and [b]. Under [y] I would have [z], then under [z], [1] and [2]. Under [b] I would also have [z], but then under [z] I would have [2] and [3].

Does anyone know the best way to do this? I tried working with numpy's vstack/hstack, but couldn't get it to work. I feel like there must be a better way to do this than iterating through each entry, seeing if it matches any others, etc. Pandas' groupby function was close, but doesn't give you the ability to preserve this hierarchy (if there were two b's in column 2, their specific z's would not be assigned to a single b). I also tried using a DataSet from pandas, but I'm not very familiar with using those, so after trying for a while I figured I'd come here. Any help is much appreciated.

This sort of problem is exactly what pandas is designed for. It will allow you to group, index, aggregate or filter your data in more or less any way you can imagine.

Let's start with your example array:

import pandas as pd

a1 = pd.DataFrame({'group':['x', 'x', 'a', 'a'],
                   'product':['y', 'b', 'b', 'c'],
                   'date':['z1', 'z2', 'z3', 'z4'],
                   'performance':range(1, 5)})
print(a1)

#   date group  performance product
# 0   z1     x            1       y
# 1   z2     x            2       b
# 2   z3     a            3       b
# 3   z4     a            4       c

You wrote: "Pandas' groupby function was close, but doesn't give you the ability to preserve this hierarchy (if there were two b's in column 2 their specific z's would not be assigned to a single b)."

Did you know that you can group by multiple columns simultaneously by passing a list of column names to .groupby()?

for name, group in a1.groupby(['group', 'product']):
    print(name)
    print(group)

# ('a', 'b')
#   date group  performance product
# 2   z3     a            3       b
# ('a', 'c')
#   date group  performance product
# 3   z4     a            4       c
# ('x', 'b')
#   date group  performance product
# 1   z2     x            2       b
# ('x', 'y')
#   date group  performance product
# 0   z1     x            1       y
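If what you want under each (group, product) pair is just the collected values rather than a sub-DataFrame, one possible approach is to aggregate each group into a Python list. This is only a sketch using the same example data; the names nested and nested_dict are made up for illustration.

```python
import pandas as pd

a1 = pd.DataFrame({'group': ['x', 'x', 'a', 'a'],
                   'product': ['y', 'b', 'b', 'c'],
                   'date': ['z1', 'z2', 'z3', 'z4'],
                   'performance': range(1, 5)})

# Collect every performance value that shares a (group, product) pair
# into one Python list per pair. The result is a Series with a
# two-level index.
nested = a1.groupby(['group', 'product'])['performance'].apply(list)

# Convert to a plain dict keyed by (group, product) tuples.
nested_dict = nested.to_dict()
print(nested_dict)
```

With more rows per pair, each list would simply grow, so two rows with the same group and product end up under the same key.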

You could achieve your desired 'hierarchical' organisation using multilevel indexing:

# set_index returns a new DataFrame, so keep the result
a1_indexed = a1.set_index(['group', 'product'])
print(a1_indexed)

#               date  performance
# group product
# x     y         z1            1
#       b         z2            2
# a     b         z3            3
#       c         z4            4
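Once the frame has a multilevel index, you can drill down level by level with .loc, which is roughly the "everything under x" lookup from the question. A minimal sketch, assuming the same example data; the variable names indexed, x_rows and xb_performance are illustrative only, and the index is sorted first so that lookups behave predictably.

```python
import pandas as pd

a1 = pd.DataFrame({'group': ['x', 'x', 'a', 'a'],
                   'product': ['y', 'b', 'b', 'c'],
                   'date': ['z1', 'z2', 'z3', 'z4'],
                   'performance': range(1, 5)})

# Build the two-level index and sort it.
indexed = a1.set_index(['group', 'product']).sort_index()

# Everything under group 'x': a DataFrame indexed by 'product' alone.
x_rows = indexed.loc['x']
print(x_rows)

# One value, addressed by both index levels plus a column name.
xb_performance = indexed.loc[('x', 'b'), 'performance']
print(xb_performance)
```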

NumPy can't do this, because NumPy works with multidimensional rectangular/cuboidal/etc. arrays, in which every row along a given axis has the same length. Your output array is not one of these, but a jagged array.
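If you want a jagged, nested structure outside of pandas, ordinary Python containers can hold it. The following is a rough sketch, not a NumPy feature: it assumes the rows are available as plain tuples, with the strings 'x', 'y', 'z' and so on standing in for the placeholder values from the question, and the names rows, tree and to_plain are hypothetical.

```python
from collections import defaultdict

# Stand-in for Array_1; the strings are placeholders for the real values.
rows = [
    ('x', 'y', 'z', 1),
    ('x', 'b', 'z', 2),
    ('a', 'b', 'z', 3),
    ('a', 'c', 'z', 4),
]

# tree[first][second][third] is the list of values from the last column.
tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for first, second, third, value in rows:
    tree[first][second][third].append(value)


def to_plain(node):
    """Recursively convert nested defaultdicts into ordinary dicts."""
    if isinstance(node, defaultdict):
        return {key: to_plain(child) for key, child in node.items()}
    return node


nested = to_plain(tree)
print(nested)
```

This still visits each row once, but it avoids comparing every row against every other row, and lists of different lengths can sit side by side without any padding.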
