
Python Pandas: Find the maximum for each row in a dataframe column containing a numpy array

I have a Pandas DataFrame that looks like this:

      values                                      max_val_idx
0    np.array([-0.649626, -0.662434, -0.611351])            2
1    np.array([-0.994942, -0.990448, -1.01574])             1
2    np.array([-1.012, -1.01034, -1.02732])                 0

df['values'] contains numpy arrays with a fixed length of 3 elements.
df['max_val_idx'] contains the index of the maximum value of the corresponding array.

Since the index of the maximum element for each array is already given, what is the most efficient way to extract the maximum for each entry?
I know this is an awkward way to store the data, but I didn't create it myself. Since I have a lot of data to process (roughly 50 GB, spread over several hundred pickled databases stored in a similar way), I'd like to know the most time-efficient method.

So far I have tried looping over each element of df['max_val_idx'] and using it as an index into the corresponding array in df['values']:

max_val = []
for idx, values in enumerate(df['values']):
    max_val.append(values[int(df['max_val_idx'].iloc[idx])])

Is there any faster alternative to this?

I would just ignore the 'max_val_idx' column. I don't think it saves any time, and it makes the syntax more awkward. Sample data:

df = pd.DataFrame({ 'x': range(3) }).applymap( lambda x: np.random.randn(3) )

                                                   x
0  [-1.17106202376, -1.61211460669, 0.0198122724315]
1    [0.806819945736, 1.49139051675, -0.21434675401]
2  [-0.427272615966, 0.0939459129359, 0.496474566...

You could extract the max like this:

df.applymap( lambda x: x.max() )

          x  
0  0.019812
1  1.491391
2  0.496475

But generally speaking, life is easier if you have one number per cell. If each cell has an array of length 3, you could rearrange like this:

for i, v in enumerate(list('abc')):
    df[v] = df.x.map( lambda x: x[i] )
df = df[list('abc')]

          a         b         c
0 -1.171062 -1.612115  0.019812
1  0.806820  1.491391 -0.214347
2 -0.427273  0.093946  0.496475
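The same rearrangement can also be done in one step by stacking the per-row arrays into a 2D matrix. This is only a sketch: the sample data, the seed, and the name `wide` are made up for illustration, and it assumes every cell holds an array of the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one length-3 array per cell, as in the example above.
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': [rng.standard_normal(3) for _ in range(3)]})

# Stack the per-row arrays into an (n_rows, 3) matrix, then wrap it in a
# DataFrame with one number per cell. Assumes all arrays have equal length.
wide = pd.DataFrame(
    np.stack(df['x'].tolist()),
    columns=list('abc'),
    index=df.index,
)
print(wide)
```

This avoids one pass over the column per output column, at the cost of building the full matrix in memory.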

And then do a standard pandas operation:

df.apply( max, axis=1 )

0    0.019812
1    1.491391
2    0.496475
dtype: float64

Admittedly, this is not much easier than above, but overall the data will be much easier to work with in this form.
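Once the data has one number per cell, the built-in reductions are probably a better fit than applying Python's max row by row. A minimal sketch, reusing the rounded values printed above (the names `wide`, `row_max` and `row_argmax` are illustrative):

```python
import pandas as pd

# The rearranged frame from above, typed in from the rounded printed values.
wide = pd.DataFrame(
    [[-1.171062, -1.612115, 0.019812],
     [0.806820, 1.491391, -0.214347],
     [-0.427273, 0.093946, 0.496475]],
    columns=list('abc'),
)

row_max = wide.max(axis=1)        # vectorized maximum of each row
row_argmax = wide.idxmax(axis=1)  # column label where each maximum sits
print(row_max)
print(row_argmax)
```

Both calls run in compiled code over the whole frame, so they should scale better than a per-row Python function, although that is worth timing on the real data.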

I don't know how the speed of this will compare, since I'm constructing a 2D matrix of all the rows, but here's a possible solution:

>>> np.choose(df['max_val_idx'], np.array(df['values'].tolist()).T)
0   -0.611351
1   -0.990448
2   -1.012000
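A closely related option is NumPy integer-array indexing on the same stacked matrix, which picks one element per row using the stored index. This is a sketch built from the question's sample data; the names `stacked`, `idx` and `picked` are illustrative, and whether it beats np.choose on the real files would need to be measured.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'values': [
        np.array([-0.649626, -0.662434, -0.611351]),
        np.array([-0.994942, -0.990448, -1.01574]),
        np.array([-1.012, -1.01034, -1.02732]),
    ],
    'max_val_idx': [2, 1, 0],
})

stacked = np.stack(df['values'].tolist())       # shape (n_rows, 3)
idx = df['max_val_idx'].to_numpy(dtype=int)     # stored index for each row
picked = stacked[np.arange(len(stacked)), idx]  # element idx[i] of row i
print(picked)
```

If the stored index cannot be trusted, stacked.max(axis=1) recomputes the maximum directly from the values instead.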
