
Python Pandas: Find the maximum for each row in a dataframe column containing a numpy array

I have a Pandas DataFrame that looks like the following:

      values                                      max_val_idx
0    np.array([-0.649626, -0.662434, -0.611351])            2
1    np.array([-0.994942, -0.990448, -1.01574])             1
2    np.array([-1.012, -1.01034, -1.02732])                 0

df['values'] contains numpy arrays of a fixed length of 3 elements.
df['max_val_idx'] contains the index of the maximum value of the corresponding array.

Since the index of the maximum element for each array is already given, what is the most efficient way to extract the maximum for each entry?
I know the data is stored in a somewhat silly way, but I didn't create it myself. And since I have a lot of data to process (roughly 50 GB, as several hundred pickled databases stored in a similar way), I'd like to know the most time-efficient method.

So far I have tried looping over each element of df['max_val_idx'] and using it as an index into the corresponding array in df['values']:

max_val = []
for idx, values in enumerate(df['values']):
    max_val.append(values[int(df['max_val_idx'].iloc[idx])])

Is there a faster alternative to this?

I would just forget the 'max_val_idx' column. I don't think it saves time, and it actually makes the syntax more painful. Sample data:

df = pd.DataFrame({ 'x': range(3) }).applymap( lambda x: np.random.randn(3) )

                                                   x
0  [-1.17106202376, -1.61211460669, 0.0198122724315]
1    [0.806819945736, 1.49139051675, -0.21434675401]
2  [-0.427272615966, 0.0939459129359, 0.496474566...

You could extract the max like this:

df.applymap( lambda x: x.max() )

          x  
0  0.019812
1  1.491391
2  0.496475
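
As a rough alternative sketch, assuming every array in the column has the same length, the column can be stacked into one 2D array and reduced row-wise. This also sidesteps applymap, which newer pandas releases deprecate in favour of DataFrame.map. The names rng, stacked and row_max are made up for this example, and I have not timed it against the loop in the question.

```python
import numpy as np
import pandas as pd

# Sample data: one length-3 array per cell (seeded so the run is repeatable).
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': [rng.standard_normal(3) for _ in range(3)]})

# Stack the per-row arrays into a single (n_rows, 3) array.
stacked = np.stack(df['x'].tolist())

# Reduce along axis 1 to get one maximum per row, keeping the original index.
row_max = pd.Series(stacked.max(axis=1), index=df.index)
print(row_max)
```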

But generally speaking, life is easier if you have one number per cell. If each cell has an array of length 3, you could rearrange like this:

for i, v in enumerate(list('abc')): df[v] = df.x.map( lambda x: x[i] )
df = df[list('abc')]

          a         b         c
0 -1.171062 -1.612115  0.019812
1  0.806820  1.491391 -0.214347
2 -0.427273  0.093946  0.496475
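
The same rearrangement can probably be done in one step by handing the list of arrays to the DataFrame constructor. This is only a sketch: it assumes all arrays have length 3, and the name wide is made up for the example.

```python
import numpy as np
import pandas as pd

# Sample data: one length-3 array per cell (seeded so the run is repeatable).
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': [rng.standard_normal(3) for _ in range(3)]})

# Build a frame with one number per cell; each array becomes one row.
wide = pd.DataFrame(df['x'].tolist(), columns=list('abc'), index=df.index)
print(wide)
```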

And then do a standard pandas operation:

df.apply( max, axis=1 )

0    0.019812
1    1.491391
2    0.496475
dtype: float64
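
Once the data has one number per cell, the built-in DataFrame.max with axis=1 gives the same result without calling Python's max once per row. A small sketch using the rounded values from the table above (the names wide and row_max are just for this example):

```python
import pandas as pd

# The rearranged frame from above, with values rounded as printed.
wide = pd.DataFrame({
    'a': [-1.171062, 0.806820, -0.427273],
    'b': [-1.612115, 1.491391, 0.093946],
    'c': [0.019812, -0.214347, 0.496475],
})

# Row-wise maximum computed by pandas itself.
row_max = wide.max(axis=1)
print(row_max)
```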

Admittedly, this is not much easier than above, but overall the data will be much easier to work with in this form.

I don't know how the speed of this will compare, since I'm constructing a 2D matrix of all the rows, but here's a possible solution:

>>> np.choose(df['max_val_idx'], np.array(df['values'].tolist()).T)
0   -0.611351
1   -0.990448
2   -1.012000
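
A variation on the same idea, in case np.choose feels obscure: build the 2D matrix once and pick one element per row with integer (fancy) indexing, or with np.take_along_axis. This is a sketch using the data from the question; the names arr, idx, picked and picked_alt are made up here, and I have not benchmarked it on the full data set.

```python
import numpy as np
import pandas as pd

# Data as given in the question.
df = pd.DataFrame({
    'values': [
        np.array([-0.649626, -0.662434, -0.611351]),
        np.array([-0.994942, -0.990448, -1.01574]),
        np.array([-1.012, -1.01034, -1.02732]),
    ],
    'max_val_idx': [2, 1, 0],
})

# One (n_rows, 3) array holding all values, plus the per-row column index.
arr = np.array(df['values'].tolist())
idx = df['max_val_idx'].to_numpy().astype(int)

# Fancy indexing: element (i, idx[i]) for every row i.
picked = pd.Series(arr[np.arange(len(arr)), idx], index=df.index)

# Equivalent selection with np.take_along_axis.
picked_alt = np.take_along_axis(arr, idx[:, None], axis=1)[:, 0]
print(picked)
```

Both selections only read the element that max_val_idx points at, so they return whatever the stored index says rather than recomputing the maximum.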


