
Pandas dataframe from numpy array with multiindex

I'm working with a numpy array called array_test with shape (5, 359, 2), as reported by array_test.shape. The array holds the mean and uncertainty for observations in 5 repetitions of an experiment.

The goal is to estimate the mean value of each observation across the 5 repetitions of the experiment, and the total uncertainty per observation, also as a mean across the 5 repetitions.

I need to create a pandas DataFrame from it, I believe with a MultiIndex in which the first level has 5 values from the first dimension (named simply '1', '2', etc.) and the second level is 'mean' and 'uncertainty'.

Suggestions are more than welcome!

IIUC, you might want to aggregate in numpy, then construct a DataFrame and stack:

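import numpy as np
import pandas as pd

a = np.random.random((5, 359, 2))

# Average over the 359 observations (axis 1), leaving one row per experiment
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
                   columns=['mean', 'uncertainty']).stack()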

Output (a Series):

1  mean           0.499102
   uncertainty    0.511757
2  mean           0.480295
   uncertainty    0.473132
3  mean           0.500507
   uncertainty    0.519352
4  mean           0.505443
   uncertainty    0.493672
5  mean           0.514302
   uncertainty    0.519299
dtype: float64

For a DataFrame:

out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
                   columns=['mean', 'uncertainty']).stack().to_frame('value')

Output:

                  value
1 mean         0.499102
  uncertainty  0.511757
2 mean         0.480295
  uncertainty  0.473132
3 mean         0.500507
  uncertainty  0.519352
4 mean         0.505443
  uncertainty  0.493672
5 mean         0.514302
  uncertainty  0.519299
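
If the layout described in the question is what is wanted (one row per observation, with column levels for repetition and quantity), a minimal sketch could look like the following. The random data stands in for array_test, and the level names "experiment" and "quantity" are assumptions chosen for illustration, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same shape as array_test in the question
rng = np.random.default_rng(0)
array_test = rng.random((5, 359, 2))
n_exp, n_obs, n_val = array_test.shape

# Column MultiIndex: ('1', 'mean'), ('1', 'uncertainty'), ('2', 'mean'), ...
# The level names are illustrative assumptions.
columns = pd.MultiIndex.from_product(
    [[str(i) for i in range(1, n_exp + 1)], ["mean", "uncertainty"]],
    names=["experiment", "quantity"],
)

# Put observations on axis 0, then flatten (experiment, quantity) into columns
wide = pd.DataFrame(
    array_test.transpose(1, 0, 2).reshape(n_obs, n_exp * n_val),
    columns=columns,
)
wide.index.name = "observation"

# Mean of the 'mean' values across the 5 repetitions, one value per observation
per_obs_mean = wide.xs("mean", axis=1, level="quantity").mean(axis=1)
```

Here wide[("3", "mean")] selects the mean column of the third repetition, and per_obs_mean has one entry per observation.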

I would approach it by using a regular DataFrame, but adding columns for the observation and experiment number.

import numpy as np
import pandas as pd

a = np.random.rand(5, 10, 2)

# Get the shape
n_experiments, n_observations, n_values = a.shape

# Reshape array into a 2-dimensional array
# (stacking experiments on top of each other)
a = a.reshape(-1, n_values)

# Create Dataframe and add experiment and observation number
df = pd.DataFrame(a, columns=["mean", "uncertainty"])

# Each experiment number is repeated n_observations times:
# [0, 0, ..., 0, 1, 1, ..., 1, ..., 4, 4, ..., 4]
experiment = np.repeat(range(n_experiments), n_observations)
df["experiment"] = experiment
# The observation numbers cycle once per experiment:
# [0, 1, ..., 9, 0, 1, ..., 9, ...]
observation = np.tile(range(n_observations), n_experiments)
df["observation"] = observation

The DataFrame now looks like this:

print(df.head(15))

      mean  uncertainty  experiment  observation
0   0.741436     0.775086           0            0
1   0.401934     0.277716           0            1
2   0.148269     0.406040           0            2
3   0.852485     0.702986           0            3
4   0.240930     0.644746           0            4
5   0.309648     0.914761           0            5
6   0.479186     0.495845           0            6
7   0.154647     0.422658           0            7
8   0.381012     0.756473           0            8
9   0.939797     0.764821           0            9
10  0.994342     0.019140           1            0
11  0.300225     0.992146           1            1
12  0.265698     0.823469           1            2
13  0.791907     0.555051           1            3
14  0.503281     0.249237           1            4
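
As a possible alternative to building the two label columns by hand with np.repeat and np.tile, the same labels can be generated with pd.MultiIndex.from_product. This is only a sketch under the same assumptions as above (random stand-in data, 10 observations); it relies on the default C-order reshape, in which the experiment axis varies slowest.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.random((5, 10, 2))  # stand-in data
n_experiments, n_observations, n_values = a.shape

# from_product yields (0, 0), (0, 1), ..., (0, 9), (1, 0), ...
# which matches the row order produced by a.reshape(-1, n_values)
index = pd.MultiIndex.from_product(
    [range(n_experiments), range(n_observations)],
    names=["experiment", "observation"],
)
df_indexed = pd.DataFrame(
    a.reshape(-1, n_values), index=index, columns=["mean", "uncertainty"]
)

# Turn the index levels back into ordinary columns if the flat layout is preferred
df_flat = df_indexed.reset_index()
```

With the labels in the index, df_indexed.groupby(level="observation").mean() gives the per-observation averages without selecting columns first.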

Now you can analyze the DataFrame (with groupby and mean):

# Only the mean 
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).mean())


                 mean  uncertainty
observation                       
0            0.699324     0.506369
1            0.382288     0.456324
2            0.333396     0.324469
3            0.690545     0.564583
4            0.365198     0.555231
5            0.453545     0.596149
6            0.526988     0.395162
7            0.565689     0.569904
8            0.425595     0.415944
9            0.731776     0.375612

Or with more advanced aggregate functions, which are probably useful for your use case:

# Use aggregate function to calculate not only mean, but min and max as well
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).aggregate(['mean', 'min', 'max']))



                 mean                     uncertainty                    
                 mean       min       max        mean       min       max
observation                                                              
0            0.699324  0.297030  0.994342    0.506369  0.019140  0.974842
1            0.382288  0.063046  0.810411    0.456324  0.108774  0.992146
2            0.333396  0.148269  0.698921    0.324469  0.009539  0.823469
3            0.690545  0.175471  0.895190    0.564583  0.260557  0.721265
4            0.365198  0.015501  0.726352    0.555231  0.249237  0.929258
5            0.453545  0.111355  0.807582    0.596149  0.101421  0.914761
6            0.526988  0.323945  0.786167    0.395162  0.007105  0.691998
7            0.565689  0.154647  0.813336    0.569904  0.302157  0.964782
8            0.425595  0.116968  0.567544    0.415944  0.014439  0.756473
9            0.731776  0.411324  0.939797    0.375612  0.085988  0.764821
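
As a sanity check, the groupby mean per observation should agree with averaging the original array over its first axis. The sketch below verifies that on random stand-in data. It also shows one optional way to combine the uncertainties, in quadrature, which is an assumption that only applies if the five uncertainties are independent standard errors; the question itself asks for a plain mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.random((5, 10, 2))  # stand-in data
n_experiments, n_observations, n_values = a.shape

df = pd.DataFrame(a.reshape(-1, n_values), columns=["mean", "uncertainty"])
df["observation"] = np.tile(np.arange(n_observations), n_experiments)

# Per-observation averages via pandas ...
grouped = df.groupby("observation")[["mean", "uncertainty"]].mean()
# ... and directly in numpy, averaging over the experiment axis
direct = a.mean(axis=0)  # shape (n_observations, 2)
assert np.allclose(grouped.to_numpy(), direct)

# ASSUMPTION: independent uncertainties, combined in quadrature to give
# the uncertainty of the averaged value. Not part of the original answers.
combined = np.sqrt((a[:, :, 1] ** 2).sum(axis=0)) / n_experiments
```

The quadrature value is never larger than the plain mean of the uncertainties, so which one is appropriate depends on how the repetitions relate to each other.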
