
Convert pandas column of numpy arrays to numpy array of higher dimension

I have a pandas DataFrame of shape (75, 9).

Only one of those columns holds numpy arrays, each of which has shape (100, 4, 3).

I am seeing a strange phenomenon:

data = self.df[self.column_name].values[0]

is of shape (100, 4, 3), but

data = self.df[self.column_name].values

is of shape (75,), and its min and max are 'not a numeric object'.

I expected self.df[self.column_name].values to have shape (75, 100, 4, 3), with a numeric min and max.

How can I make a column of numpy arrays behave like a numpy array of a higher dimension (with length = number of rows in the DataFrame)?


Reproducing:

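    import numpy as np
    import pandas as pd

    # Build a one-column DataFrame whose cells each hold a (4, 6) array
    some_df = pd.DataFrame(columns=['A'])
    for i in range(10):
        some_df.loc[i] = [np.random.rand(4, 6)]
    print(some_df['A'].values.shape)     # shape of the whole column
    print(some_df['A'].values[0].shape)  # shape of a single cell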

This prints (10L,) and (4L, 6L) instead of the desired (10L, 4L, 6L) and (4L, 6L).

What you're asking for is not quite possible. Pandas DataFrames are 2D. Yes, you can store NumPy arrays as objects (references) inside DataFrame cells, but this is not really well supported, and you cannot get a shape that takes one dimension from the DataFrame and two from the arrays inside.

You should consider storing your data either entirely in NumPy arrays of the appropriate shape, or in a single, properly 2D DataFrame with a MultiIndex. For example, you can "pivot" a column of 1D arrays to become a column of scalars if you move the extra dimension to a new level of a MultiIndex on the rows:

  A
x [2, 3]
y [5, 6]

becomes:

    A
x 0 2
  1 3
y 0 5
  1 6

or pivot to the columns:

  A
  0 1
x 2 3
y 5 6

In [42]: some_df = pd.DataFrame(columns=['A'])
    ...: for i in range(4): 
    ...:         some_df.loc[i] = [np.random.randint(0,10,(1,3))] 
    ...:                                                                                  
In [43]: some_df                                                                          
Out[43]: 
             A
0  [[7, 0, 9]]
1  [[3, 6, 8]]
2  [[9, 7, 6]]
3  [[1, 6, 3]]

The numpy values of the column are an object dtype array, containing arrays:

In [44]: some_df['A'].to_numpy()                                                          
Out[44]: 
array([array([[7, 0, 9]]), array([[3, 6, 8]]), array([[9, 7, 6]]),
       array([[1, 6, 3]])], dtype=object)

If those arrays all have the same shape, np.stack does a nice job of concatenating them on a new dimension:

In [45]: np.stack(some_df['A'].to_numpy())                                                
Out[45]: 
array([[[7, 0, 9]],

       [[3, 6, 8]],

       [[9, 7, 6]],

       [[1, 6, 3]]])
In [46]: _.shape                                                                          
Out[46]: (4, 1, 3)

This only works with one column. np.stack, like all the concatenate functions, treats its input argument as an iterable, effectively a list of arrays:

In [48]: some_df['A'].to_list()                                                           
Out[48]: 
[array([[7, 0, 9]]),
 array([[3, 6, 8]]),
 array([[9, 7, 6]]),
 array([[1, 6, 3]])]
In [50]: np.stack(some_df['A'].to_list()).shape                                           
Out[50]: (4, 1, 3)
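
Applied to the shapes in the question, the same idea might look like the sketch below. The random data, the column name "A", and the shape check are assumptions added for illustration; only the np.stack call comes from the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the question's column: 75 cells, each a (100, 4, 3) array.
cells = np.empty(75, dtype=object)
for i in range(75):
    cells[i] = rng.random((100, 4, 3))
df = pd.DataFrame({"A": cells})

# The column itself is a 1D object array, so its shape is only (75,).
column = df["A"].to_numpy()

# Guard (an addition, not part of the answer): np.stack needs identical cell shapes.
shapes = {cell.shape for cell in column}
if len(shapes) != 1:
    raise ValueError(f"cannot stack cells with differing shapes: {shapes}")

# Stack along a new leading axis to get one numeric array of shape (75, 100, 4, 3).
data = np.stack(column)
print(column.shape, column.dtype)
print(data.shape, data.dtype, data.min(), data.max())
```

Because the stacked result has a numeric dtype, min and max behave as the question expected. Note that the stacked array is a copy, so later changes to the DataFrame cells are not reflected in it.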
