Convert pandas column of numpy arrays to numpy array of higher dimension
I have a pandas DataFrame of shape (75, 9).
Only one of those columns holds numpy arrays, each of which has shape (100, 4, 3).
I see a strange phenomenon:
data = self.df[self.column_name].values[0]
is of shape (100, 4, 3), but
data = self.df[self.column_name].values
is of shape (75,), and calling min and max on it fails with 'not a numeric object'.
I expected data = self.df[self.column_name].values to be of shape (75, 100, 4, 3), with a usable min and max.
How can I make a column of numpy arrays behave like a numpy array of a higher dimension (with length equal to the number of rows in the DataFrame)?
Reproducing:
import numpy as np
import pandas as pd

some_df = pd.DataFrame(columns=['A'])
for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6)]
print(some_df['A'].values.shape)
print(some_df['A'].values[0].shape)
This prints (10L,) and (4L, 6L) instead of the desired (10L, 4L, 6L) and (4L, 6L).
What you're asking for is not quite possible. Pandas DataFrames are 2D. Yes, you can store NumPy arrays as objects (references) inside DataFrame cells, but this is not really well supported, and expecting to get a shape which has one dimension from the DataFrame and two from the arrays inside is not possible at all.
You should consider storing your data either entirely in NumPy arrays of the appropriate shape, or in a single, properly 2D DataFrame with a MultiIndex. For example, you can "pivot" a column of 1D arrays to become a column of scalars if you move the extra dimension to a new level of a MultiIndex on the rows:
        A
x  [2, 3]
y  [5, 6]
becomes:
     A
x 0  2
  1  3
y 0  5
  1  6
or pivot to the columns:
   A
   0  1
x  2  3
y  5  6
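Both pivots can be sketched in a few lines. This is only an illustration under assumptions: the frame `df` and the names `values`, `wide` and `long` are made up here to mirror the small x/y example, and every cell is assumed to hold a 1D array of the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the small example: one column of 1D arrays.
df = pd.DataFrame({"A": [np.array([2, 3]), np.array([5, 6])]}, index=["x", "y"])

# Turn the object column into a regular 2D array of shape (2, 2).
values = np.stack(df["A"].to_numpy())
positions = range(values.shape[1])

# Pivot to the rows: the array position becomes a second index level.
long = pd.DataFrame(
    {"A": values.ravel()},
    index=pd.MultiIndex.from_product([df.index, positions]),
)

# Pivot to the columns: one scalar column per array position.
wide = pd.DataFrame(
    values,
    index=df.index,
    columns=pd.MultiIndex.from_product([["A"], positions]),
)

print(long)
print(wide)
```

Either layout keeps every cell a scalar, so ordinary reductions such as `long["A"].min()` work without touching object dtype.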
In [42]: some_df = pd.DataFrame(columns=['A'])
    ...: for i in range(4):
    ...:     some_df.loc[i] = [np.random.randint(0,10,(1,3))]
    ...:
In [43]: some_df
Out[43]:
             A
0  [[7, 0, 9]]
1  [[3, 6, 8]]
2  [[9, 7, 6]]
3  [[1, 6, 3]]
The numpy values of the column are an object dtype array, containing arrays:
In [44]: some_df['A'].to_numpy()
Out[44]:
array([array([[7, 0, 9]]), array([[3, 6, 8]]), array([[9, 7, 6]]),
       array([[1, 6, 3]])], dtype=object)
If those arrays all have the same shape, np.stack does a nice job of concatenating them on a new dimension:
In [45]: np.stack(some_df['A'].to_numpy())
Out[45]:
array([[[7, 0, 9]],
       [[3, 6, 8]],
       [[9, 7, 6]],
       [[1, 6, 3]]])
In [46]: _.shape
Out[46]: (4, 1, 3)
This only works with one column. stack, like all the concatenate functions, treats the input argument as an iterable, effectively a list of arrays.
In [48]: some_df['A'].to_list()
Out[48]:
[array([[7, 0, 9]]),
array([[3, 6, 8]]),
array([[9, 7, 6]]),
array([[1, 6, 3]])]
In [50]: np.stack(some_df['A'].to_list()).shape
Out[50]: (4, 1, 3)
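Applied to the shapes in the question, the same idea might look like the sketch below. The data is randomly generated stand-in data, and `column_to_ndarray` is a hypothetical helper name, not a pandas or numpy function. Because np.stack needs equally shaped inputs, the helper checks the cell shapes first.

```python
import numpy as np
import pandas as pd

# Stand-in data with the shapes from the question: 75 rows, each cell
# of column "A" holding a (100, 4, 3) array. Filling a 1D object array
# cell by cell keeps numpy from merging the cells into one 4D array.
rng = np.random.default_rng(0)
cells = np.empty(75, dtype=object)
for i in range(75):
    cells[i] = rng.random((100, 4, 3))
df = pd.DataFrame({"A": cells})


def column_to_ndarray(series):
    """Stack a Series of equally shaped arrays (hypothetical helper)."""
    items = series.to_list()
    shapes = {np.shape(item) for item in items}
    if len(shapes) != 1:
        raise ValueError(f"cannot stack cells with differing shapes: {shapes}")
    return np.stack(items)


data = column_to_ndarray(df["A"])
print(data.shape)  # (75, 100, 4, 3)
print(data.min(), data.max())  # numeric reductions now work
```

The stacked result is a copy with a numeric dtype, so changes to `data` are not written back to the DataFrame cells.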