Pandas - group by column and transform the data to numpy array
Given the following data frame, where group A has 4 samples, B has 3 samples and C has 1 sample:
group data_1 data_2
0 A 1 4
1 A 2 5
2 A 3 6
3 A 4 7
4 B 1 4
5 B 2 5
6 B 3 6
7 C 1 4
I would like to transform the data into a numpy array where each row is a group with all its samples, zero-padded for groups that have fewer samples than the largest group.
The result should be an array like this:
[
[[1,4],[2,5],[3,6],[4,7]], # this is A group 4 samples
[[1,4],[2,5],[3,6],[0,0]], # this is B group 3 samples
[[1,4],[0,0],[0,0],[0,0]], # this is C group 1 sample
]
First, the missing values need to be added. The first solution does this with unstack and stack, using a counter Series created by cumcount.
The second solution uses reindex with a MultiIndex.
Finally, use a lambda function with groupby, convert each group to a numpy array with values, and then convert the result to lists:
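# Counter within each group: 0, 1, 2, ... for the rows of that group
g = df.groupby('group').cumcount()
L = (df.set_index(['group', g])
       .unstack(fill_value=0)   # wide format, missing positions filled with 0
       .stack()                 # back to long format, now padded
       .groupby(level=0)
       .apply(lambda x: x.values.tolist())
       .tolist())
print(L)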
[[[1, 4], [2, 5], [3, 6], [4, 7]],
[[1, 4], [2, 5], [3, 6], [0, 0]],
[[1, 4], [0, 0], [0, 0], [0, 0]]]
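Since the question asks for a numpy array rather than nested lists, here is a minimal self-contained sketch of the same unstack/stack approach with the result wrapped in np.array. The frame is rebuilt by hand from the sample in the question, and the name arr is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, rebuilt by hand (assumption: integer data)
df = pd.DataFrame({
    'group':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'data_1': [1, 2, 3, 4, 1, 2, 3, 1],
    'data_2': [4, 5, 6, 7, 4, 5, 6, 4],
})

# Position of each row inside its group
g = df.groupby('group').cumcount()

L = (df.set_index(['group', g])
       .unstack(fill_value=0)
       .stack()
       .groupby(level=0)
       .apply(lambda x: x.values.tolist())
       .tolist())

# Every group now has the same number of rows, so the nested list is rectangular
arr = np.array(L)
print(arr.shape)  # (groups, samples per group, data columns)
```

Because the padding makes every group the same length, np.array can build a regular 3-D array here; with ragged lists it could not.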
Another solution:
g = df.groupby('group').cumcount()
# Every (group, counter) combination, including the missing ones
mux = pd.MultiIndex.from_product([df['group'].unique(), g.unique()])
L = (df.set_index(['group', g])
       .reindex(mux, fill_value=0)
       .groupby(level=0)[['data_1', 'data_2']]   # list of columns, not a tuple
       .apply(lambda x: x.values.tolist())
       .tolist())
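If a numpy array is the goal, the groupby and lambda step can possibly be skipped. The sketch below is a variation on the reindex solution, not part of the original answer: after reindexing, the rows are already ordered group by group, so the values can be reshaped directly. The frame is rebuilt by hand from the question, and the names groups, positions, padded and arr are illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, rebuilt by hand
df = pd.DataFrame({
    'group':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'data_1': [1, 2, 3, 4, 1, 2, 3, 1],
    'data_2': [4, 5, 6, 7, 4, 5, 6, 4],
})

g = df.groupby('group').cumcount()
groups = df['group'].unique()   # groups in order of first appearance
positions = g.unique()          # 0 .. size of the largest group - 1

# Full grid of (group, position) pairs; from_product is group-major
mux = pd.MultiIndex.from_product([groups, positions])

# Missing (group, position) rows are created and filled with 0
padded = df.set_index(['group', g]).reindex(mux, fill_value=0)

# Rows are already grouped, so a reshape yields (groups, samples, columns)
arr = padded.to_numpy().reshape(len(groups), len(positions), -1)
print(arr.shape)
```

Note that this keeps the groups in order of first appearance, whereas groupby sorts them by key; for the sample data both orders are the same.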