
How to reshape a high-dimensional data frame from long to wide for subsequent dimension reduction + visualization?

I have a data frame that looks like the following:

index    attribute  score
user_1   a          0.144228
user_1   b          0.980685
user_1   c          0.165716
user_2   a          0.795340
user_2   b          0.903498
user_3   d          0.193492
user_3   e          0.900509

Here's the reproducible code:

import numpy as np
import pandas as pd

# Seven random scores in [0, 1), one per (user, attribute) row
df = pd.DataFrame({'index': ['user_1', 'user_1', 'user_1', 'user_2', 'user_2', 'user_3', 'user_3'],
                   'attribute': ['a', 'b', 'c', 'a', 'b', 'd', 'e'],
                   'score': np.random.rand(7)})

df.set_index('index', inplace=True)

I'd like to unstack/pivot this table so that each attribute value becomes a column header, with one row per user.

Now, this is fairly easy, except that I have 350K dimensions and, as you can see from the above example, not every user has a score for each dimension.
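To make the scale concrete, here is a rough back-of-the-envelope sketch of what a dense wide table would cost in memory. The attribute count comes from the question; the user count is not given anywhere in the post, so `n_users` below is a purely hypothetical value chosen for illustration.

```python
# Rough size of a dense float64 matrix with one row per user and
# one column per attribute.
n_attributes = 350_000      # from the question
n_users = 100_000           # HYPOTHETICAL: the post does not state the user count
bytes_per_float64 = 8

dense_bytes = n_users * n_attributes * bytes_per_float64
dense_gib = dense_bytes / 2**30

print(f"dense wide table: about {dense_gib:.0f} GiB")
```

Because most user/attribute pairs have no score, nearly all of those cells would hold NaN, which is why a dense pivot can exhaust memory even when the long table itself is modest in size.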

I've tried the standard pandas pd.pivot_table() and .unstack() functions, but my kernel invariably dies when I attempt to do so. I subsequently attempted the same with dask, saving the output to a CSV via

dask.dataframe.reshape.pivot_table(df, index='index', columns='attribute', values='score').to_csv('df.csv')

but that crashed too, yielding the following error:

KilledWorker: ("('pivot_table_count-chunk-c31649485f27d5f8670393d66e2d14ac', 0, 3, 0)", <Worker 'tcp://127.0.0.1:56298', name: 0, memory: 0, processing: 5>)

I'm currently at a loss. How can I reshape a high-dimensional dataset for subsequent dimension reduction, clustering, and visualization?

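# Assumes 'index' is still a regular column, i.e. this runs before set_index('index', inplace=True)
df.pivot(index='index', columns='attribute').reset_index().droplevel(0, axis=1)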


           attribute    a         b         c         d         e
0          user_1  0.144228  0.980685  0.165716       NaN       NaN
1          user_2  0.795340  0.903498       NaN       NaN       NaN
2          user_3       NaN       NaN       NaN  0.193492  0.900509
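A dense pivot like this stores every user/attribute cell, including the missing ones. With 350K attributes, one possible alternative (not part of the original answer) is to build a SciPy sparse matrix straight from the long table. The sketch below uses the toy data from the question with fixed scores. The names `long_df`, `wide_sparse`, `row_labels`, and `col_labels` are illustrative. It assumes each (user, attribute) pair occurs at most once, and that treating a missing score as 0 is acceptable downstream.

```python
import pandas as pd
from scipy import sparse

# Toy long-format frame mirroring the question (fixed scores instead of random ones).
long_df = pd.DataFrame({
    "index": ["user_1", "user_1", "user_1", "user_2", "user_2", "user_3", "user_3"],
    "attribute": ["a", "b", "c", "a", "b", "d", "e"],
    "score": [0.144228, 0.980685, 0.165716, 0.795340, 0.903498, 0.193492, 0.900509],
})

# Map each user and each attribute to an integer position.
users = pd.Categorical(long_df["index"])
attrs = pd.Categorical(long_df["attribute"])

# Store only the observed (user, attribute) pairs.
# Caveats: absent pairs read back as 0.0 rather than NaN, and
# duplicate pairs would be summed by the constructor.
wide_sparse = sparse.csr_matrix(
    (long_df["score"].to_numpy(), (users.codes, attrs.codes)),
    shape=(len(users.categories), len(attrs.categories)),
)

# Labels for mapping matrix rows/columns back to users and attributes.
row_labels = list(users.categories)
col_labels = list(attrs.categories)

print(wide_sparse.shape, wide_sparse.nnz)
```

The resulting matrix can be passed to reducers that accept sparse input, for example scikit-learn's TruncatedSVD, without ever materialising the full dense table. Whether zero-filling is appropriate depends on what a missing score means in the data.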
