How to reshape a high-dimensional data frame from long to wide for subsequent dimension reduction and visualization?
I have a data frame that looks like the following:
index | attribute | score |
---|---|---|
user_1 | a | 0.144228 |
user_1 | b | 0.980685 |
user_1 | c | 0.165716 |
user_2 | a | 0.795340 |
user_2 | b | 0.903498 |
user_3 | d | 0.193492 |
user_3 | e | 0.900509 |
Here's the reproducible code:
import numpy as np
import pandas as pd

# Long format: one row per (user, attribute) pair with a random score
df = pd.DataFrame({'index': ['user_1', 'user_1', 'user_1', 'user_2', 'user_2', 'user_3', 'user_3'],
                   'attribute': ['a', 'b', 'c', 'a', 'b', 'd', 'e'],
                   'score': np.random.rand(7)})
df.set_index('index', inplace=True)
I'd like to unstack/pivot this table so that the attribute values become the column headers, with one row per user.
Now, this is fairly easy, except that I have 350K dimensions, and as you can see from the example above, not every user has a score for every dimension.
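A rough back-of-the-envelope estimate shows why the dense reshape is expensive. The sketch below assumes float64 scores and a purely hypothetical count of 100,000 users, since the question does not state the real number; only the 350K attribute count comes from the question.

```python
# Rough size of a fully dense wide table (NaN wherever a user has no score).
n_attributes = 350_000   # from the question
n_users = 100_000        # hypothetical; the real user count is not given
bytes_per_value = 8      # float64

dense_bytes = n_users * n_attributes * bytes_per_value
dense_gb = dense_bytes / 1e9
print(f"dense wide table: ~{dense_gb:.0f} GB")

# In the seven-row example, only 7 of the 3 x 5 = 15 wide cells hold a score.
toy_density = 7 / (3 * 5)
print(f"filled cells in the toy example: {toy_density:.0%}")
```

Under these assumptions almost all of that memory would be spent on NaN cells, which is why a representation that stores only the observed scores is worth considering.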
I've tried the standard pandas pd.pivot_table() and .unstack() functions, but my kernel invariably dies when I attempt to do so. I subsequently tried to do it with dask, saving the output to a CSV via
dask.dataframe.reshape.pivot_table(df, index='index', columns='attribute', values='score').to_csv('df.csv')
but that crashed too, yielding the following error:
KilledWorker: ("('pivot_table_count-chunk-c31649485f27d5f8670393d66e2d14ac', 0, 3, 0)", <Worker 'tcp://127.0.0.1:56298', name: 0, memory: 0, processing: 5>)
I'm currently at a loss. How can I reshape a high-dimensional dataset for subsequent dimension reduction, clustering, and visualization?
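# 'index' was moved into the DataFrame index by set_index above, so restore it as a column before pivoting
df.reset_index().pivot(index='index', columns='attribute').reset_index().droplevel(0, axis=1)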
attribute                 a         b         c         d         e
0          user_1  0.144228  0.980685  0.165716       NaN       NaN
1          user_2  0.795340  0.903498       NaN       NaN       NaN
2          user_3       NaN       NaN       NaN  0.193492  0.900509
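With 350K attributes, the dense result of `pivot` may not fit in memory for the same reason `pivot_table` and `unstack` failed. One possible alternative, sketched here and not part of the answer above, is to build a SciPy sparse matrix directly from the long table and pass it to a reducer that accepts sparse input. The choice of `scipy.sparse.coo_matrix` and `scipy.sparse.linalg.svds`, the fixed scores, and all variable names are assumptions for illustration. Note that a sparse matrix treats a missing score as 0 rather than NaN.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.sparse.linalg import svds

# Long-format data as in the question (fixed scores so the run is repeatable).
long_df = pd.DataFrame({
    'index': ['user_1', 'user_1', 'user_1', 'user_2', 'user_2', 'user_3', 'user_3'],
    'attribute': ['a', 'b', 'c', 'a', 'b', 'd', 'e'],
    'score': [0.144228, 0.980685, 0.165716, 0.795340, 0.903498, 0.193492, 0.900509],
})

# Map each user and attribute label to an integer position (sorted labels).
row_codes, users = pd.factorize(long_df['index'], sort=True)
col_codes, attributes = pd.factorize(long_df['attribute'], sort=True)

# Store only the observed (row, column, value) triples, then convert to CSR.
wide = sparse.coo_matrix(
    (long_df['score'].to_numpy(), (row_codes, col_codes)),
    shape=(len(users), len(attributes)),
).tocsr()

# Truncated SVD on the sparse matrix; k must be smaller than min(wide.shape).
u, s, vt = svds(wide, k=2)
embedding = u * s  # one 2-dimensional row per user

print(wide.shape, wide.nnz)
print(embedding.shape)
```

Rows of `embedding` follow the order of `users`, so the labels can be reattached for clustering or plotting. Duplicate (user, attribute) pairs are summed when the COO matrix is converted to CSR, so deduplicate the long table first if that is not the intended behavior.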