pandas pivot table - rearrange

I have a pandas data frame with some columns. I want to rearrange them in a different way. An example is below:

time,name,feature,value
33 20 May 2016 14:00:00 -0700,John,badL,2
45 19 May 2016 18:00:00 -0700,John,badL,1
120 17 May 2016 11:00:00 -0700,John,badL,1
220 20 May 2016 14:00:00 -0700,John,totalL,20
450 19 May 2016 18:00:00 -0700,John,totalL,15
330 18 May 2016 15:00:00 -0700,Mary,badL,2
330 18 May 2016 15:00:00 -0700,Mary,totalL,20
550 21 May 2016 12:00:00 -0700,Mary,adCmd,4
700 22 May 2016 16:00:00 -0700,Mary,PC,3
800 22 May 2016 16:00:00 -0700,Mary,eCon,200

Note: the first column value (time) is preceded by index values (33, 45, 120, ...). From the above data frame, I want the resulting data frame to be:

time,name,badL,totalL,adCmd,PC,eCon
20 May 2016 14:00:00 -0700,John,2,20,0,0,0
19 May 2016 18:00:00 -0700,John,1,15,0,0,0
17 May 2016 11:00:00 -0700,John,1,0,0,0,0
18 May 2016 15:00:00 -0700,Mary,2,20,0,0,0
21 May 2016 12:00:00 -0700,Mary,0,0,4,0,0
22 May 2016 16:00:00 -0700,Mary,0,0,0,3,200

NOTE: for 17 May, John did not have any totalL, so I filled it with 0.

Is there an elegant way to do this? I am converting the time field with pd.to_datetime and then comparing, which looks tedious. For the above example, I have only two 'features' (badL, totalL). I will have several more later.

This is what I have, but it adds a different row for the second feature (totalL) rather than putting it in the same row.

for f in ['badL', 'totalL']:
    dff = df[df.feature == f]
    print(dff)
    if len(dff.index) > 0:
        fullFeatureDf[f] = dff.feature_value

Setup

from io import StringIO
import pandas as pd

text = '''time,name,f1,value
20 May 2016 14:00:00 -0700,John,badL,2
19 May 2016 18:00:00 -0700,John,badL,1
17 May 2016 11:00:00 -0700,John,badL,1
20 May 2016 14:00:00 -0700,John,totalL,20
19 May 2016 18:00:00 -0700,John,totalL,15
17 May 2016 11:00:00 -0700,John,totalL,12
'''

df = pd.read_csv(StringIO(text))

print(df)

                         time  name      f1  value
0  20 May 2016 14:00:00 -0700  John    badL      2
1  19 May 2016 18:00:00 -0700  John    badL      1
2  17 May 2016 11:00:00 -0700  John    badL      1
3  20 May 2016 14:00:00 -0700  John  totalL     20
4  19 May 2016 18:00:00 -0700  John  totalL     15
5  17 May 2016 11:00:00 -0700  John  totalL     12

Solution using unstack

df = df.set_index(['time', 'name', 'f1'])

print(df)

                                        value
time                       name f1           
20 May 2016 14:00:00 -0700 John badL        2
19 May 2016 18:00:00 -0700 John badL        1
17 May 2016 11:00:00 -0700 John badL        1
20 May 2016 14:00:00 -0700 John totalL     20
19 May 2016 18:00:00 -0700 John totalL     15
17 May 2016 11:00:00 -0700 John totalL     12

Then unstack to perform the pivot. It takes the innermost level of the row index (f1 here) and moves it into the columns.

print(df.unstack())

                                value       
f1                               badL totalL
time                       name             
17 May 2016 11:00:00 -0700 John     1     12
19 May 2016 18:00:00 -0700 John     1     15
20 May 2016 14:00:00 -0700 John     2     20
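The setup data above has a value for every time/feature pair, whereas the question also has missing pairs that should become 0. A minimal sketch of how unstack can handle that, using a made-up three-row frame (the shortened time strings are only for illustration) and the fill_value argument of Series.unstack:

```python
import pandas as pd

# Hypothetical toy data: John has no totalL row for 17 May.
df = pd.DataFrame({
    "time": ["20 May", "20 May", "17 May"],
    "name": ["John", "John", "John"],
    "f1": ["badL", "totalL", "badL"],
    "value": [2, 20, 1],
})

# Select the value column as a Series, then move the innermost index
# level (f1) into the columns; missing combinations are filled with 0.
wide = df.set_index(["time", "name", "f1"])["value"].unstack(fill_value=0)

# Turn time and name back into ordinary columns and drop the "f1" label.
wide = wide.reset_index()
wide.columns.name = None

print(wide)
```

Unstacking the value Series rather than the whole frame avoids the extra "value" level in the column header.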

In spirit, this is identical to Yakym Pirozhenko's solution, just a slightly different way of doing it. This is more intuitive to me but may not be to you.
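As a rough check of that equivalence, the sketch below builds the same wide table both ways on a small made-up frame and compares the results. It assumes pandas 1.1 or later, where DataFrame.pivot accepts a list for index; the shortened time strings are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with one value per (time, name, f1) combination.
df = pd.DataFrame({
    "time": ["20 May", "19 May", "20 May", "19 May"],
    "name": ["John", "John", "John", "John"],
    "f1": ["badL", "badL", "totalL", "totalL"],
    "value": [2, 1, 20, 15],
})

# Route 1: build a MultiIndex, then unstack the innermost level.
via_unstack = df.set_index(["time", "name", "f1"])["value"].unstack()

# Route 2: let pivot do the same reshaping in one call.
via_pivot = df.pivot(index=["time", "name"], columns="f1", values="value")

# Compare cell values and column labels of the two results.
same_values = (via_unstack.values == via_pivot.values).all()
same_columns = list(via_unstack.columns) == list(via_pivot.columns)
print(same_values, same_columns)
```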

This is a job for df.pivot:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(
'''
time,name,feature,value
33 20 May 2016 14:00:00 -0700,John,badL,2
45 19 May 2016 18:00:00 -0700,John,badL,1
120 17 May 2016 11:00:00 -0700,John,badL,1
220 20 May 2016 14:00:00 -0700,John,totalL,20
450 19 May 2016 18:00:00 -0700,John,totalL,15
330 18 May 2016 15:00:00 -0700,Mary,badL,2
330 18 May 2016 15:00:00 -0700,Mary,totalL,20
550 21 May 2016 12:00:00 -0700,Mary,adCmd,4
700 22 May 2016 16:00:00 -0700,Mary,PC,3
800 22 May 2016 16:00:00 -0700,Mary,eCon,200
'''), sep=',').set_index(['time', 'name'])

df_new = df.pivot(columns='feature').fillna(0).astype(int)

#                                     value
# feature                                PC adCmd badL eCon totalL
# time                           name
# 120 17 May 2016 11:00:00 -0700 John     0     0    1    0      0
# 220 20 May 2016 14:00:00 -0700 John     0     0    0    0     20
# 33 20 May 2016 14:00:00 -0700  John     0     0    2    0      0
# 330 18 May 2016 15:00:00 -0700 Mary     0     0    2    0     20
# 45 19 May 2016 18:00:00 -0700  John     0     0    1    0      0
# 450 19 May 2016 18:00:00 -0700 John     0     0    0    0     15
# 550 21 May 2016 12:00:00 -0700 Mary     0     4    0    0      0
# 700 22 May 2016 16:00:00 -0700 Mary     3     0    0    0      0
# 800 22 May 2016 16:00:00 -0700 Mary     0     0    0  200      0
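In the output above, the leading index number is still part of each time string, so rows that share a timestamp (for example John's badL and totalL on 20 May) are not merged. A minimal sketch of one way to get the layout asked for in the question, assuming the leading number is a leftover row label that can simply be stripped with a regular expression, and using pivot_table with fill_value=0:

```python
import pandas as pd
from io import StringIO

raw = '''time,name,feature,value
33 20 May 2016 14:00:00 -0700,John,badL,2
45 19 May 2016 18:00:00 -0700,John,badL,1
120 17 May 2016 11:00:00 -0700,John,badL,1
220 20 May 2016 14:00:00 -0700,John,totalL,20
450 19 May 2016 18:00:00 -0700,John,totalL,15
330 18 May 2016 15:00:00 -0700,Mary,badL,2
330 18 May 2016 15:00:00 -0700,Mary,totalL,20
550 21 May 2016 12:00:00 -0700,Mary,adCmd,4
700 22 May 2016 16:00:00 -0700,Mary,PC,3
800 22 May 2016 16:00:00 -0700,Mary,eCon,200
'''
df = pd.read_csv(StringIO(raw))

# Assumption: the first number in each time string is a stray row label.
df["time"] = df["time"].str.replace(r"^\d+\s+", "", regex=True)

# One row per (time, name), one column per feature, 0 where nothing was recorded.
wide = (
    df.pivot_table(index=["time", "name"], columns="feature",
                   values="value", aggfunc="sum", fill_value=0)
      .astype(int)
      .reset_index()
)

# Reorder the columns to match the layout in the question.
wide = wide[["time", "name", "badL", "totalL", "adCmd", "PC", "eCon"]]
wide.columns.name = None

print(wide)
```

pivot_table is used here instead of pivot because it aggregates duplicates rather than raising an error; with one value per time/name/feature combination, aggfunc="sum" just passes the value through.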
