
Extracting new columns with counts out of pandas data frame groupby

I am dealing with a pandas dataframe like this one:

     Day  Hour         Prio  Value
0      1     6     Critical      1
1      1    16     Critical      1
2      1    17      Content      1
3      1    17          Low      1
6      1    19     Critical      1
7      1    20         High      1
8      2    10         High      1
9      2    10          Low      2

Now I want to group by Day and Hour while generating new columns that hold the count for each value in the column Prio, which is currently stored in the column Value. So I want to achieve this structure:

     Day  Hour  Critical  Content  Low  High
0      1     6         1        0    0     0
1      1    16         1        0    0     0
2      1    17         0        1    1     0
6      1    19         1        0    0     0
7      1    20         0        0    0     1
8      2    10         0        0    2     1

I have tried different approaches, but none has been successful. My goal is to merge this data frame with another one that contains other columns by Day and Hour, in order to aggregate them further. In particular, I need the percentage shares per day/hour among the priorities (at least one non-zero value is always present).

In a past solution I iterated over each row to extract the single values, but this was rather slow. I want to keep it as efficient as possible, as the data should update live within a bokeh server app. Maybe there is a solution that does not use itertuples or something similar? Thank you!

df.groupby(['Day','Hour','Prio']).sum().unstack().fillna(0).astype(int)
#           Value                  
#Prio     Content Critical High Low
#Day Hour                          
#1   6          0        1    0   0
#    16         0        1    0   0
#    17         1        0    0   1
#    19         0        1    0   0
#    20         0        0    1   0
#2   10         0        0    1   2

You can further reset the index, if you want.
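As a minimal sketch of that follow-up step (the sample frame is rebuilt from the question, and the name `wide` is only illustrative), selecting the `Value` column before unstacking avoids the extra `Value` column level, and `reset_index` then turns `Day` and `Hour` back into ordinary columns:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Day':   [1, 1, 1, 1, 1, 1, 2, 2],
    'Hour':  [6, 16, 17, 17, 19, 20, 10, 10],
    'Prio':  ['Critical', 'Critical', 'Content', 'Low',
              'Critical', 'High', 'High', 'Low'],
    'Value': [1, 1, 1, 1, 1, 1, 1, 2],
}, index=[0, 1, 2, 3, 6, 7, 8, 9])

wide = (
    df.groupby(['Day', 'Hour', 'Prio'])['Value']
      .sum()                      # one total per (Day, Hour, Prio)
      .unstack(fill_value=0)      # Prio values become columns, gaps become 0
      .reset_index()              # Day and Hour become regular columns again
      .rename_axis(None, axis=1)  # drop the leftover 'Prio' label on the columns
)
print(wide)
```

Passing `fill_value=0` to `unstack` keeps the integer dtype, so the separate `fillna(0).astype(int)` step is not needed in this variant.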

Or you can try

pd.pivot_table(df,values='Value',index=['Day','Hour'],columns=['Prio'],aggfunc='sum')\
     .fillna(0).astype(int)


Out[22]: 
Prio      Content  Critical  High  Low
Day Hour                              
1   6           0         1     0    0
    16          0         1     0    0
    17          1         0     0    1
    19          0         1     0    0
    20          0         0     1    0
2   10          0         0     1    2
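`pivot_table` also accepts a `fill_value` argument, which can stand in for the separate `fillna(0)` call. A sketch of that variant, reusing the question's sample data (the name `table` is illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Day':   [1, 1, 1, 1, 1, 1, 2, 2],
    'Hour':  [6, 16, 17, 17, 19, 20, 10, 10],
    'Prio':  ['Critical', 'Critical', 'Content', 'Low',
              'Critical', 'High', 'High', 'Low'],
    'Value': [1, 1, 1, 1, 1, 1, 1, 2],
})

table = pd.pivot_table(
    df, values='Value', index=['Day', 'Hour'], columns='Prio',
    aggfunc='sum', fill_value=0,  # missing (Day, Hour, Prio) cells become 0
).astype(int)  # kept as a guard in case a pandas version returns floats
print(table)
```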

Let's use set_index, unstack, reset_index, and rename_axis:

df.set_index(['Day','Hour','Prio'])['Value']\
  .unstack().fillna(0)\
  .astype(int).reset_index()\
  .rename_axis(None, axis=1)

Output:

   Day  Hour  Content  Critical  High  Low
0    1     6        0         1     0    0
1    1    16        0         1     0    0
2    1    17        1         0     0    1
3    1    19        0         1     0    0
4    1    20        0         0     1    0
5    2    10        0         0     1    2
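The question also asks for percentage shares per day/hour. One possible sketch (not taken from the answers above) divides each row of the wide table by its row total. The names `counts` and `shares` are illustrative, and the sample data is copied from the question. A `groupby` sum is used here instead of `set_index` so that repeated (Day, Hour, Prio) combinations, which would make `unstack` fail, are added up first:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Day':   [1, 1, 1, 1, 1, 1, 2, 2],
    'Hour':  [6, 16, 17, 17, 19, 20, 10, 10],
    'Prio':  ['Critical', 'Critical', 'Content', 'Low',
              'Critical', 'High', 'High', 'Low'],
    'Value': [1, 1, 1, 1, 1, 1, 1, 2],
})

# Wide table of counts, indexed by (Day, Hour).
counts = (
    df.groupby(['Day', 'Hour', 'Prio'])['Value']
      .sum()
      .unstack(fill_value=0)
)

# Divide each row by its row total; axis=0 aligns the totals with the row index.
# The question states every row has at least one non-zero value, so no total is 0.
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(shares.round(1))
```

Because `shares` keeps `Day` and `Hour` as its index, it could be joined to a second frame keyed the same way, for example with `shares.reset_index().merge(other, on=['Day', 'Hour'])`, where `other` is a hypothetical frame standing in for the one mentioned in the question.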
