pandas: how to groupby / pivot retaining the NaNs? Converting float to str then back to float works but seems convoluted

I am tracking the month in which a certain event took place. If it hasn't happened, the "Month" field is NaN. The starting table looks like this:

+-------+----------+---------+
| Month | Category | Balance |
+-------+----------+---------+
| 1     | a        |     100 |
| nan   | a        |     300 |
| 2     | a        |     200 |
+-------+----------+---------+

I am trying to build a crosstab like this:

+-------+----------------------------------+
| Month | Category a - cumulative % amount |
+-------+----------------------------------+
|     1 |                             0.17 |
|     2 |                             0.50 |
+-------+----------------------------------+

In month 1, the event has happened for 100/600 of the balance, i.e. about 16.7%. In month 2, the event has happened, cumulatively, for (100 + 200) / 600 = 50%, where 100 is from month 1 and 200 from month 2. The 300 with a NaN month stays in the denominator.
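The arithmetic above can be sketched on the three-row table as follows. This is only an illustration of the target numbers, not a solution to the NaN-grouping problem: it drops the NaN month before grouping but keeps it in the total. The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Month": [1, np.nan, 2],
    "Category": ["a", "a", "a"],
    "Balance": [100, 300, 200],
})

# Denominator: every balance, including the row whose month is NaN.
total = df["Balance"].sum()

# Numerator: balances summed per month (NaN month excluded), then accumulated.
by_month = df.dropna(subset=["Month"]).groupby("Month")["Balance"].sum().sort_index()
cumulative_share = by_month.cumsum() / total

print(cumulative_share)
```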

My issue is with the NaNs. By default, pandas drops NaN keys from groupby / pivot / crosstab. I could convert the month field to string, so that grouping won't remove the NaNs, but then pandas sorts the months as strings, i.e. it would sort: 10, 48, 5, 6.
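A small sketch of both points, assuming pandas 1.1 or later, where groupby accepts dropna=False and keeps NaN as its own group key. The second part just shows why string months come out in the wrong order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Month": [1, np.nan, 2],
    "Category": ["a", "a", "a"],
    "Balance": [100, 300, 200],
})

# Assumes pandas >= 1.1: dropna=False keeps the NaN month as a group,
# and the keys stay floats, so they sort numerically.
by_month = df.groupby("Month", dropna=False)["Balance"].sum()
print(by_month)

# String keys sort character by character, hence 10, 48, 5, 6.
string_order = sorted(["5", "6", "10", "48"])
print(string_order)
```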

Any suggestions?

The bit below works but seems extremely convoluted:

  • I convert 'month' to string
  • Do a crosstab
  • Convert month back to float (can I do it without moving the index to a column and then the column back to the index?)
  • Sort again
  • Do the cumsum

Code:

import numpy as np
import pandas as pd

df = pd.DataFrame()
mylen = int(10e3)
df['ix'] = np.arange(0,mylen)
df['amount'] = np.random.uniform(10e3,20e3,mylen)
df['category'] = np.where( df['ix'] <=4000, 'a','b' )
df['month'] = np.random.uniform(3,48,mylen)
df['month'] = np.where( df['ix'] <=1000, np.nan, df['month'] )
df['month rounded'] = np.ceil(df['month'])

ct = pd.crosstab(
    df['month rounded'].astype(str), df['category'],
    values=df['amount'], aggfunc='sum', margins=True,
    normalize='columns', dropna=False,
)

# the index is 'month rounded' (as strings): move it to a column,
# convert it back to float, sort numerically, then restore the index
ct = ct.reset_index()
ct['month rounded'] = ct['month rounded'].astype('float32')
ct = ct.sort_values('month rounded')
ct = ct.set_index('month rounded')
ct2 = ct.cumsum(axis=0)
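On the question of converting the month back to float without the reset_index / set_index round trip: one option is to convert the index labels in place and sort the index directly. The sketch below uses a small hand-made stand-in for ct (the values are invented for illustration), not the actual crosstab output.

```python
import pandas as pd

# Hypothetical stand-in for the crosstab result: string month labels as index.
ct = pd.DataFrame(
    {"a": [0.2, 0.1, 0.3, 0.4]},
    index=pd.Index(["10.0", "5.0", "nan", "48.0"], name="month rounded"),
)

# Convert each label with Python's float(); "nan" becomes a real NaN.
ct.index = ct.index.map(float)

# sort_index orders numerically and puts NaN last by default.
ct = ct.sort_index()
ct2 = ct.cumsum(axis=0)

print(ct2)
```

Because NaN sorts last, the cumulative sum reaches the NaN row only at the end, so the rows for real months are unaffected by it.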

Use:

new_df = df.assign(cumulative=df['Balance'].mask(df['Month'].isna())
                                           .groupby(df['Category'])
                                           .cumsum()
                                           .div(df.groupby('Category')['Balance']
                                                  .transform('sum'))).dropna()
print(new_df)
   Month Category  Balance  cumulative
0    1.0        a      100    0.166667
2    2.0        a      200    0.500000
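The one-liner above accumulates in row order, so it relies on the rows already being sorted by Month. A step-by-step sketch of the same idea, with an explicit sort added as an assumption for unsorted data (the intermediate variable names are made up here):

```python
import numpy as np
import pandas as pd

# Same three rows as the question, deliberately out of month order.
df = pd.DataFrame({
    "Month": [2, np.nan, 1],
    "Category": ["a", "a", "a"],
    "Balance": [200, 300, 100],
})

# Sort so the running total follows the months; NaN months go last.
ordered = df.sort_values("Month")

# Blank out balances with no event month so they don't enter the numerator.
balance_with_event = ordered["Balance"].mask(ordered["Month"].isna())

# Running total per category, divided by the full category total
# (which still includes the NaN-month balances).
running = balance_with_event.groupby(ordered["Category"]).cumsum()
category_total = ordered.groupby("Category")["Balance"].transform("sum")

new_df = ordered.assign(cumulative=running / category_total).dropna()
print(new_df)
```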

If you want to create a DataFrame for each Category, you could create a dict:

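df_category = {i:group for i,group in new_df.groupby('Category')}

Alternatively, fill the NaN months with np.inf so that groupby keeps those rows:

df['Category a - cumulative % amount'] = (
    df.groupby(by=df.Month.fillna(np.inf))
    .apply(lambda x: x.Balance.cumsum().div(df.Balance.sum()))
    .reset_index(level=0, drop=True)
)

df.dropna()

    Month   Category    Balance Category a - cumulative % amount
0   1       a           100     0.166667
2   2       a           200     0.333333

Note that the cumsum here runs within each month group, so on this data the column holds each month's share of the total (0.333333 for month 2) rather than the running 0.50 from the question.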
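The np.inf sentinel can also be combined with a per-month sum followed by a cumulative sum across months, which gives the running share the question asks for. A minimal sketch for a single category, assuming the three-row table from the question; the names key, by_month and share are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Month": [1, np.nan, 2],
    "Category": ["a", "a", "a"],
    "Balance": [100, 300, 200],
})

# Replace NaN months with +inf: the rows survive the groupby and sort last.
key = df["Month"].fillna(np.inf)

# Sum per month, then accumulate across months and divide by the grand total.
by_month = df["Balance"].groupby(key).sum()
share = by_month.cumsum() / by_month.sum()

# Drop the sentinel row once it has served its purpose in the denominator.
share = share[np.isfinite(share.index.to_numpy())]
print(share)
```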
