pandas: how to groupby / pivot retaining the NaNs? Converting float to str then back to float works but seems convoluted
I am tracking in which month a certain event has taken place. If it hasn't, the "month" field is a NaN. The starting table looks like this:
+-------+----------+---------+
| Month | Category | Balance |
+-------+----------+---------+
| 1 | a | 100 |
| nan | a | 300 |
| 2 | a | 200 |
+-------+----------+---------+
I am trying to build a crosstab like this:
+-------+----------------------------------+
| Month | Category a - cumulative % amount |
+-------+----------------------------------+
| 1 | 0.17 |
| 2 | 0.50 |
+-------+----------------------------------+
In month 1, the event has happened for 100/600, i.e. 16.7% of the total balance. In month 2, the event has happened, cumulatively, for (100 + 200) / 600 = 50%, where 100 comes from month 1 and 200 from month 2. The 300 with no month still counts towards the total of 600.
My issue is with NaNs. By default, pandas drops NaN keys from any groupby / pivot / crosstab. I could convert the month field to string, so that grouping on it won't remove the NaNs, but then pandas sorts the months as strings, i.e. it would sort: 10, 48, 5, 6.
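To make the sorting problem concrete, here is a minimal sketch. The `months` Series is made-up sample data, not part of the real table; it only shows how the string form and the float form of the same values sort differently.

```python
import numpy as np
import pandas as pd

# Hypothetical sample months, including one missing value.
months = pd.Series([5.0, 10.0, 48.0, 6.0, np.nan])

# Sorting the string form is lexicographic, so '10.0' and '48.0' come before '5.0'.
as_text = sorted(months.astype(str))

# Sorting the float form is numeric; sort_values puts NaN last by default.
as_float = months.sort_values().tolist()

print(as_text)
print(as_float)
```

This is why the workaround has to convert back to float before sorting.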
Any suggestions?
The bit below works but seems extremely convoluted:
import numpy as np
import pandas as pd

# Build 10,000 sample rows; the first 1,001 rows (ix <= 1000) have no month.
mylen = int(10e3)
df = pd.DataFrame()
df['ix'] = np.arange(0, mylen)
df['amount'] = np.random.uniform(10e3, 20e3, mylen)
df['category'] = np.where(df['ix'] <= 4000, 'a', 'b')
df['month'] = np.random.uniform(3, 48, mylen)
df['month'] = np.where(df['ix'] <= 1000, np.nan, df['month'])
df['month rounded'] = np.ceil(df['month'])

# Cast the month to str so that crosstab keeps the missing months under a 'nan' label.
ct = pd.crosstab(df['month rounded'].astype(str), df['category'],
                 values=df['amount'], aggfunc='sum', margins=True,
                 normalize='columns', dropna=False)

# The index is 'month rounded'; cast it back to float so that it sorts numerically.
ct = ct.reset_index()
ct['month rounded'] = ct['month rounded'].astype('float32')
ct = ct.sort_values('month rounded')
ct = ct.set_index('month rounded')
ct2 = ct.cumsum(axis=0)
Use:
new_df = df.assign(cumulative=df['Balance'].mask(df['Month'].isna())
.groupby(df['Category'])
.cumsum()
.div(df.groupby('Category')['Balance']
.transform('sum'))).dropna()
print(new_df)
Month Category Balance cumulative
0 1.0 a 100 0.166667
2 2.0 a 200 0.500000
If you want to create a DataFrame for each Category, you could create a dict:
df_category = {i:group for i,group in new_df.groupby('Category')}
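As a small usage sketch of that dict: the frame below is hypothetical (the `b` row is invented purely for illustration, since the question's example only has category `a`), and each key of the dict maps to the sub-DataFrame for that category.

```python
import pandas as pd

# Hypothetical result frame; the 'b' row is made up for illustration only.
new_df = pd.DataFrame({
    'Month': [1.0, 2.0, 3.0],
    'Category': ['a', 'a', 'b'],
    'Balance': [100, 200, 50],
    'cumulative': [100 / 600, 300 / 600, 1.0],
})

# One sub-DataFrame per category, keyed by the category label.
df_category = {name: group for name, group in new_df.groupby('Category')}

print(df_category['a'])
```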
df['Category a - cumulative % amount'] = (
df.groupby(by=df.Month.fillna(np.inf))
.apply(lambda x: x.Balance.cumsum().div(df.Balance.sum()))
.reset_index(level=0, drop=True)
)
df.dropna()
Month Category Balance Category a - cumulative % amount
0 1 a 100 0.166667
2 2 a 200 0.333333