Counting NaN values in pandas group by
I have a df like this:
Country product date_install date_purchase id
BR yearly 2020-11-01-01:11:36 2020-11-01-01:11:26 10660236
CA monthly 2020-11-01-01:11:49 2020-11-01-01:11:32 10649441
US yearly 2020-11-01-01:11:54 2020-11-01-01:11:33 10660272
IT monthly 2020-11-01-11:11:01 2020-11-01-01:11:34 10657634
AE monthly 2020-11-01-01:11:38 2020-11-01-01:11:39 10661442
US NaN 2021-01-12-03:01:31 NaN 12815946
CA NaN 2020-12-04-02:12:48 NaN 11647714
US NaN 2020-12-28-11:12:54 NaN 12323174
ID NaN 2021-02-02-01:02:58 NaN 13714980
US NaN 2020-11-15-10:11:05 NaN 11056138
I am trying to get this:
country product installs purchases
BR yearly 1 1
BR NaN 100 0 # people who installed but not purchased
CA monthly 1 1
US yearly 10 10
US monthly 15 15
US NaN 500 0 # people who installed but not purchased
Or even better would be:
country installs yearly monthly total
BR 1000 10 100 110
CA 2000 50 5 55
I tried:
df.groupby(['country','product']).count().sort_values('date_install',ascending=False)
But then all the values are the same, matching the number of purchases, which would mean that everyone who installs also purchases.
date_install date_purchase id
country product
US monthly 3373 3373 3373
AU monthly 1478 1478 1478
US yearly 954 954 954
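For reference, the equal counts come from two pandas behaviors: `groupby` drops rows whose group key (`product`) is NaN by default, and `count()` only counts non-NaN cells. A minimal sketch on hypothetical data (`dropna=False` requires pandas >= 1.1):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data: one purchase, two installs without one.
df = pd.DataFrame({
    "Country": ["US", "US", "US"],
    "product": ["yearly", np.nan, np.nan],
    "date_install": ["2020-11-01", "2020-12-28", "2020-11-15"],
    "date_purchase": ["2020-11-01", np.nan, np.nan],
})

# Default: the NaN-product rows vanish because NaN group keys are dropped,
# so only purchased groups remain and every column shows equal counts.
print(df.groupby(["Country", "product"]).count())

# dropna=False keeps the NaN group; count() still skips NaN cells,
# so date_purchase is 0 there while date_install is 2.
print(df.groupby(["Country", "product"], dropna=False).count())
```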
And if I use:
df = df.replace(np.nan, 'empty', regex=True)
df.groupby(['country','product']).count().sort_values('date_install',ascending=False)
I get:
date_install date_purchase id
country product
US empty 480153 480153 480153
AU empty 334236 334236 334236
BR empty 144920 144920 144920
How can I achieve this result?
Indeed, if you follow @Paul Brennan's advice, the solution becomes quite easy. As an example, consider the following data:
Country product date_install date_purchase id
0 BR yearly 2020-01-01-01:00:00 2020-01-01-01:00:00 10660236
3 BR monthly 2020-01-01-04:00:00 2020-01-01-04:00:00 10660239
6 BR NaN 2020-01-01-07:00:00 NaN 10660242
9 BR NaN 2020-01-01-10:00:00 NaN 10660245
1 CA yearly 2020-01-01-02:00:00 2020-01-01-02:00:00 10660237
4 CA yearly 2020-01-01-05:00:00 2020-01-01-05:00:00 10660240
7 CA NaN 2020-01-01-08:00:00 NaN 10660243
10 CA yearly 2020-01-01-11:00:00 2020-01-01-11:00:00 10660246
2 US monthly 2020-01-01-03:00:00 2020-01-01-03:00:00 10660238
5 US NaN 2020-01-01-06:00:00 NaN 10660241
8 US monthly 2020-01-01-09:00:00 2020-01-01-09:00:00 10660244
11 US monthly 2020-01-01-12:00:00 2020-01-01-12:00:00 10660247
Let's say the 'not purchased' version is a demo or something like that:
df['product'] = df['product'].fillna('demo')
You can do the following:
ans = (df.groupby([df['Country'], df['product']])
         .apply(len)                     # size of each (Country, product) group
         .unstack()                      # pivot the product values into columns
         .fillna(0)                      # absent combinations become 0
         .astype(int)
         .rename_axis(columns='', index='')
         .assign(installed=lambda x: x[['demo', 'monthly', 'yearly']].sum(axis=1),
                 purchased=lambda x: x[['monthly', 'yearly']].sum(axis=1))
      )
And the resulting dataframe is as follows:
demo monthly yearly installed purchased
BR 2 1 1 4 2
CA 1 0 3 4 3
US 1 3 0 4 3
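An equivalent, arguably more direct route to the same table is `pd.crosstab`, which counts every Country x product combination in one call. A sketch on hypothetical data shaped like the example above (already after the `fillna('demo')` step):

```python
import pandas as pd

# Hypothetical data with the same per-country counts as the example.
df = pd.DataFrame({
    "Country": ["BR"] * 4 + ["CA"] * 4 + ["US"] * 4,
    "product": ["yearly", "monthly", "demo", "demo",
                "yearly", "yearly", "demo", "yearly",
                "monthly", "demo", "monthly", "monthly"],
})

# crosstab fills absent combinations with 0 (e.g. CA has no 'monthly' rows),
# so no unstack/fillna/astype chain is needed.
ans = pd.crosstab(df["Country"], df["product"])
ans = ans.assign(installed=ans.sum(axis=1),
                 purchased=ans[["monthly", "yearly"]].sum(axis=1))
print(ans)
```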
Answering your comment's question: you can't keep the dates for each user, since groupby summarizes all the information and loses those individual details.
What you can do is assign the desired columns from the resulting dataframe back to the first one (producing some duplicated values), like:
df = df.assign(purchased=df['Country'].map(ans['purchased']),
installed=df['Country'].map(ans['installed']))
This will make your first dataframe look like:
Country product date_install date_purchase id purchased installed
0 BR yearly 2020-01-01-01:00:00 2020-01-01-01:00:00 10660236 2 4
1 CA yearly 2020-01-01-02:00:00 2020-01-01-02:00:00 10660237 3 4
2 US monthly 2020-01-01-03:00:00 2020-01-01-03:00:00 10660238 3 4
...
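If the goal is only to attach per-country totals to every row, `groupby(...).transform` is an alternative to the `map` step: it returns one value per original row, so the install and purchase dates survive in place. A sketch on hypothetical data (with 'demo' marking non-purchases, as above):

```python
import pandas as pd

# Hypothetical rows after the fillna('demo') step.
df = pd.DataFrame({
    "Country": ["BR", "BR", "CA", "US", "US"],
    "product": ["yearly", "demo", "yearly", "monthly", "demo"],
    "date_install": ["2020-01-01", "2020-01-02", "2020-01-03",
                     "2020-01-04", "2020-01-05"],
})

# transform keeps the original index, so each row receives its group's
# totals while all other columns (including the dates) stay untouched.
df["installed"] = df.groupby("Country")["product"].transform("size")
df["purchased"] = (df["product"] != "demo").groupby(df["Country"]).transform("sum")
print(df)
```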
If this is not what you wanted, please let us know and we'll try to figure it out.