How to count unique non-null values in pandas dataframe using group by?

I'm having trouble figuring this out and would really appreciate some help. I have a dataframe with year, state code, var1 and var2 (which contain null and non-null values). I want to create a new dataframe that, grouped by year, counts the number of unique states with at least 1 non-null value and the total number of non-null values.

What my current df looks like:

    year    state   var1    var2    
0   2018    1       NaN     2    
1   2018    2       1       1    
2   2018    3       NaN     NaN  
3   2018    4       1       2    
4   2018    5       NaN     1   
6   2019    1       NaN     NaN  
7   2019    2       1       1    
8   2019    3       NaN     NaN  
9   2019    4       2       1    
10  2019    5       2       NaN 

What I want the new df to look like: the original df transposed, so that the years are the columns and my variables, with the conditions, are the rows.

                                                  2018    2019
var1
      Number of states with at least 1 non-null:  2       3
      Number of respondents with non-null var:    2       3
      Average                                     1       1
var2
      Number of states with at least 1 non-null:  2       2
      Number of respondents with non-null var:    4       2
      Average                                     2       1

Hopefully this makes sense. Thanks for looking!

There seems to be an issue with the data in the example: as stated, it has only one row for each (state, year) pair, which defeats the point of distinguishing between "states with at least 1 non-null value" and "total number of non-null values".
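This can be checked directly. Below is a minimal sketch, assuming the question's table is typed in as posted (the names `op`, `rows_per_pair`, `states_with_data` and `non_null_total` are mine, not from the question):

```python
import pandas as pd

nan = float('nan')
# The question's sample data, typed in as posted (one row per state per year).
op = pd.DataFrame({
    'year': [2018] * 5 + [2019] * 5,
    'state': [1, 2, 3, 4, 5] * 2,
    'var1': [nan, 1, nan, 1, nan, nan, 1, nan, 2, 2],
    'var2': [2, 1, nan, 2, 1, nan, 1, nan, 1, nan],
})

# How many rows does each (year, state) pair have?
rows_per_pair = op.groupby(['year', 'state']).size()
print(rows_per_pair.max())  # 1

# Non-null count of each variable per (year, state) pair.
per_pair = op.groupby(['year', 'state'])[['var1', 'var2']].count()
# States with at least one non-null value, per year.
states_with_data = per_pair.gt(0).groupby('year').sum()
# Total number of non-null values, per year.
non_null_total = per_pair.groupby('year').sum()
print((states_with_data == non_null_total).all().all())
```

With a single row per pair, the two counts coincide in every year, so the expected output (2 states but 4 non-null values for var2 in 2018) cannot come from this data.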

One way I can think of that would produce the expected result is if the sample data were:

import pandas as pd

nan = float('nan')
df = pd.DataFrame({
    'year': [2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019],
    'state': [1, 2, 3, 1, 2, 1, 2, 3, 4, 5],
    #                  ^  ^ changed from OP's data
    'var1': [nan, 1.0, nan, 1.0, nan, nan, 1.0, nan, 2.0, 2.0],
    'var2': [2.0, 1.0, nan, 2.0, 1.0, nan, 1.0, nan, 1.0, nan],
})

In that case, we can get the expected result with the following:

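# non-null count of each variable per (year, state)
c = df.groupby(['year', 'state']).count()
res = (
    # c/c is 1 where the count is non-zero and NaN (0/0) where it is zero
    pd.concat([c/c, c], keys=['uniq', 'cnt'], axis=1)
    .groupby('year').sum().astype(int).T
    .swaplevel().sort_index(ascending=[True, False])
)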
>>> res
year       2018  2019
var1 uniq     2     3
     cnt      2     3
var2 uniq     2     2
     cnt      4     2

Alternatively (and a bit less hacky):

import numpy as np

c = df.groupby(['year', 'state']).count()
res = c.groupby('year').agg([np.count_nonzero, 'sum']).T
res.index = res.index.set_levels(['uniq', 'cnt'], level=1)
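The question also lists an "Average" row, which neither snippet computes. The following is one possible extension, assuming "Average" means the mean of the non-null values per year. The question does not define it, and its expected numbers do not obviously follow from this definition, so treat this as a guess. The sketch is self-contained and uses the modified sample data from above; `vars_`, `per_state` and the `'mean'` label are my own names.

```python
import pandas as pd

nan = float('nan')
# Modified sample data from above (states 1 and 2 appear twice in 2018).
df = pd.DataFrame({
    'year': [2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019],
    'state': [1, 2, 3, 1, 2, 1, 2, 3, 4, 5],
    'var1': [nan, 1.0, nan, 1.0, nan, nan, 1.0, nan, 2.0, 2.0],
    'var2': [2.0, 1.0, nan, 2.0, 1.0, nan, 1.0, nan, 1.0, nan],
})

vars_ = ['var1', 'var2']
# Non-null count of each variable per (year, state).
per_state = df.groupby(['year', 'state'])[vars_].count()

res = pd.concat(
    {
        'uniq': per_state.gt(0).groupby('year').sum(),  # states with >= 1 non-null
        'cnt': per_state.groupby('year').sum(),         # total non-null values
        'mean': df.groupby('year')[vars_].mean(),       # assumed meaning of "Average"
    },
    axis=1,
).T.swaplevel()

# Put the rows in an explicit (variable, statistic) order.
res = res.reindex(pd.MultiIndex.from_product([vars_, ['uniq', 'cnt', 'mean']]))
print(res)
```

Because the transposed frame mixes integer counts with float means, every value comes out as a float. Format or cast the count rows separately if integers are needed.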
