
How to count unique non-null values in pandas dataframe using group by?

I'm having trouble figuring this out, and would really appreciate some help. I have a dataframe with year, state code, var1 and var2 (which contain null and non-null values). I want to create a new dataframe that counts the number of unique states with at least one non-null value, and the total number of non-null values, grouped by year.

What my current df looks like:

    year    state   var1    var2    
0   2018    1       NaN     2    
1   2018    2       1       1    
2   2018    3       NaN     NaN  
3   2018    4       1       2    
4   2018    5       NaN     1   
6   2019    1       NaN     NaN  
7   2019    2       1       1    
8   2019    3       NaN     NaN  
9   2019    4       2       1    
10  2019    5       2       NaN 

What I want the new df to look like: the original df transposed so that the years are the column headers and my variables, with the conditions, are the rows.

                                                  2018    2019
var1
      Number of states with at least 1 non-null:  2       3
      Number of respondents with non-null var:    2       3
      Average                                     1       1
var2
      Number of states with at least 1 non-null:  2       2
      Number of respondents with non-null var:    4       2
      Average                                     2       1

Hopefully this makes sense. Thanks for looking!

There seems to be an issue with the data in the example: as stated, that data has only one row for each (state, year) pair, which defeats the purpose of distinguishing between "states with at least 1 non-null value" and "total number of non-null values".
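
A quick way to see this (assuming the posted table were loaded into a dataframe named df):

df.groupby(['year', 'state']).size().max()
# -> 1: every (year, state) pair occurs exactly once, so for each variable and
#    year the two counts would always come out identical.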

One way I can think of that would produce the expected result is if the sample data was:

import pandas as pd

nan = float('nan')
df = pd.DataFrame({
    'year': [2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019],
    'state': [1, 2, 3, 1, 2, 1, 2, 3, 4, 5],
    #                  ^  ^ changed from OP's data
    'var1': [nan, 1.0, nan, 1.0, nan, nan, 1.0, nan, 2.0, 2.0],
    'var2': [2.0, 1.0, nan, 2.0, 1.0, nan, 1.0, nan, 1.0, nan],
})
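
With that change, states 1 and 2 each appear twice in 2018, so the two counts can genuinely differ. A quick, purely illustrative check on the df just built:

df.groupby(['year', 'state']).size()
# 2018/state 1 and 2018/state 2 now occur twice; every other (year, state)
# pair occurs once.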

In that case, we can get the expected result with the following:

c = df.groupby(['year', 'state']).count()
res = (
    pd.concat([c/c, c], keys=['uniq', 'cnt'], axis=1)
    .groupby('year').sum(0).astype(int).T
    .swaplevel().sort_index(ascending=[True, False])
)
>>> res
year       2018  2019
var1 uniq     2     3
     cnt      2     3
var2 uniq     2     2
     cnt      4     2
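
The c/c part is the hack here: c holds the number of non-null values per (year, state), so c/c is 1 wherever a state has at least one non-null value and NaN (from 0/0) where it has none, and summing per year therefore counts such states. A more explicit equivalent (just a sketch against the same df, not the code above):

c = df.groupby(['year', 'state']).count()   # non-null values per (year, state)
uniq = c.gt(0).groupby('year').sum()        # states with at least 1 non-null value
cnt = c.groupby('year').sum()               # total non-null values per year
res = (
    pd.concat([uniq, cnt], keys=['uniq', 'cnt'], axis=1)
    .T.swaplevel().sort_index(ascending=[True, False])
)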

Alternatively (and a bit less hacky):

import numpy as np

c = df.groupby(['year', 'state']).count()
res = c.groupby('year').agg([np.count_nonzero, sum]).T
res.index = res.index.set_levels(['uniq', 'cnt'], level=1)
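
This should give the same table as above (no astype(int) needed, since count_nonzero and sum already return integers here). Printing it should show something like:

>>> res
year       2018  2019
var1 uniq     2     3
     cnt      2     3
var2 uniq     2     2
     cnt      4     2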
