pandas group by and find first non null value for all columns

Question

I have a pandas DataFrame as below:

id  age   gender  country  sales_year
1   None   M       India    2016
2   23     F       India    2016
1   20     M       India    2015
2   25     F       India    2015
3   30     M       India    2019
4   36     None    India    2019

I want to group by id and take the latest row per id according to sales_year, with every column filled by its latest non-null value.

Expected output:

id  age   gender  country  sales_year
1   20     M       India    2016
2   23     F       India    2016
3   30     M       India    2019
4   36     None    India    2019

In PySpark I would do:

df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))

But I need the same solution in pandas.

EDIT: This can be the case for all columns, not just age. I need it to pick up the latest non-null data (where the id exists) for all ids.

Answer 1

Use GroupBy.first, which returns the first non-null value of each column within each group:

df1 = df.groupby('id', as_index=False).first()
print(df1)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019

If the rows are not already sorted by sales_year in descending order, sort first:

df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print(df2)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
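
As a self-contained check, the sketch below rebuilds the question's frame and applies the sorted variant. It assumes the missing entries are real nulls (None/NaN) rather than the string 'None'; that is an assumption about the asker's data, and the name latest is only illustrative.

```python
import pandas as pd

# Sample data from the question; None is assumed to be a real null here.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [None, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Newest rows first, then take the first non-null value per column and id.
latest = (
    df.sort_values("sales_year", ascending=False)
      .groupby("id", as_index=False)
      .first()
)
print(latest)
```

Because first() works column by column, id 1 gets its age from the 2015 row while sales_year still comes from the 2016 row.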

Answer 2

Drop rows with a null age, sort by sales_year descending, then take the first age per id -

df.dropna(subset=['age']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()

Output

id
1    20
2    23
3    30
4    36
Name: age, dtype: object

Remove the ['age'] to get full rows -

df.dropna().sort_values('sales_year', ascending=False).groupby('id').first()

Output

   age gender country  sales_year
id                               
1   20      M   India        2015
2   23      F   India        2016
3   30      M   India        2019
4   36   None   India        2019

You can put the id back as a column with reset_index() -

df.dropna().sort_values('sales_year', ascending=False).groupby('id').first().reset_index()

Output

   id age gender country  sales_year
0   1  20      M   India        2015
1   2  23      F   India        2016
2   3  30      M   India        2019
3   4  36   None   India        2019
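
To see how dropping rows before grouping differs from letting first() skip nulls column by column, here is a small comparison sketch. It assumes the missing entries are real nulls (None/NaN), which is an assumption about the data; the names by_year, row_wise and column_wise are illustrative only.

```python
import pandas as pd

# Sample data from the question; None is assumed to be a real null here.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [None, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

by_year = df.sort_values("sales_year", ascending=False)

# Row-wise: any row containing a null is discarded before grouping.
row_wise = by_year.dropna().groupby("id").first().reset_index()

# Column-wise: each column independently takes its first non-null value.
column_wise = by_year.groupby("id").first().reset_index()

print(row_wise)
print(column_wise)
```

Under that assumption, dropna() without subset removes every row that has any null, so an id whose only row contains a null disappears from row_wise, and the remaining values for an id all come from one complete row.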

Answer 3

import numpy as np

print(df.replace('None', np.nan).groupby('id').first())

First replace the string 'None' with NaN.
Next use groupby() to group by 'id'.
Finally take the first non-null value of each column in each group with first().
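
The replace step only matters when the missing entries are the literal string 'None'. A minimal sketch of that case follows; storing age as strings is an assumption made here to illustrate it, and the names cleaned and result are illustrative.

```python
import numpy as np
import pandas as pd

# Here the gaps are the literal string 'None', so pandas does not treat them as null.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": ["None", "23", "20", "25", "30", "36"],
    "gender": ["M", "F", "M", "F", "M", "None"],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Turn the placeholder strings into real nulls so first() can skip them.
cleaned = df.replace("None", np.nan)
result = cleaned.groupby("id").first()
print(result)
```

Without the replace, first() would return the string 'None' as the age for id 1, because a string is not a null value.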

3 answers

solution1
13 ACCEPTED 2019-11-26 10:16:21

solution2
0 2019-11-26 10:12:56

solution3
0 2019-11-26 10:20:06
