
Pandas groupby unique issue

I have a dataframe 'region_group'. As shown below, it has no 'ARTHOG' value in the 'Town/City' column. However, when I group by this column and call first(), the value reappears. I am trying to understand why this happens.

Note: the region_group dataframe is derived from another dataframe that does have 'ARTHOG' in its 'Town/City' column, but that value was filtered out by the condition shown below, as is also evident in Out[25].

region=k[['my_ID','Town/City','District','County','month','year']]
region=region.loc[(region['month'] == 12) & (region['year'] == 2016)]
region_noid=region.drop(['my_ID','month','year'], axis=1)

region_group=region_noid.groupby(['Town/City','District','County']).size().reset_index(name='Count')


A categorical column carries its categories along with it. When a category has no rows left, groupby still keeps that category in the result and fills its value with NaN:

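import pandas as pd

df=pd.DataFrame({'A':[1,1,3,4,5],'B':[1,2,2,2,2]})
# astype('category', categories=...) only works on old pandas; newer versions take a CategoricalDtype
df.A=df.A.astype(pd.CategoricalDtype(categories=[1,2,3,4,5]))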

df.groupby('A').B.first()
Out[905]: 
A
1    1.0
2    NaN
3    2.0
4    2.0
5    2.0
Name: B, dtype: float64
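
The same mechanism would explain the question's case, where a value was filtered out rather than never present. The following is a minimal sketch, assuming 'Town/City' is stored as a categorical column; the dataframe `towns` and the value 'BALA' are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical data standing in for the question's dataframe
towns = pd.DataFrame({
    'Town/City': pd.Categorical(['ARTHOG', 'BALA', 'BALA']),
    'month': [11, 12, 12],
})

# Filtering drops the ARTHOG row...
december = towns.loc[towns['month'] == 12]

# ...but the column's dtype still lists ARTHOG as a category
remaining_values = list(december['Town/City'].unique())
known_categories = list(december['Town/City'].cat.categories)
print(remaining_values)
print(known_categories)
```

The filtered frame contains only 'BALA' rows, yet 'ARTHOG' is still a known category, which is why a later groupby can bring it back.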

Solution: convert the column back to str or a numeric dtype:

df.A=df.A.astype(int)
df.groupby('A').B.first()
Out[907]: 
A
1    1
3    2
4    2
5    2
Name: B, dtype: int64

Alternatively, keep the categorical dtype and call remove_unused_categories on the column:

df.A=df.A.cat.remove_unused_categories()
df.groupby('A').B.first()
Out[910]: 
A
1    1
3    2
4    2
5    2
Name: B, dtype: int64

In groupby operations, pandas uses the Cartesian product of the categories of all categorical grouping columns to determine the index of the output. This means a category appears in the groupby result even if it is not represented in the underlying data.

Details of this, as well as possible solutions, can be found in my question challenging the purpose of this behaviour: Pandas groupby with categories

The pandas development team has taken the stance that all combinations of categories must be represented in groupby operations on categorical series.
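
A small sketch of that product behaviour, using made-up column names ('town', 'county', 'value') and categories. `observed=False` is passed explicitly here as an assumption about the intended behaviour, since the default of that parameter has not been the same in every pandas release.

```python
import pandas as pd

# Hypothetical frame: only two (town, county) pairs actually occur
df = pd.DataFrame({
    'town': pd.Categorical(['X', 'Y'], categories=['X', 'Y', 'Z']),
    'county': pd.Categorical(['P', 'Q'], categories=['P', 'Q']),
    'value': [1, 2],
})

# With observed=False the result index is the full product of categories
sizes = df.groupby(['town', 'county'], observed=False).size()
print(sizes)
print(len(sizes))  # 3 town categories x 2 county categories
```

Only two of the rows in the result have a non-zero size; the remaining combinations exist purely because their categories are declared in the dtypes.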

As of pandas 0.23.0, the groupby method accepts an 'observed' parameter. Setting it to True (it was False by default in that release) resolves this issue.
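
Applied to the small example from the earlier answer, this might look like the sketch below. It rebuilds that dataframe with `pd.CategoricalDtype` so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3, 4, 5], 'B': [1, 2, 2, 2, 2]})
# Category 2 is declared but never occurs in the data
df['A'] = df['A'].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]))

# observed=True keeps only the categories that actually appear
result = df.groupby('A', observed=True)['B'].first()
print(result)
```

Unlike the conversion to int, this leaves the column categorical, so the declared categories remain available for other operations.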

