I have a dataframe 'region_group'. As shown below, it has no 'ARTHOG' value in the 'Town/City' column. However, when I do a groupby followed by first() on this column, the value pops back in. I am trying to understand why this happens.
Note: the region_group dataframe is derived from another dataframe that does have 'ARTHOG' as a value in the 'Town/City' column, but that value has been filtered out with the condition shown below, as is also evident in Out[25].
region=k[['my_ID','Town/City','District','County','month','year']]
region=region.loc[(region['month'] == 12) & (region['year'] == 2016)]
region_noid=region.drop(['my_ID','month','year'], axis=1)
region_group=region_noid.groupby(['Town/City','District','County']).size().reset_index(name='Count')
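Categorical data carries its categories along with it. When a category has no rows left, groupby still keeps that category in the result and fills its value with NaN.
import pandas as pd
df=pd.DataFrame({'A':[1,1,3,4,5],'B':[1,2,2,2,2]})
# astype('category', categories=...) was removed in later pandas versions; pass a CategoricalDtype instead
df.A=df.A.astype(pd.CategoricalDtype(categories=[1,2,3,4,5]))
# observed=False was the default when this answer was written; it is spelled out here so the output is reproducible
df.groupby('A', observed=False).B.first()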
Out[905]:
A
1 1.0
2 NaN
3 2.0
4 2.0
5 2.0
Name: B, dtype: float64
Solution: convert the column back to str or a numeric type
df.A=df.A.astype(int)
df.groupby('A').B.first()
Out[907]:
A
1 1
3 2
4 2
5 2
Name: B, dtype: int64
Or, while the column is still categorical, use remove_unused_categories
df.A=df.A.cat.remove_unused_categories()
df.groupby('A').B.first()
Out[910]:
A
1 1
3 2
4 2
5 2
Name: B, dtype: int64
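Applied to the situation in the question, the same fix might look like the sketch below. It assumes that 'Town/City' is a categorical column (the question does not show the dtypes), and the rows and the town name 'BALA' are made up purely for illustration. Filtering rows does not shrink the set of categories, which is why 'ARTHOG' can reappear after a groupby.

```python
import pandas as pd

# Hypothetical stand-in for the original dataframe 'k'; values are invented.
k = pd.DataFrame({
    "Town/City": pd.Categorical(["ARTHOG", "BALA", "BALA"]),
    "month": [11, 12, 12],
    "year": [2016, 2016, 2016],
})

# Same filter as in the question: only the two 'BALA' rows survive.
region = k.loc[(k["month"] == 12) & (k["year"] == 2016)].copy()

# The filter removed the ARTHOG row but not the ARTHOG category.
categories_before = list(region["Town/City"].cat.categories)

# Drop categories that no longer occur in the data.
region["Town/City"] = region["Town/City"].cat.remove_unused_categories()
categories_after = list(region["Town/City"].cat.categories)

# Even with observed=False, the unused category can no longer show up.
counts = region.groupby("Town/City", observed=False).size()
print(categories_before, categories_after)
print(counts)
```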
Pandas uses the product of all categorical columns in groupby operations to determine the index of the output. This means that even if a category is not represented in the underlying data, it will be represented in groupby results.
Details of this, as well as possible solutions, can be found in my question challenging the purpose of this behaviour: Pandas groupby with categories
The pandas development team has taken the stance that all combinations of categories must be represented in groupby operations on categorical series.
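A small sketch of what "all combinations" means in practice. The column names 'town' and 'county' and their values are hypothetical, and observed=False is passed explicitly so the behaviour does not depend on the installed version's default.

```python
import pandas as pd

# Two categorical columns: 3 town categories x 2 county categories.
# Only two (town, county) pairs actually occur in the data.
df = pd.DataFrame({
    "town": pd.Categorical(["A", "B"], categories=["A", "B", "C"]),
    "county": pd.Categorical(["X", "Y"], categories=["X", "Y"]),
})

# With observed=False the result is indexed by the full cartesian product
# of the categories, with a size of 0 for combinations that never occur.
counts = df.groupby(["town", "county"], observed=False).size()
print(counts)
```

Two rows of input produce one output row per category combination, which is how a value that is absent from the data can still appear in the result.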
As of pandas 0.23.0, the groupby method accepts an 'observed' parameter. Setting it to True (the default was False at that time) resolves this issue.
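A minimal sketch of the observed parameter, reusing the small example frame from the earlier answer. Both values are passed explicitly so the comparison does not rely on whichever default the installed pandas version uses.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 3, 4, 5], "B": [1, 2, 2, 2, 2]})
# Category 2 is declared but never occurs in the data.
df["A"] = df["A"].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]))

# observed=False keeps the unused category and fills it with NaN.
with_unused = df.groupby("A", observed=False)["B"].first()

# observed=True keeps only categories that actually appear in the data.
only_observed = df.groupby("A", observed=True)["B"].first()

print(with_unused)
print(only_observed)
```

Unlike converting the column to int or calling remove_unused_categories, this leaves the column's dtype and its category list untouched.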