Remove group of empty or nan in pandas groupby
I have a dataframe with empty (NaN) values in some rows. Example below:
s = pd.DataFrame([[39877380,158232151,20], [39877380,332086469,], [39877380,39877381,14], [39877380,39877383,8], [73516838,6439138,1], [73516838,6500551,], [735571896,203559638,], [735571896,282186552,], [736453090,6126187,], [673117474,12196071,], [673117474,12209800,], [673117474,618058747,6]], columns=['start','end','total'])
When I group by the start and end columns:
s.groupby(['start', 'end']).total.sum()
the output I get is:
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
735571896 203559638 nan
282186552 nan
736453090 6126187 nan
I want to exclude every start group in which the total is NaN for all of its end values. Expected output:
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
I tried dropna(), but it removes every NaN value rather than only the groups that are entirely NaN.
I am new to Python and pandas. Can someone help me with this? Thank you.
In newer pandas versions you need to pass min_count=1 to sum, so that groups containing only missing values return NaN instead of 0:
s1 = s.groupby(['start', 'end']).total.sum(min_count=1)
# solution for older pandas versions
#s1 = s.groupby(['start', 'end']).total.sum()
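To see why min_count matters here, a minimal sketch on a toy frame (the names toy, key and val are illustrative and not part of the question's data) compares the two calls. It assumes a pandas version in which the sum of an all-NaN group defaults to 0:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): group "b" contains only missing values.
toy = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, np.nan, np.nan]})

# Default behaviour: NaN is skipped, so an all-NaN group sums to 0.0.
default_sum = toy.groupby("key")["val"].sum()

# With min_count=1 a group needs at least one valid value, otherwise it stays NaN.
strict_sum = toy.groupby("key")["val"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

Without min_count=1 the all-NaN groups would become 0 and could no longer be told apart from groups whose totals really add up to zero.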
Then keep only the first-level groups that have at least one non-missing value. Build a mask with Series.notna, spread the per-group result back over every row with GroupBy.transform and GroupBy.any, and filter by boolean indexing:
s2 = s1[s1.notna().groupby(level=0).transform('any')]
# solution for older pandas versions
#s2 = s1[s1.notnull().groupby(level=0).transform('any')]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64
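The one-liner above can be unpacked into named steps. This is a sketch only: it rebuilds the question's frame with explicit np.nan for the missing totals (an assumption made for clarity, since the original relies on short rows being padded), and the names has_value and keep are illustrative:

```python
import numpy as np
import pandas as pd

nan = np.nan
rows = [
    [39877380, 158232151, 20], [39877380, 332086469, nan],
    [39877380, 39877381, 14], [39877380, 39877383, 8],
    [73516838, 6439138, 1], [73516838, 6500551, nan],
    [735571896, 203559638, nan], [735571896, 282186552, nan],
    [736453090, 6126187, nan], [673117474, 12196071, nan],
    [673117474, 12209800, nan], [673117474, 618058747, 6],
]
s = pd.DataFrame(rows, columns=["start", "end", "total"])

s1 = s.groupby(["start", "end"])["total"].sum(min_count=1)

# Step 1: True where a (start, end) pair has a real total.
has_value = s1.notna()

# Step 2: per start group, True on every row if any row in the group is True.
keep = has_value.groupby(level=0).transform("any")

# Step 3: boolean indexing keeps whole groups, including their NaN rows.
s2 = s1[keep]
print(s2)
```

Because transform returns a result aligned with the original index, the mask can be applied to s1 directly, and the NaN rows inside surviving groups are preserved.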
Alternatively, get the first-level index values with MultiIndex.get_level_values, take the unique values that have at least one non-missing total, and select those groups with loc:
idx = s1.index.get_level_values(0)
s2 = s1.loc[idx[s1.notna()].unique()]
# solution for older pandas versions
#s2 = s1.loc[idx[s1.notnull()].unique()]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64
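As a quick sanity check, the two approaches can be compared on a small hand-built Series. The data below is hypothetical, and the mask is converted with to_numpy() before indexing the level values, which is a cautious variation on the snippet above rather than a requirement:

```python
import numpy as np
import pandas as pd

# Small stand-in for s1: start group 3 has no valid total at all.
s1 = pd.Series(
    [14.0, np.nan, 1.0, np.nan, np.nan],
    index=pd.MultiIndex.from_tuples(
        [(1, 10), (1, 11), (2, 20), (3, 30), (3, 31)],
        names=["start", "end"],
    ),
    name="total",
)

# Approach 1: per-group mask broadcast back to every row.
by_transform = s1[s1.notna().groupby(level=0).transform("any")]

# Approach 2: collect the start values that have a valid total, then select them.
idx = s1.index.get_level_values(0)
keep_starts = idx[s1.notna().to_numpy()].unique()
by_loc = s1.loc[keep_starts]

# Both should keep the same (start, end) pairs.
same_rows = by_transform.index.tolist() == by_loc.index.tolist()
print(same_rows)
```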