Counting grouped data with missing values in pandas dataframe
I am trying to do something like this, but on a much larger dataframe (called Clean):
import numpy as np
import pandas as pd

d = {'rx': [1, 1, 1, 1, 2.1, 2.1, 2.1, 2.1],
     'vals': [np.nan, 10, 10, 20, np.nan, 10, 20, 20]}
df = pd.DataFrame(d)
arrays = [df.rx, df.vals]
index = pd.MultiIndex.from_arrays(arrays, names=['rx', 'vals'])
df.index = index
Hist = df.groupby(level=('rx', 'vals'))
Hist.count('vals')
This seems to work just fine, but when I apply the same approach to even a subset of the Clean dataframe (substituting a column 'LagBin' for 'vals'), I get an error:
df1 = pd.DataFrame(data=Clean, columns=['rx', 'LagBin'])
df1 = df1.head(n=20)
arrays = [df1.rx, df1.LagBin]
index = pd.MultiIndex.from_arrays(arrays, names=['rx', 'LagBin'])
df1.index = index
Hist = df1.groupby(level=('rx', 'LagBin'))
Hist.count('LagBin')
Specifically, Hist.count('LagBin') raises a ValueError:
ValueError: Cannot convert NA to integer
I have looked at the data structure, and it all seems exactly the same.
Here is the data that produces the error:
rx     LagBin  rx     LagBin
139.1  nan     139.1
139.1  0       139.1  0
139.1  0       139.1  0
139.1  0       139.1  0
141.1  nan     141.1
141.1  10      141.1  10
141.1  20      141.1  20
193    nan     193
193    50      193    50
193    20      193    20
193    3600    193    3600
193    50      193    50
193    0       193    0
193    20      193    20
193    10      193    10
193    110     193    110
193    80      193    80
193    460     193    460
193    30      193    30
193    0       193    0
while the original routine that works produces this:
rx   vals  rx   vals
1    nan   1
1    10    1    10
1    10    1    10
1    20    1    20
2.1  nan   2.1
2.1  10    2.1  10
2.1  20    2.1  20
2.1  20    2.1  20
What is different about these datasets that produces this error?
If I'm understanding your question correctly, I believe what you want is:
Hist.agg(len).dropna()
The full code implementation looks like this:
import numpy as np
import pandas as pd

nan = np.nan
d = {'rx': [139.1, 139.1, 139.1, 139.1, 141.1, 141.1, 141.1, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193],
     'vals': [nan, 0, 0, 0, nan, 10, 20, nan, 50, 20, 3600, 50, 0, 20, 10, 110, 80, 460, 30, 0]}
df = pd.DataFrame(d)
arrays = [df.rx, df.vals]
index = pd.MultiIndex.from_arrays(arrays, names=['rx', 'vals'])
df.index = index
# Group on both index levels, then count the rows in each group.
Hist = df.groupby(level=['rx', 'vals'])
print(Hist.agg(len).dropna())
Where df looks like:
               rx  vals
rx    vals
139.1 NaN   139.1   NaN
      0     139.1     0
      0     139.1     0
      0     139.1     0
141.1 NaN   141.1   NaN
      10    141.1    10
      20    141.1    20
193.0 NaN   193.0   NaN
      50    193.0    50
      20    193.0    20
      3600  193.0  3600
      50    193.0    50
      0     193.0     0
      20    193.0    20
      10    193.0    10
      110   193.0   110
      80    193.0    80
      460   193.0   460
      30    193.0    30
      0     193.0     0
And the output of Hist.agg(len).dropna() looks like:
            rx  vals
rx    vals
139.1 0      3     3
141.1 10     1     1
      20     1     1
193.0 0      2     2
      10     1     1
      20     2     2
      30     1     1
      50     2     2
      80     1     1
      110    1     1
      460    1     1
      3600   1     1
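As a cross-check on the counts above, here is a minimal sketch that groups on the two columns directly instead of building a MultiIndex first. It assumes a reasonably recent pandas, where groupby drops rows whose group key is NaN by default; the variable name counts is only illustrative.

```python
import numpy as np
import pandas as pd

# Same 20 rows as in the answer above.
df = pd.DataFrame({
    "rx": [139.1] * 4 + [141.1] * 3 + [193.0] * 13,
    "vals": [np.nan, 0, 0, 0, np.nan, 10, 20, np.nan,
             50, 20, 3600, 50, 0, 20, 10, 110, 80, 460, 30, 0],
})

# Group on the columns themselves; rows with a NaN key are left out
# by default, so no explicit dropna() is needed here.
counts = df.groupby(["rx", "vals"]).size()
print(counts)
```

size() returns one count per (rx, vals) pair as a Series, so there is a single count column rather than one per original column.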
That looks right. I have been tinkering with groupby and came up with this solution, which seems more elegant and does not require explicitly dealing with the NaNs:
df1 = pd.DataFrame(data=Clean, columns=['rx', 'LagBin'])
df1 = df1.head(n=20)
# Assign the result so that it can be printed below.
LagCount = df1["rx"].groupby([df1["rx"], df1["LagBin"]]).count().reset_index(name="Count")
print(LagCount)
which gives me:
       rx  LagBin  Count
0   139.1       0      3
1   141.1      10      1
2   141.1      20      1
3   193.0       0      2
4   193.0      10      1
5   193.0      20      2
6   193.0      30      1
7   193.0      50      2
8   193.0      80      1
9   193.0     110      1
10  193.0     460      1
11  193.0    3600      1
I like this better because I retain the values as columns rather than as index levels, which I assume will make plotting easier later.
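If the rows with a missing LagBin should be counted too rather than silently dropped, one possible variation is sketched below. It assumes pandas 1.1 or later, where groupby accepts dropna=False; the small df1 here is a hypothetical stand-in for the first rows of Clean, and lag_count is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first few rows of Clean.
df1 = pd.DataFrame({
    "rx": [139.1, 139.1, 139.1, 139.1, 141.1, 141.1, 141.1],
    "LagBin": [np.nan, 0, 0, 0, np.nan, 10, 20],
})

# dropna=False keeps NaN as its own group key (assumes pandas >= 1.1).
lag_count = (
    df1.groupby(["rx", "LagBin"], dropna=False)
       .size()
       .reset_index(name="Count")
)
print(lag_count)
```

The result keeps rx, LagBin and Count as ordinary columns, with one extra row per rx value for the missing LagBin entries.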