Counting grouped data with missing values in a Pandas DataFrame

Question

I am trying to do something like this, but on a much larger dataframe (called Clean):

import numpy as np
import pandas as pd

d = {'rx': [1, 1, 1, 1, 2.1, 2.1, 2.1, 2.1],
     'vals': [np.nan, 10, 10, 20, np.nan, 10, 20, 20]}
df = pd.DataFrame(d)


arrays = [df.rx,df.vals]                    
index = pd.MultiIndex.from_arrays(arrays, names = ['rx','vals'])           
df.index = index

Hist=df.groupby(level=('rx','vals'))
Hist.count('vals')

This seems to work just fine, but when I apply the same approach to even a subset of the Clean dataframe (substituting a column 'LagBin' for 'vals'), I get an error:

df1=pd.DataFrame(data=Clean,columns=('rx','LagBin'))
df1=df1.head(n=20)

arrays = [df1.rx,df1.LagBin]                    
index = pd.MultiIndex.from_arrays(arrays, names = ['rx','LagBin'])            
df1.index = index

Hist=df1.groupby(level=('rx','LagBin'))
Hist.count('LagBin')

Specifically, Hist.count('LagBin') raises a ValueError:

ValueError: Cannot convert NA to integer

I have looked at the data structure, and it all seems exactly the same.

Here is the data that produces the error:

rx  LagBin  rx  LagBin
139.1  nan  139.1   
139.1  0    139.1   0
139.1  0    139.1   0
139.1  0    139.1   0
141.1  nan  141.1   
141.1  10   141.1   10
141.1  20   141.1   20
193    nan  193 
193    50   193     50
193    20   193     20
193    3600 193     3600
193    50   193     50
193    0    193     0
193    20   193     20
193    10   193     10
193    110  193     110
193    80   193     80
193    460  193     460
193    30   193     30
193    0    193     0

while the original routine, which works, produces this:

rx  vals    rx  vals
1   nan     1   
1   10      1   10
1   10      1   10 
1   20      1   20
2.1 nan     2.1 
2.1 10      2.1 10
2.1 20      2.1 20
2.1 20      2.1 20

What is different about these datasets that produces this error?

Answer 1

If I'm understanding your question correctly, I believe what you want is:

Hist.agg(len).dropna()

The full code looks like this:

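import numpy as np
import pandas as pd

d = {'rx': [139.1, 139.1, 139.1, 139.1, 141.1, 141.1, 141.1, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193],
     'vals': [np.nan, 0, 0, 0, np.nan, 10, 20, np.nan, 50, 20, 3600, 50, 0, 20, 10, 110, 80, 460, 30, 0]}
df = pd.DataFrame(d)

# Use both columns as a two-level index so the grouping can run on index levels
arrays = [df.rx, df.vals]
index = pd.MultiIndex.from_arrays(arrays, names=['rx', 'vals'])
df.index = index

# Count rows per (rx, vals) pair; dropna() discards result rows that contain NaN
Hist = df.groupby(level=['rx', 'vals'])
print(Hist.agg(len).dropna())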

Where df looks like:

             rx  vals
rx    vals             
139.1 NaN   139.1   NaN
      0     139.1     0
      0     139.1     0
      0     139.1     0
141.1 NaN   141.1   NaN
      10    141.1    10
      20    141.1    20
193.0 NaN   193.0   NaN
      50    193.0    50
      20    193.0    20
      3600  193.0  3600
      50    193.0    50
      0     193.0     0
      20    193.0    20
      10    193.0    10
      110   193.0   110
      80    193.0    80
      460   193.0   460
      30    193.0    30
      0     193.0     0

And the output of Hist.agg(len).dropna() looks like:

             rx  vals
rx    vals          
139.1 0      3     3
141.1 10     1     1
      20     1     1
193.0 0      2     2
      10     1     1
      20     2     2
      30     1     1
      50     2     2
      80     1     1
      110    1     1
      460    1     1
      3600   1     1
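As a side note, on pandas 1.1 or later the same counts can probably be obtained without building a MultiIndex first, and groupby accepts a dropna argument that controls whether NaN keys form their own groups. A minimal sketch, assuming a recent pandas and reusing the data above (the names counts and counts_with_nan are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'rx': [139.1, 139.1, 139.1, 139.1, 141.1, 141.1, 141.1,
           193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193],
    'vals': [np.nan, 0, 0, 0, np.nan, 10, 20, np.nan, 50, 20, 3600,
             50, 0, 20, 10, 110, 80, 460, 30, 0],
})

# Default (dropna=True): rows whose key contains NaN are left out of the result.
counts = df.groupby(['rx', 'vals']).size()

# dropna=False (assumes pandas >= 1.1): NaN keys are kept as their own groups.
counts_with_nan = df.groupby(['rx', 'vals'], dropna=False).size()

print(counts)
print(counts_with_nan)
```

With the default setting there is nothing left to drop afterwards; dropna=False is only needed if the number of rows with a missing value per rx is itself of interest.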

Answer 2

That looks right. I have been tinkering with groupby and came up with this solution, which seems more elegant and does not require explicitly dealing with the NaNs:

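df1 = pd.DataFrame(data=Clean, columns=('rx', 'LagBin'))
df1 = df1.head(n=20)

# Count rows per (rx, LagBin) pair; the keys are passed as a list,
# and the result is assigned so that it can be printed
LagCount = df1["rx"].groupby([df1["rx"], df1["LagBin"]]).count().reset_index(name="Count")
print(LagCount)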

which gives me:

       rx  LagBin  Count
0   139.1       0      3
1   141.1      10      1
2   141.1      20      1
3   193.0       0      2
4   193.0      10      1
5   193.0      20      2
6   193.0      30      1
7   193.0      50      2
8   193.0      80      1
9   193.0     110      1
10  193.0     460      1
11  193.0    3600      1

I like this better because the values stay as ordinary columns rather than index levels, which I assume will make plotting easier later.
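To illustrate why flat columns can be convenient downstream, here is a rough sketch that reshapes such a count table into one column per rx, a layout many plotting calls accept directly. It assumes a stand-in frame built from the 20 rows shown in the question, since Clean itself is not available; sample, lag_count and wide are hypothetical names.

```python
import numpy as np
import pandas as pd

# Stand-in for the first 20 rows of Clean, copied from the question.
sample = pd.DataFrame({
    'rx': [139.1] * 4 + [141.1] * 3 + [193.0] * 13,
    'LagBin': [np.nan, 0, 0, 0, np.nan, 10, 20, np.nan, 50, 20, 3600,
               50, 0, 20, 10, 110, 80, 460, 30, 0],
})

# Count rows per (rx, LagBin) pair and keep the keys as ordinary columns.
lag_count = (
    sample.groupby(['rx', 'LagBin'])
          .size()
          .reset_index(name='Count')
)

# One row per LagBin, one column per rx; absent combinations become 0.
wide = lag_count.pivot(index='LagBin', columns='rx', values='Count').fillna(0)
print(wide)
```

Because rx, LagBin and Count are plain columns, the reshape needs no index juggling; whether the wide layout is the right one depends on the plot being made.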

2 answers

Answer 1
1 2015-01-25 19:20:43

Answer 2
0 Accepted 2015-01-25 20:25:45
