
Counting grouped data with missing values in pandas dataframe

I am trying to do something like this, but on a much larger dataframe (called Clean):

import numpy as np
import pandas as pd

d = {'rx': [1, 1, 1, 1, 2.1, 2.1, 2.1, 2.1],
     'vals': [np.nan, 10, 10, 20, np.nan, 10, 20, 20]}
df = pd.DataFrame(d)


arrays = [df.rx,df.vals]                    
index = pd.MultiIndex.from_arrays(arrays, names = ['rx','vals'])           
df.index = index

Hist=df.groupby(level=('rx','vals'))
Hist.count('vals')

This seems to work just fine, but when I apply the same approach to even a subset of the Clean dataframe (substituting the column 'LagBin' for 'vals'), I get an error:

df1 = pd.DataFrame(data=Clean, columns=['rx', 'LagBin'])
df1=df1.head(n=20)

arrays = [df1.rx,df1.LagBin]                    
index = pd.MultiIndex.from_arrays(arrays, names = ['rx','LagBin'])            
df1.index = index

Hist=df1.groupby(level=('rx','LagBin'))
Hist.count('LagBin')

Specifically, Hist.count('LagBin') raises a ValueError:

ValueError: Cannot convert NA to integer

I have looked at the data structure, and it all seems exactly the same.

Here is the data that produces the error:

rx  LagBin  rx  LagBin
139.1  nan  139.1   
139.1  0    139.1   0
139.1  0    139.1   0
139.1  0    139.1   0
141.1  nan  141.1   
141.1  10   141.1   10
141.1  20   141.1   20
193    nan  193 
193    50   193     50
193    20   193     20
193    3600 193     3600
193    50   193     50
193    0    193     0
193    20   193     20
193    10   193     10
193    110  193     110
193    80   193     80
193    460  193     460
193    30   193     30
193    0    193     0

while the original routine that works produces this:

rx  vals    rx  vals
1   nan     1   
1   10      1   10
1   10      1   10 
1   20      1   20
2.1 nan     2.1 
2.1 10      2.1 10
2.1 20      2.1 20
2.1 20      2.1 20

What is different about these datasets that produces this error?

If I'm understanding your question correctly, I believe what you want is:

Hist.agg(len).dropna()

The full code implementation looks like this:

import numpy as np
import pandas as pd

d = {'rx': [139.1, 139.1, 139.1, 139.1, 141.1, 141.1, 141.1, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193, 193],
     'vals': [np.nan, 0, 0, 0, np.nan, 10, 20, np.nan, 50, 20, 3600, 50, 0, 20, 10, 110, 80, 460, 30, 0]}
df = pd.DataFrame(d)

arrays = [df.rx,df.vals]                    
index = pd.MultiIndex.from_arrays(arrays, names = ['rx','vals'])           
df.index = index

Hist=df.groupby(level=('rx','vals'))
print(Hist.agg(len).dropna())

Where df looks like:

             rx  vals
rx    vals             
139.1 NaN   139.1   NaN
      0     139.1     0
      0     139.1     0
      0     139.1     0
141.1 NaN   141.1   NaN
      10    141.1    10
      20    141.1    20
193.0 NaN   193.0   NaN
      50    193.0    50
      20    193.0    20
      3600  193.0  3600
      50    193.0    50
      0     193.0     0
      20    193.0    20
      10    193.0    10
      110   193.0   110
      80    193.0    80
      460   193.0   460
      30    193.0    30
      0     193.0     0

And the output of Hist.agg(len).dropna() looks like:

             rx  vals
rx    vals          
139.1 0      3     3
141.1 10     1     1
      20     1     1
193.0 0      2     2
      10     1     1
      20     2     2
      30     1     1
      50     2     2
      80     1     1
      110    1     1
      460    1     1
      3600   1     1
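For comparison, here is a minimal sketch of the same count done by grouping on the columns directly, without building a MultiIndex first. It is not the approach used above. It assumes a reasonably recent pandas, since the `dropna` argument to `groupby` was added in version 1.1. The names `counts` and `counts_with_nan` and the shortened sample data are illustrative only.

```python
import numpy as np
import pandas as pd

# Shortened sample in the same shape as the data above (illustrative only)
df = pd.DataFrame({
    "rx": [139.1, 139.1, 139.1, 139.1, 141.1, 141.1, 141.1],
    "vals": [np.nan, 0, 0, 0, np.nan, 10, 20],
})

# size() counts rows per (rx, vals) group; rows with a NaN key are dropped by default
counts = df.groupby(["rx", "vals"]).size()

# dropna=False keeps NaN as its own group key, so the missing rows are counted too
counts_with_nan = df.groupby(["rx", "vals"], dropna=False).size()

print(counts)
print(counts_with_nan)
```

With the default behaviour the NaN rows simply do not appear in the result, so no explicit `dropna()` call is needed afterwards.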

That looks right. I have been tinkering with groupby and came up with this solution, which seems more elegant and does not require explicitly dealing with the NaNs:

df1 = pd.DataFrame(data=Clean, columns=['rx', 'LagBin'])
df1 = df1.head(n=20)

# Assign the result so it can be printed below
LagCount = df1["rx"].groupby([df1["rx"], df1["LagBin"]]).count().reset_index(name="Count")
print(LagCount)

which gives me:

       rx  LagBin  Count
0   139.1       0      3
1   141.1      10      1
2   141.1      20      1
3   193.0       0      2
4   193.0      10      1
5   193.0      20      2
6   193.0      30      1
7   193.0      50      2
8   193.0      80      1
9   193.0     110      1
10  193.0     460      1
11  193.0    3600      1

I like this better because I retain the values as columns rather than index levels, which I assume will make plotting easier later.
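As a rough illustration of why keeping the counts in ordinary columns can help later, the sketch below reshapes a long-form count table into a wide one, which is a common starting point for plotting. The sample data is a shortened stand-in for Clean, and the names `lag_count` and `wide` are hypothetical, not from the original code.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the Clean subset (illustrative only)
df1 = pd.DataFrame({
    "rx": [139.1, 139.1, 139.1, 139.1, 141.1, 141.1, 141.1],
    "LagBin": [np.nan, 0, 0, 0, np.nan, 10, 20],
})

# Long form: one row per (rx, LagBin) pair, with the count as a regular column
lag_count = (
    df1.groupby(["rx", "LagBin"])["rx"]
    .count()
    .reset_index(name="Count")
)

# Wide form: one row per LagBin, one column per rx; absent pairs become 0
wide = (
    lag_count.pivot(index="LagBin", columns="rx", values="Count")
    .fillna(0)
    .astype(int)
)

print(lag_count)
print(wide)
```

Because `rx`, `LagBin`, and `Count` are plain columns in the long form, they can be passed straight to `pivot` without first resetting an index.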
