NaN values in the groupby statistics in pandas

I am working with a pandas DataFrame Top15 that contains population data for 15 countries around the world:

                      Population
Country                         
China               1.367645e+09
United States       3.176154e+08
Japan               1.274094e+08
United Kingdom      6.387097e+07
Russian Federation  1.435000e+08
Canada              3.523986e+07
Germany             8.036970e+07
India               1.276731e+09
France              6.383735e+07
South Korea         4.980543e+07
Italy               5.990826e+07
Spain               4.644340e+07
Iran                7.707563e+07
Australia           2.331602e+07
Brazil              2.059153e+08

Now I want to see the continent-wise statistics for these data. So I created a groupby object using a dictionary:

df = Top15.groupby(ContinentDict)

where:

ContinentDict = {'China':'Asia',
                 'United States':'North America',
                 'Japan':'Asia',
                 'United Kingdom':'Europe',
                 'Russian Federation':'Europe',
                 'Canada':'North America',
                 'Germany':'Europe',
                 'India':'Asia',
                 'France':'Europe',
                 'South Korea':'Asia',
                 'Italy':'Europe',
                 'Spain':'Europe',
                 'Iran':'Asia',
                 'Australia':'Australia',
                 'Brazil':'South America'}

and then I create a new DataFrame that will contain the various statistics:

new_df = pd.DataFrame({'size': df.size().values,
                       'sum': df.sum().values,
                       'mean': df.mean().values,
                       'std': df.std().values},
                      index=df.groups.keys())

I get the following output:

                       mean  size           std           sum
North America  5.797333e+08     5  6.790979e+08  2.898666e+09
Asia           2.331602e+07     1           NaN  2.331602e+07
South America  7.632161e+07     6  3.464767e+07  4.579297e+08
Europe         1.764276e+08     2  1.996696e+08  3.528552e+08
Australia      2.059153e+08     1           NaN  2.059153e+08

As you can see, there are two NaN values in the standard deviation column (for Asia and Australia).

After this, I tried looking at the individual values:

df.std()

and I get:

Asia             6.790979e+08
Australia                 NaN
Europe           3.464767e+07
North America    1.996696e+08
South America             NaN
Name: Population, dtype: float64

Now Asia is completely fine and South America is not! I do not have any NaN values in my original DataFrame. How does one explain this strange behavior, and how can I fix it?

That is not a good way to get groupby statistics. Just compute the statistics directly on the grouped object by passing a list of function names to agg:

>>> Top15.groupby(ContinentDict).Population.agg(['size', 'mean', 'std', 'sum'])
               size          mean           std           sum
Asia              5  5.797333e+08  6.790979e+08  2.898666e+09
Australia         1  2.331602e+07           NaN  2.331602e+07
Europe            6  7.632161e+07  3.464767e+07  4.579297e+08
North America     2  1.764276e+08  1.996697e+08  3.528553e+08
South America     1  2.059153e+08           NaN  2.059153e+08

(You can use strings because all the functions you're using are built-in pandas methods and so are special-cased. If you wanted to compute a custom function, you'd pass the actual function object.)
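For example, a minimal sketch of mixing strings with a custom callable in agg (the iqr helper here is hypothetical and not part of the original post; Top15 and ContinentDict are as defined in the question):

import numpy as np

# Hypothetical custom statistic: the interquartile range of each group's population.
def iqr(s):
    return np.percentile(s, 75) - np.percentile(s, 25)

# Built-in methods can be named as strings; the custom function is passed as an object.
Top15.groupby(ContinentDict).Population.agg(['size', 'mean', iqr])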

As for the NaNs, those occur where there was only one country on a given continent. The sample standard deviation of a single number is undefined (it divides by n - 1 = 0), and pandas uses the sample standard deviation by default. (You can get the population standard deviation by calling .std(ddof=0), which will give you zero in these cases.)
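A minimal sketch of that difference, using a hypothetical one-element Series rather than the original data:

import pandas as pd

s = pd.Series([2.331602e+07])   # a continent with only one country
print(s.std())                  # nan: sample std (ddof=1) divides by n - 1 = 0
print(s.std(ddof=0))            # 0.0: population std divides by n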

The reason you were seeing the NaNs in different places before is that you explicitly passed .groups.keys() as an index. .groups is just a dictionary, so its .keys() may come back in an arbitrary order. What happened is that the results you got from computing the mean, std, etc. were in a different order than the keys you got from the dict. There's no need to compute the various summary statistics separately as you were doing; you can do them all at once with .agg, and pandas will make sure everything is aligned for you.
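If you did want to assemble the frame by hand, a sketch of a safer construction (using the same Top15 and ContinentDict from the question) is to drop the .values calls and the explicit index, and let pandas align the per-continent Series on their shared index:

import pandas as pd

grouped = Top15.groupby(ContinentDict)['Population']
# Each aggregation returns a Series indexed by continent, so pandas aligns
# the columns for you; no .values and no explicit index are needed.
new_df = pd.DataFrame({'size': grouped.size(),
                       'sum': grouped.sum(),
                       'mean': grouped.mean(),
                       'std': grouped.std()})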
