Confusing behaviour of Pandas crosstab() function with dataframe containing NaN values

I'm using Python 3.4.1 with numpy 0.10.1 and pandas 0.17.0. I have a large dataframe that lists species and gender of individual animals. It's a real-world dataset and there are, inevitably, missing values represented by NaN. A simplified version of the data can be generated as:

import numpy as np
import pandas as pd
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                        'species': ["dog","dog",np.nan,"dog","dog","cat","cat","cat","dog","cat","cat","dog","dog","dog","dog",np.nan,"cat","cat","dog","dog"],
                        'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"]})

Printing the dataframe gives:

    gender  id species
0     male   1     dog
1   female   2     dog
2   female   3     NaN
3     male   4     dog
4     male   5     dog
5   female   6     cat
6   female   7     cat
7      NaN   8     cat
8     male   9     dog
9     male  10     cat
10  female  11     cat
11    male  12     dog
12  female  13     dog
13  female  14     dog
14    male  15     dog
15  female  16     NaN
16    male  17     cat
17  female  18     cat
18     NaN  19     dog
19    male  20     dog

I want to generate a cross-tabulated table showing the number of males and females in each species, using the following:

pd.crosstab(tempDF['species'],tempDF['gender'])

This produces the following table:

gender   female  male
species              
cat           4     2
dog           3     7

This is what I'd expect. However, if I include the margins=True option, it produces:

pd.crosstab(tempDF['species'],tempDF['gender'],margins=True)

gender   female  male  All
species                   
cat           4     2    7
dog           3     7   11
All           9     9   20

As you can see, the marginal totals appear to be incorrect, presumably because of the missing data in the dataframe. Is this intended behaviour? To my mind, it seems very confusing. Surely marginal totals should be totals of the rows and columns as they appear in the table, and should not include any missing data that isn't represented in the table. Including dropna=False does not affect the outcome.
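To illustrate what "totals of the rows and columns as they appear in the table" would look like, here is a minimal sketch that builds the table without margins and then adds the totals by hand. It reuses the sample data from the question (with id generated by range for brevity); naming the added row and column "All" is only a choice made here to mirror the crosstab output.

```python
import numpy as np
import pandas as pd

species = ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
           "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"]
gender = ["male", "female", "female", "male", "male", "female", "female", np.nan,
          "male", "male", "female", "male", "female", "female", "male", "female",
          "male", "female", np.nan, "male"]
tempDF = pd.DataFrame({"id": range(1, 21), "species": species, "gender": gender})

# Cross-tabulate without margins; rows with a NaN in either column are left out.
table = pd.crosstab(tempDF["species"], tempDF["gender"])

# Add the totals by hand so they match exactly what the table displays.
table["All"] = table.sum(axis=1)      # row totals
table.loc["All"] = table.sum(axis=0)  # column totals, including the grand total
print(table)
```

With this approach the margins can only ever reflect the cells that are shown, at the cost of two extra lines per table.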

I can delete any row with a NaN before creating the table, but that seems like a lot of extra work and one more thing to keep in mind when doing an analysis. Should I report this as a bug?
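For reference, deleting the incomplete rows first could look roughly like the sketch below. It uses DataFrame.dropna with the subset argument so that only NaNs in the two cross-tabulated columns cause a row to be dropped; the variable names clean and result are just illustrative.

```python
import numpy as np
import pandas as pd

species = ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
           "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"]
gender = ["male", "female", "female", "male", "male", "female", "female", np.nan,
          "male", "male", "female", "male", "female", "female", "male", "female",
          "male", "female", np.nan, "male"]
tempDF = pd.DataFrame({"id": range(1, 21), "species": species, "gender": gender})

# Keep only rows where both species and gender are present.
clean = tempDF.dropna(subset=["species", "gender"])

# With no NaNs left, the margins can only count rows that appear in the table.
result = pd.crosstab(clean["species"], clean["gender"], margins=True)
print(result)
```

Of the 20 sample rows, 4 have a NaN in one of the two columns, so the grand total here is 16 rather than 20.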

I suppose one workaround would be to convert the NaNs to 'missing' before creating the table, so that the cross-tabulation includes rows and columns specifically for missing values:

pd.crosstab(tempDF['species'].fillna('missing'),tempDF['gender'].fillna('missing'),margins=True)

gender   female  male  missing  All
species                            
cat           4     2        1    7
dog           3     7        1   11
missing       2     0        0    2
All           9     9        2   20

Personally, I would like to see that as the default behaviour, so I wouldn't have to remember to replace all the NaNs in every crosstab calculation.
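One way to avoid repeating the fillna calls is a small wrapper. The sketch below is only an illustration: crosstab_with_missing and its label parameter are hypothetical names, not part of pandas, and it assumes both inputs are Series.

```python
import numpy as np
import pandas as pd

def crosstab_with_missing(rows, cols, label="missing", **kwargs):
    """Hypothetical helper: label NaNs explicitly, then cross-tabulate."""
    return pd.crosstab(rows.fillna(label), cols.fillna(label), **kwargs)

species = ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
           "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"]
gender = ["male", "female", "female", "male", "male", "female", "female", np.nan,
          "male", "male", "female", "male", "female", "female", "male", "female",
          "male", "female", np.nan, "male"]
tempDF = pd.DataFrame({"id": range(1, 21), "species": species, "gender": gender})

# Same call shape as pd.crosstab, but missing values get their own row and column.
result = crosstab_with_missing(tempDF["species"], tempDF["gender"], margins=True)
print(result)
```

Extra keyword arguments such as margins=True are passed straight through to pd.crosstab, so the helper can be used wherever the plain call was.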

You're not the only one experiencing this. It happens not only with pd.crosstab, but also with pd.pivot_table and DataFrame.groupby.

The docs say this about groupby excluding NA values:

NA groups in GroupBy are automatically excluded. This behavior is consistent with R, for example.
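The exclusion described in that quote can be seen with the sample data from the question. The sketch below also uses the dropna=False argument of groupby, which is an assumption about the environment: it exists only in pandas 1.1 and later, not in the 0.17.0 release the question was asked about.

```python
import numpy as np
import pandas as pd

species = ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
           "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"]
gender = ["male", "female", "female", "male", "male", "female", "female", np.nan,
          "male", "male", "female", "male", "female", "female", "male", "female",
          "male", "female", np.nan, "male"]
tempDF = pd.DataFrame({"id": range(1, 21), "species": species, "gender": gender})

# Default behaviour: rows whose group key is NaN are silently excluded.
default_counts = tempDF.groupby("species").size()

# Assumes pandas >= 1.1: keep NaN as its own group instead of dropping it.
with_na_counts = tempDF.groupby("species", dropna=False).size()

print(default_counts)
print(with_na_counts)
```

The two rows with a missing species are absent from the first result and form a third group in the second.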

You can find some good solutions in this post: groupby columns with NaN (missing) values

Maybe one day someone will solve this issue: https://github.com/pandas-dev/pandas/issues/10772

You can set dropna=True and then the totals won't include the missing data. But if you do want to include the missing values, then the fillna option is best.
