pandas dropna not working as expected on finding mean

When I run the code below I get the error:

TypeError: 'NoneType' object has no attribute '__getitem__'

    import pyarrow 
    import pandas
    import pyarrow.parquet as pq

    df = pq.read_table("file.parquet").to_pandas()
    df = df.iloc[1:,:]
    df = df.dropna (how="any", inplace = True) # modifies it in place, creates new dataset without NAN

    average_age = df["_c2"].mean()
    print average_age

The dataframe looks like this:

         _c0     _c1  _c2    
    0  RecId   Class  Age   
    1      1      1st   29   
    2      2      1st   NA   
    3      3      1st   30  

If I print the df after calling the dropna method, I get 'None'.

Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?

As per OP's comment, the NA is a string rather than NaN, so dropna() will not remove it. One of many possible options for filtering out the string value 'NA' is:

df = df[df["_c2"] != "NA"]
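
A minimal sketch of the filter in context, assuming a small frame rebuilt by hand from the table in the question (the construction below is hypothetical, not the OP's parquet file). It also shows why the question's code fails: dropna(inplace=True) changes the frame and returns None, so assigning its result back to df replaces the frame with None.

```python
import pandas as pd

# Hypothetical reconstruction of the question's frame after df.iloc[1:, :].
df = pd.DataFrame(
    {
        "_c0": ["1", "2", "3"],
        "_c1": ["1st", "1st", "1st"],
        "_c2": ["29", "NA", "30"],
    },
    index=[1, 2, 3],
)

# With inplace=True, dropna mutates its frame and returns None.
# Assigning that return value to df is what leads to the NoneType error.
result = df.copy().dropna(how="any", inplace=True)
print(result)  # None

# Filter out the literal string "NA", then convert the remaining
# strings to integers so that mean() works on numbers.
cleaned = df[df["_c2"] != "NA"]
average_age = cleaned["_c2"].astype(int).mean()
print(average_age)  # 29.5
```

Either call df.dropna(how="any", inplace=True) without assigning the result, or write df = df.dropna(how="any") without inplace.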

A better option, suggested by @DJK in the comments, also catches inexact matches (e.g. with trailing spaces):

df = df[~df["_c2"].str.contains('NA')]

This one should remove any non-numeric strings rather than only 'NA':

df = df[df["_c2"].apply(lambda x: x.isnumeric())]
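
To compare the three filters above, here is a small sketch on a made-up column (the values "NA " and "unknown" are invented for illustration and do not come from the question). It uses the vectorized .str.isnumeric() accessor in place of the apply/lambda form, which should behave the same on a column of plain strings.

```python
import pandas as pd

# Hypothetical sample: an exact "NA", one with a trailing space, and another stray string.
ages = pd.Series(["29", "NA", "NA ", "30", "unknown"], name="_c2")

# Exact comparison: removes only the value that is exactly "NA".
exact = ages[ages != "NA"]

# Substring match: removes every value containing "NA", including "NA ".
contains = ages[~ages.str.contains("NA")]

# Numeric check: keeps only strings made entirely of numeric characters.
numeric = ages[ages.str.isnumeric()]

print(exact.tolist())     # ['29', 'NA ', '30', 'unknown']
print(contains.tolist())  # ['29', '30', 'unknown']
print(numeric.tolist())   # ['29', '30']
```

Only the last filter leaves a column that can be converted to numbers without further cleaning.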

This will work. If the NA in your df is a real NaN (np.nan), it will not affect the mean of the column; the mean only breaks when your NA is the string 'NA'. Coercing everything to numeric turns such strings into NaN:

df.apply(pd.to_numeric, errors='coerce', axis=1).describe()
Out[9]: 
       _c0  _c1        _c2
count  3.0  0.0   2.000000
mean   2.0  NaN  29.500000
std    1.0  NaN   0.707107
min    1.0  NaN  29.000000
25%    1.5  NaN  29.250000
50%    2.0  NaN  29.500000
75%    2.5  NaN  29.750000
max    3.0  NaN  30.000000

More info

df.apply(pd.to_numeric, errors='coerce', axis=1)  # every non-numeric value becomes NaN and will not affect the mean
Out[10]: 
   _c0  _c1   _c2
0  NaN  NaN   NaN
1  1.0  NaN  29.0
2  2.0  NaN   NaN
3  3.0  NaN  30.0
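
If only the age column matters, the same coercion can be applied to that single column. This is a sketch under the assumption that the frame matches the table in the question, header row included; the frame is rebuilt by hand here and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical reconstruction of the question's frame, including the header row at index 0.
df = pd.DataFrame(
    {
        "_c0": ["RecId", "1", "2", "3"],
        "_c1": ["Class", "1st", "1st", "1st"],
        "_c2": ["Age", "29", "NA", "30"],
    }
)

# Strings that cannot be parsed as numbers ("Age", "NA") become NaN.
ages = pd.to_numeric(df["_c2"], errors="coerce")

# mean() skips NaN by default, so no rows need to be dropped first.
average_age = ages.mean()
print(average_age)  # 29.5

# Optionally keep only the rows with a valid age. dropna returns a new frame here.
cleaned = df.assign(_c2=ages).dropna(subset=["_c2"])
print(len(cleaned))  # 2
```

Coercing one column leaves the text columns such as _c1 intact, whereas applying pd.to_numeric to the whole frame turns them entirely into NaN.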
