
Pandas read_csv(): keep 0 as 0 (do not convert it to NaN)

I am trying to read a CSV file, of which this is a sample:

datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:01:00,2,0,0,2.407,,,
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:02:50,2,0,0,2.207,-17.354,0,8192
2012-10-27 15:02:51,1,0,0,2.207,-17.358,0,-8152
2012-10-27 15:02:52,1,0,0,2.207,-17.358,0,648
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-10-27 15:06:04,0,51.195182,4.44427,2.344,-17.289,0,587
2012-12-05 09:21:34,,,,,42.960,1,16430
2012-12-05 09:21:35,,,,,42.962,1,3597

The problem I encounter is that in columns containing only ints, the 0's are converted to NaN (e.g. the columns 'check' and 'status'; these contain only ints, but they are read as floats because there are real missing values). I only want the empty values to be converted to NaN, not the zeros.

This is what I get:

>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358     NaN -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354     NaN   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358     NaN  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358     NaN    648
2012-10-27 15:06:03    NaN  51.195076  4.444407     2.349 -17.289     NaN   4909
2012-10-27 15:06:04    NaN  51.195182  4.444270     2.344 -17.289     NaN    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597

So, in the columns 'check' and 'status', there are too many NaN's. In the 'lat' and 'lon' columns the 0's are not converted to NaN's.

  • Using na_values='' and keep_default_na=False does not help. Is there a way to specify that int 0's should not be converted to NaN? Or is this a bug?

  • I could specify the dtype of the specific columns as int with the dtype keyword. This keeps the 0's as 0's, but those columns also contain real NaN's (empty values). In that case these values are also converted to 0's, because an int column cannot hold NaN. For this reason, I have to keep all columns as floats.
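
For readers on a much newer pandas than the 0.10.x discussed in this thread, one possible way around the "an int column cannot hold NaN" limitation from the second bullet is the nullable integer dtype "Int64". This is only a sketch and not part of the original question: it assumes a recent pandas release, and the shortened csv_text is an illustrative stand-in for the sample above.

```python
import io

import pandas as pd

# Illustrative, trimmed-down version of the sample data (an assumption,
# not the full file from the question).
csv_text = """datetime,check,status
2012-10-27 15:02:49,2,0
2012-10-27 15:06:03,0,0
2012-12-05 09:21:34,,1
"""

# "Int64" (capital I) is the nullable integer dtype: it can store
# missing values next to real integers, so 0 stays 0 and empty stays missing.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=True,
    dtype={"check": "Int64", "status": "Int64"},
)

print(df.dtypes)
print(df)
```

With this dtype the empty field in 'check' is shown as a missing value rather than being forced to 0, and the column does not need to be stored as float.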


EDIT: after upgrading to pandas 0.10.1, it works as expected even without specifying keep_default_na and na_values:

>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358       0 -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354       0   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358       0  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358       0    648
2012-10-27 15:06:03      0  51.195076  4.444407     2.349 -17.289       0   4909
2012-10-27 15:06:04      0  51.195182  4.444270     2.344 -17.289       0    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597

You first have to set keep_default_na to False:

df = pd.read_clipboard(sep=',', index_col=0, keep_default_na=False, na_values='')

In [2]: df
Out[2]: 
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358       0 -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354       0   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358       0  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358       0    648
2012-10-27 15:06:03      0  51.195076  4.444407     2.349 -17.289       0   4909
2012-10-27 15:06:04      0  51.195182  4.444270     2.344 -17.289       0    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597
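
Since read_clipboard depends on whatever is currently on the clipboard, the same idea can be tried in a reproducible way with read_csv and an in-memory string. The following is a minimal sketch under that assumption; csv_text is a shortened copy of the sample from the question, and the behaviour may differ on very old pandas versions such as the one the question was written for.

```python
import io

import pandas as pd

# Shortened copy of the sample data from the question.
csv_text = """datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-12-05 09:21:34,,,,,42.960,1,16430
"""

# keep_default_na=False drops the built-in list of NA strings;
# na_values=[""] then makes the empty field the only NA marker.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=True,
    keep_default_na=False,
    na_values=[""],
)

print(df[["check", "status"]])
# Count missing values per column: only the empty fields should be counted.
print(df[["check", "status"]].isna().sum())
```

The 'check' and 'status' columns are still stored as floats because they contain missing values, but the zeros remain 0.0 and only the empty fields become NaN.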

From the doc-string of read_table:

keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they're appended to

na_values : list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
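
To illustrate the "If dict passed" part of the doc-string, here is a small sketch of per-column NA values. The data and the use of -1 as a sentinel are made up for the example and do not come from the question; keep_default_na is left at its default, so the dict entries are added on top of the default NA strings.

```python
import io

import pandas as pd

# Hypothetical data: -1 marks "missing" in 'check' but is a real value in 'status'.
csv_text = """name,check,status
a,0,0
b,-1,0
c,,-1
"""

# A dict maps column names to the extra strings treated as NA in that column only.
df = pd.read_csv(io.StringIO(csv_text), na_values={"check": ["-1"]})

print(df)
```

Here the -1 and the empty field in 'check' are both read as NaN, while the -1 in 'status' is kept as an ordinary integer and the zeros in both columns are untouched.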
