Pandas read_csv(): keep 0 as 0 (do not convert it to NaN)
I am trying to read a CSV file; here is a sample:
datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:01:00,2,0,0,2.407,,,
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:02:50,2,0,0,2.207,-17.354,0,8192
2012-10-27 15:02:51,1,0,0,2.207,-17.358,0,-8152
2012-10-27 15:02:52,1,0,0,2.207,-17.358,0,648
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-10-27 15:06:04,0,51.195182,4.44427,2.344,-17.289,0,587
2012-12-05 09:21:34,,,,,42.960,1,16430
2012-12-05 09:21:35,,,,,42.962,1,3597
The problem I encounter is that in columns containing only ints, the 0's are converted to NaN (e.g. the columns 'check' and 'status'; these contain only ints, but they are read as floats because they also have real missing values). I only want the empty values to be converted to NaN, not the zeros.
This is what I get:
>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
check lat lon co_alpha atn status bc
datetime
2012-10-27 15:00:59 2 0.000000 0.000000 2.427 NaN NaN NaN
2012-10-27 15:01:00 2 0.000000 0.000000 2.407 NaN NaN NaN
2012-10-27 15:02:49 2 0.000000 0.000000 2.207 -17.358 NaN -16162
2012-10-27 15:02:50 2 0.000000 0.000000 2.207 -17.354 NaN 8192
2012-10-27 15:02:51 1 0.000000 0.000000 2.207 -17.358 NaN -8152
2012-10-27 15:02:52 1 0.000000 0.000000 2.207 -17.358 NaN 648
2012-10-27 15:06:03 NaN 51.195076 4.444407 2.349 -17.289 NaN 4909
2012-10-27 15:06:04 NaN 51.195182 4.444270 2.344 -17.289 NaN 587
2012-12-05 09:21:34 NaN NaN NaN NaN 42.960 1 16430
2012-12-05 09:21:35 NaN NaN NaN NaN 42.962 1 3597
So in the columns 'check' and 'status' there are too many NaN's. In the 'lat' and 'lon' columns the 0's are not converted to NaN's.
Using na_values='' and keep_default_na=False does not help. Is there a way to specify that int 0's should not be converted to NaN? Or is this a bug?
I could specify the dtype of the specific columns as int with the dtype keyword. This keeps the 0's as 0's, but those columns also contain real NaN's (empty values), and these are then converted to 0's as well, because an int column cannot hold NaN's. For this reason, I have to keep all columns as floats.
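For readers on a recent pandas (1.0 or later is assumed here, which is much newer than the version discussed in this question), one possible way around the int-versus-NaN conflict is the nullable integer dtype "Int64". The sketch below is not part of the original post; it reads three rows of the sample from a string instead of the clipboard and only illustrates the idea.

```python
import io

import pandas as pd

# Three rows taken from the sample above, read from a string instead of the clipboard.
csv_text = """datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-12-05 09:21:34,,,,,42.960,1,16430
"""

# "Int64" (capital I) is the nullable integer dtype: it can hold
# integers and missing values in the same column.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=True,
    dtype={"check": "Int64", "status": "Int64"},
)

print(df[["check", "status"]])
print(df.dtypes)
```

With this dtype the zeros stay integer zeros and the empty fields show up as missing values rather than 0, so the columns do not have to fall back to float.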
EDIT: after upgrading to pandas 0.10.1, it works as expected, even without specifying keep_default_na and na_values:
>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
check lat lon co_alpha atn status bc
datetime
2012-10-27 15:00:59 2 0.000000 0.000000 2.427 NaN NaN NaN
2012-10-27 15:01:00 2 0.000000 0.000000 2.407 NaN NaN NaN
2012-10-27 15:02:49 2 0.000000 0.000000 2.207 -17.358 0 -16162
2012-10-27 15:02:50 2 0.000000 0.000000 2.207 -17.354 0 8192
2012-10-27 15:02:51 1 0.000000 0.000000 2.207 -17.358 0 -8152
2012-10-27 15:02:52 1 0.000000 0.000000 2.207 -17.358 0 648
2012-10-27 15:06:03 0 51.195076 4.444407 2.349 -17.289 0 4909
2012-10-27 15:06:04 0 51.195182 4.444270 2.344 -17.289 0 587
2012-12-05 09:21:34 NaN NaN NaN NaN 42.960 1 16430
2012-12-05 09:21:35 NaN NaN NaN NaN 42.962 1 3597
You first have to set keep_default_na to False:
df = pd.read_clipboard(sep=',', index_col=0, keep_default_na=False, na_values='')
In [2]: df
Out[2]:
check lat lon co_alpha atn status bc
datetime
2012-10-27 15:00:59 2 0.000000 0.000000 2.427 NaN NaN NaN
2012-10-27 15:01:00 2 0.000000 0.000000 2.407 NaN NaN NaN
2012-10-27 15:02:49 2 0.000000 0.000000 2.207 -17.358 0 -16162
2012-10-27 15:02:50 2 0.000000 0.000000 2.207 -17.354 0 8192
2012-10-27 15:02:51 1 0.000000 0.000000 2.207 -17.358 0 -8152
2012-10-27 15:02:52 1 0.000000 0.000000 2.207 -17.358 0 648
2012-10-27 15:06:03 0 51.195076 4.444407 2.349 -17.289 0 4909
2012-10-27 15:06:04 0 51.195182 4.444270 2.344 -17.289 0 587
2012-12-05 09:21:34 NaN NaN NaN NaN 42.960 1 16430
2012-12-05 09:21:35 NaN NaN NaN NaN 42.962 1 3597
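The call above depends on the clipboard contents. As a self-contained sketch (the use of read_csv with io.StringIO is an assumption made for reproducibility, not what the answer ran), the same two keywords can be checked against a few rows of the sample:

```python
import io

import pandas as pd

# Four rows from the question's sample, including rows with empty fields.
csv_text = """datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-12-05 09:21:34,,,,,42.960,1,16430
"""

# Only the empty string is treated as missing; the default NA strings are disabled.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    keep_default_na=False,
    na_values=[""],
)

print(df[["check", "status"]])
```

The 'check' and 'status' columns are still floats because they contain missing values, but the zeros should come through as 0.0 rather than NaN.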
From the docstring of read_table:
keep_default_na : bool, default True
    If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to
na_values : list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
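The dict form of na_values mentioned in the docstring can be illustrated with a small sketch. The data below is hypothetical (it is not the question's file): -1 is assumed to be a "missing" sentinel in the 'status' column only, while it is a legitimate value in 'bc'.

```python
import io

import pandas as pd

# Hypothetical data: -1 means "missing" in 'status' but is a real value in 'bc'.
csv_text = """check,status,bc
2,-1,-1
0,0,648
,1,-1
"""

# The dict maps a column name to the extra strings treated as NA in that column.
# keep_default_na is left at its default (True), so empty fields are still NA everywhere.
df = pd.read_csv(io.StringIO(csv_text), na_values={"status": ["-1"]})

print(df)
```

Here the -1 in 'status' becomes NaN, the -1 values in 'bc' are kept, and the zeros in 'check' and 'status' are left alone.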