Prevent pandas from automatically inferring type in read_csv
I have a #-separated file with three columns: the first is an integer, the second looks like a float but isn't, and the third is a string. I attempt to load this directly into Python with pandas.read_csv:
In [149]: d = pandas.read_csv('resources/names/fos_names.csv', sep='#', header=None, names=['int_field', 'floatlike_field', 'str_field'])
In [150]: d
Out[150]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1673 entries, 0 to 1672
Data columns:
int_field 1673 non-null values
floatlike_field 1673 non-null values
str_field 1673 non-null values
dtypes: float64(1), int64(1), object(1)
pandas tries to be smart and automatically converts fields to a useful type. The issue is that I don't actually want it to do so (if I did, I'd have used the converters argument). How can I prevent pandas from converting types automatically?
I'm planning to add explicit column dtypes in the upcoming file parser engine overhaul in pandas 0.10. I can't commit myself 100% to it, but it should be pretty simple with the new infrastructure coming together (http://wesmckinney.com/blog/?p=543).
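For readers on a later pandas release: read_csv does accept a dtype argument, so a column can be pinned to str by name. The following is only a minimal sketch under that assumption; the inline sample data is hypothetical and merely stands in for the #-separated file from the question.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the '#'-separated file in the question.
raw = "1#2.310#one\n2#3.120#two\n3#1.320#three\n"

columns = ["int_field", "floatlike_field", "str_field"]

# dtype maps a column name to the type it should be read as;
# columns that are not listed are still inferred as usual.
df = pd.read_csv(
    io.StringIO(raw),
    sep="#",
    header=None,
    names=columns,
    dtype={"floatlike_field": str},
)

# The float-like column keeps its original text, trailing zeros included.
floatlike_values = df["floatlike_field"].tolist()
print(df.dtypes)
print(floatlike_values)
```

Passing dtype=str instead of a dict should read every column as text, which may be the simplest choice when no inference is wanted at all.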
I think your best bet is to read the data in as a record array first using numpy.
# what you described:
In [15]: import numpy as np
In [16]: import pandas
In [17]: x = pandas.read_csv('weird.csv')
In [19]: x.dtypes
Out[19]:
int_field int64
floatlike_field float64 # what you don't want?
str_field object
In [20]: datatypes = [('int_field','i4'),('floatlike','S10'),('strfield','S10')]
In [21]: y_np = np.loadtxt('weird.csv', dtype=datatypes, delimiter=',', skiprows=1)
In [22]: y_np
Out[22]:
array([(1, '2.31', 'one'), (2, '3.12', 'two'), (3, '1.32', 'three ')],
dtype=[('int_field', '<i4'), ('floatlike', '|S10'), ('strfield', '|S10')])
In [23]: y_pandas = pandas.DataFrame.from_records(y_np)
In [25]: y_pandas.dtypes
Out[25]:
int_field int64
floatlike object # better?
strfield object
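The session above can be condensed into a self-contained sketch. It assumes Python 3, so the string fields use 'U10' (unicode) rather than 'S10' (which would give bytes there), and it reads from an in-memory buffer with hypothetical contents instead of weird.csv.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical contents standing in for weird.csv (comma-separated, one header row).
raw = "int_field,floatlike_field,str_field\n1,2.31,one\n2,3.12,two\n3,1.32,three\n"

# Explicit per-field types: a 4-byte int and two fixed-width unicode strings.
datatypes = [("int_field", "i4"), ("floatlike", "U10"), ("strfield", "U10")]

# loadtxt applies the declared types, so nothing is inferred from the values.
y_np = np.loadtxt(io.StringIO(raw), dtype=datatypes, delimiter=",", skiprows=1)

# The structured array's field names become the DataFrame's column names.
y_pandas = pd.DataFrame.from_records(y_np)

floatlike_values = y_pandas["floatlike"].tolist()
print(y_pandas.dtypes)
print(floatlike_values)
```

One caveat for the #-separated file in the question: np.loadtxt treats '#' as a comment marker by default, so that case would likely need delimiter='#' together with comments=None.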