Prevent pandas from automatically inferring type in read_csv

I have a #-separated file with three columns: the first is an integer, the second looks like a float but isn't, and the third is a string. I attempt to load this directly into Python with pandas.read_csv:

In [149]: d = pandas.read_csv('resources/names/fos_names.csv',  sep='#', header=None, names=['int_field', 'floatlike_field', 'str_field'])

In [150]: d
Out[150]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1673 entries, 0 to 1672
Data columns:
int_field          1673  non-null values
floatlike_field    1673  non-null values
str_field          1673  non-null values
dtypes: float64(1), int64(1), object(1)

pandas tries to be smart and automatically converts fields to a useful type. The issue is that I don't actually want it to do so (if I did, I'd have used the converters argument). How can I prevent pandas from converting types automatically?

I'm planning to add explicit column dtypes in the upcoming file parser engine overhaul in pandas 0.10. I can't commit to it 100%, but it should be pretty simple with the new infrastructure coming together (http://wesmckinney.com/blog/?p=543).
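For readers on a later pandas release: read_csv does accept a dtype argument, so the inference can be switched off per column or for the whole file. The following is a minimal sketch, not the asker's actual data. The inline sample rows and the variable names are assumptions made up for illustration; only the column names come from the question.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the '#'-separated file in the question.
raw = "1#2.310#one\n2#3.120#two\n3#1.320#three\n"

columns = ["int_field", "floatlike_field", "str_field"]

# dtype maps column names to types; str keeps the original text of that
# column, while the remaining columns are still inferred as usual.
df = pd.read_csv(
    io.StringIO(raw),
    sep="#",
    header=None,
    names=columns,
    dtype={"floatlike_field": str},
)

# Passing a single type as dtype applies it to every column,
# which turns inference off for the whole file.
df_all_str = pd.read_csv(
    io.StringIO(raw),
    sep="#",
    header=None,
    names=columns,
    dtype=str,
)

print(df["floatlike_field"].tolist())
print(df_all_str["int_field"].tolist())
```

Reading the column as str preserves details that a float conversion would lose, such as the trailing zero in "2.310".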

I think your best bet is to read the data in as a record array first using numpy, and then build the DataFrame from that.

# what you described:
In [15]: import numpy as np
In [16]: import pandas
In [17]: x = pandas.read_csv('weird.csv')

In [19]: x.dtypes
Out[19]: 
int_field            int64
floatlike_field    float64  # what you don't want?
str_field           object

In [20]: datatypes = [('int_field','i4'),('floatlike','S10'),('strfield','S10')]

In [21]: y_np = np.loadtxt('weird.csv', dtype=datatypes, delimiter=',', skiprows=1)

In [22]: y_np
Out[22]: 
array([(1, '2.31', 'one'), (2, '3.12', 'two'), (3, '1.32', 'three ')], 
      dtype=[('int_field', '<i4'), ('floatlike', '|S10'), ('strfield', '|S10')])

In [23]: y_pandas = pandas.DataFrame.from_records(y_np)

In [25]: y_pandas.dtypes
Out[25]: 
int_field     int64
floatlike    object  # better?
strfield     object
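The session above can be condensed into a self-contained script. This is only a sketch under a few assumptions: the CSV content is invented to match the output shown (the real weird.csv is not available), it is read from an in-memory buffer instead of a file, and the string fields use the 'U10' (unicode) dtype rather than 'S10' so that Python 3 yields str values instead of bytes.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for weird.csv: a header row followed by three records.
raw = "int_field,floatlike,strfield\n1,2.31,one\n2,3.12,two\n3,1.32,three\n"

# Structured dtype: a 32-bit integer and two fixed-width unicode strings.
datatypes = [("int_field", "i4"), ("floatlike", "U10"), ("strfield", "U10")]

# skiprows=1 skips the header line; each field is parsed according to datatypes,
# so the float-like column is never converted to a number.
y_np = np.loadtxt(io.StringIO(raw), dtype=datatypes, delimiter=",", skiprows=1)

# Build the DataFrame from the record array; field names become column names.
y_pandas = pd.DataFrame.from_records(y_np)

print(y_pandas["floatlike"].tolist())
```

Note that fixed-width fields such as 'U10' silently truncate longer values, so the width has to be chosen to fit the data.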
