How do I specify dtype when loading data from a CSV file with pandas.read_csv?

Question

I have some text files in the following format:

000423|东阿阿胶|     300|1|0.15000|            |
000425|徐工机械|     600|1|0.15000|            |
000503|海虹控股|     400|1|0.15000|            |
000522|白云山Ａ|        |2|       |    1982.080|
000527|美的电器|     900|1|0.15000|            |
000528|柳    工|     300|1|0.15000|            |

When I use read_csv to load them into a DataFrame, it doesn't infer the correct dtype for some columns. For example, the first column is parsed as int rather than unicode str, and the third column is parsed as unicode str rather than int because one value is missing. Is there a way to preset the dtypes of the DataFrame, as numpy.genfromtxt does?

Update: I used read_csv like this, which caused the problem:

data = pandas.read_csv(StringIO(etf_info), sep='|', skiprows=14, index_col=0, 
                       skip_footer=1, names=['ticker', 'name', 'vol', 'sign', 
                       'ratio', 'cash', 'price'], encoding='gbk')

To solve both the dtype and the encoding problems, I have to use unicode() and numpy.genfromtxt first:

etf_info = unicode(urllib2.urlopen(etf_url).read(), 'gbk')
nd_data = np.genfromtxt(StringIO(etf_info), delimiter='|', 
                        skiprows=14, skip_footer=1, dtype=ETF_DTYPE)
data = pandas.DataFrame(nd_data, index=nd_data['ticker'],
                        columns=['name', 'vol', 'sign', 
                                 'ratio', 'cash', 'price'])

It would be nice if read_csv could add dtype and usecols settings. Sorry for my greed. ^_^

Answer 1

Simply put: no, not yet. More work (read: more active developers) is needed in this particular area. If you could post how you're using read_csv, it might help. I suspect that the whitespace between the bars may be the problem.

EDIT: this is now obsolete. This behavior is covered natively by read_csv.
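To illustrate the whitespace point, here is a minimal sketch, assuming a reasonably recent pandas and Python 3. It uses two rows copied from the question and the `skipinitialspace` option of read_csv to drop the padding after each bar. The variable names (`sample`, `names`, `padded`) are only for illustration.

```python
from io import StringIO

import pandas as pd

# Two rows in the question's format: values are padded with spaces,
# and the second row has a blank 'vol' field.
sample = (
    "000423|东阿阿胶|     300|1|0.15000|            |\n"
    "000522|白云山Ａ|        |2|       |    1982.080|\n"
)

# Column names taken from the question's read_csv call.
names = ["ticker", "name", "vol", "sign", "ratio", "cash", "price"]

padded = pd.read_csv(
    StringIO(sample),
    sep="|",
    header=None,
    names=names,
    skipinitialspace=True,  # drop the spaces that follow each '|'
)

# The blank 'vol' field is now an empty string, which read_csv treats as
# missing, so the column is inferred as float rather than as strings.
print(padded["vol"].tolist())

# Without an explicit dtype the ticker is still inferred as an integer,
# so "000423" loses its leading zeros.
print(padded["ticker"].tolist())
```

This only handles the padding; keeping the ticker as text still requires an explicit dtype.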

Answer 2

You can now use dtype in read_csv.
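A minimal sketch of what that can look like for the data in the question, assuming Python 3 and a pandas version that supports the nullable `"Int64"` dtype (0.24 or later). The choice of `"Int64"` for `vol`, the use of `skipinitialspace`, and the variable names are assumptions for illustration, not part of the original answer.

```python
from io import StringIO

import pandas as pd

# Three rows copied from the question; the second has no 'vol' value.
sample = (
    "000423|东阿阿胶|     300|1|0.15000|            |\n"
    "000522|白云山Ａ|        |2|       |    1982.080|\n"
    "000527|美的电器|     900|1|0.15000|            |\n"
)

# Column names taken from the question's read_csv call.
names = ["ticker", "name", "vol", "sign", "ratio", "cash", "price"]

df = pd.read_csv(
    StringIO(sample),
    sep="|",
    header=None,
    names=names,
    # Skip the always-empty trailing 'price' column.
    usecols=["ticker", "name", "vol", "sign", "ratio", "cash"],
    skipinitialspace=True,  # strip the padding after each '|'
    dtype={
        "ticker": str,   # keep leading zeros such as "000423"
        "vol": "Int64",  # nullable integer: tolerates the missing value
    },
)

print(df["ticker"].tolist())
print(df["vol"].dtype)
```

Columns not listed in `dtype` are still inferred, so `ratio` and `cash` come out as floats with NaN where the field was blank.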

PS: Kudos to Wes McKinney for answering; it feels quite awkward to contradict the "past Wes".

Answer 1
4, accepted, 2012-03-15 00:13:16

Answer 2
1, 2017-01-28 16:30:05
