
Python genfromtxt multiple datatypes

I would like to read in a csv file using genfromtxt. I have six columns that are float, and one column that is a string.

How do I set the datatype so that the float columns will be read in as floats and the string column will be read in as strings? I tried dtype='void' but that is not working.

Suggestions?

Thanks

.csv file

999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3



import sys
import numpy as np

x = sys.argv[1]
f = open(x, 'r')
y = np.genfromtxt(f, delimiter=',', dtype=[
    ('f0', '<f8'), ('f1', 'S4'), ('f2', '<f8'), ('f3', '<f8'),
    ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8')])

ionenergy = y[:,0]
units = y[:,1]

Error:

ionenergy = y[:,0]
IndexError: invalid index

I don't get this error when I specify a single data type.

dtype=None tells genfromtxt to guess the appropriate dtype.

From the docs:

dtype: dtype, optional

Data type of the resulting array. If None, the dtypes will be determined by the contents of each column, individually.

(my emphasis.)


Since your data is comma-separated, be sure to include delimiter=',' or else np.genfromtxt will interpret each column (except the last) as including a string character (the comma) and therefore mistakenly assign a string dtype to each of those columns.

For example:

import numpy as np

arr = np.genfromtxt('data', dtype=None, delimiter=',')

print(arr.dtype)
# [('f0', '<f8'), ('f1', 'S4'), ('f2', '<i4'), ('f3', '<f8'), ('f4', '<f8')]
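For contrast, here is a small sketch of what happens when delimiter is left out. It assumes the three sample rows from the question, fed through io.StringIO instead of a real file, and passes encoding='utf-8' so that string columns come back as unicode; the kinds helper is only for illustration.

```python
import io
import numpy as np

# The three sample rows from the question; StringIO stands in for the file.
rows = (
    "999.9, abc, 34, 78, 12.3\n"
    "1.3, ghf, 12, 8.4, 23.7\n"
    "101.7, evf, 89, 2.4, 11.3\n"
)

# With the delimiter, each column's type is inferred from its own contents.
with_delim = np.genfromtxt(io.StringIO(rows), dtype=None,
                           delimiter=',', encoding='utf-8')

# Without it, fields are split on whitespace and keep their trailing commas.
no_delim = np.genfromtxt(io.StringIO(rows), dtype=None, encoding='utf-8')


def kinds(arr):
    """Return the dtype kind of each field ('f' float, 'i' int, 'U' string)."""
    return [arr.dtype[name].kind for name in arr.dtype.names]


print(kinds(with_delim))
print(kinds(no_delim))
```

In the second call only the last column is still numeric, because it is the only one without a comma attached to its values.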

This shows the names and dtypes of each column. For example, ('f2', '<i4') means the third column has name 'f2' and is of dtype '<i4'. The i means it is an integer dtype. If you need the third column to be a float dtype then there are a few options.

  1. You could manually edit the data by adding a decimal point in the third column to force genfromtxt to interpret values in that column to be of a float dtype.
  2. You could supply the dtype explicitly in the call to genfromtxt

     arr = np.genfromtxt('data', delimiter=',', dtype=[('f0', '<f8'), ('f1', 'S4'), ('f2', '<f4'), ('f3', '<f8'), ('f4', '<f8')])

print(arr)
# [(999.9, ' abc', 34.0, 78.0, 12.3) (1.3, ' ghf', 12.0, 8.4, 23.7)
#  (101.7, ' evf', 89.0, 2.4, 11.3)]

print(arr['f2'])
# [ 34.  12.  89.]

The error message IndexError: invalid index is being generated by the line

ionenergy = y[:,0]

When you have mixed dtypes, np.genfromtxt returns a structured array. You need to read up on structured arrays because the syntax for accessing columns differs from the syntax used for plain arrays of homogeneous dtype.

Instead of y[:, 0], to access the first column of the structured array y, use

y['f0']

Or, better yet, supply the names parameter in np.genfromtxt, so you can use a more relevant column name, like y['ionenergy']:

import numpy as np
arr = np.genfromtxt(
    'data', delimiter=',', dtype=None,
    names=['ionenergy', 'foo', 'bar', 'baz', 'quux', 'corge'])

print(arr['ionenergy'])
# [ 999.9    1.3  101.7]
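To make the indexing difference concrete, here is a self-contained sketch using the sample rows from the question. The field names c2, c3 and c4 are made up for illustration, as are the choices of 'U4' with autostrip=True and encoding='utf-8' for the string column. It also shows one possible way to get back an ordinary 2-D array of the float columns if y[:, 0]-style indexing is really what you want.

```python
import io
import numpy as np

# The three sample rows from the question; StringIO stands in for the file.
data = io.StringIO(
    "999.9, abc, 34, 78, 12.3\n"
    "1.3, ghf, 12, 8.4, 23.7\n"
    "101.7, evf, 89, 2.4, 11.3\n"
)

# Hypothetical field names; only 'ionenergy' and 'units' echo the question.
dtype = [('ionenergy', 'f8'), ('units', 'U4'),
         ('c2', 'f8'), ('c3', 'f8'), ('c4', 'f8')]
y = np.genfromtxt(data, delimiter=',', dtype=dtype,
                  autostrip=True, encoding='utf-8')

# A structured array is 1-D (one record per row), so y[:, 0] cannot work.
try:
    y[:, 0]
    raised = False
except IndexError:
    raised = True

ionenergy = y['ionenergy']   # a column, selected by field name
units = y['units']
first_row = y[0]             # a single record, selected by position

# Stack the float fields into a plain 2-D array; now y2d[:, 0] is valid.
float_fields = ['ionenergy', 'c2', 'c3', 'c4']
y2d = np.column_stack([y[name] for name in float_fields])

print(y.shape, y2d.shape)
print(ionenergy, units)
```

Selecting by field name keeps the string column alongside the numbers; the stacked copy drops it, which is the price of homogeneous indexing.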

Please try this:

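import pandas as pd

# .iloc is a pandas indexer, so load the file as a DataFrame first
y = pd.read_csv(x, header=None)
ionenergy = y.iloc[:, 0]
units = y.iloc[:, 1]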
