
Reading .dat files with pandas by format string

Reading a fixed-width .dat file in pandas is not very complicated using pd.read_csv('file.dat', sep='\\s+') or pd.read_fwf('file.dat', widths=[7, ..]). But the file also comes with a format string like this:

Format = (i7,1x,i7,1x,i2,1x,i2,1x,i2,1x,f5.1,1x,i4,1x,3i,1x,f4.1,1x,i1,1x,f4.1,1x,i3,1x,i4,1x,i4,1x,i3,1x,i4,2x,i1)

Looking at the column contents, I assume the character indicates the data type (i -> int, f -> float, x -> separator) and the number is the width of the column. Is this a standard notation? Is there a more Pythonic way to read data files by just passing this format string, so that scripts stay safe against format changes in the data file?

I noticed the colspecs argument of the read_fwf() function, but it takes a list of (int, int) pairs, not the kind of format string that is given. First rows of the data file:


This is a fairly standard way to indicate a format. The notation looks like a Fortran-style format specification (iN for an integer of width N, fW.D for a float of width W with D decimals, Nx for N skipped characters) rather than the C printf convention. The format is only really important if you are trying to write the file in an identical manner. For the purpose of reading it all into pandas, you don't really care. If you want control over the specific data type of each column as you read it in, use the dtype parameter. In the example below, column 'a' is read as a 64-bit float and 'b' as a 32-bit int.

import numpy as np
import pandas as pd
my_dtypes = {'a': np.float64, 'b': np.int32}  # only the listed columns are forced
pd.read_csv('file.dat', sep=r'\s+', dtype=my_dtypes)

You don't have to specify every column, just the ones that you want. pandas has likely inferred most of this by default already. After your call to read_csv(), try

df = pd.read_csv(....)
print(df.dtypes)

This will show you the data type of each of your columns.
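For the original goal of passing the format string itself, one option is to translate it into the (start, end) pairs that read_fwf() accepts through colspecs. The following is a minimal sketch, not a pandas feature: parse_format and read_formatted are hypothetical helper names, and it assumes only i, f and x descriptors with an optional repeat count and a file without a header row. The format string and sample rows are shortened, made-up stand-ins for the real file.

```python
import io
import re

import pandas as pd

# One descriptor: optional repeat count, a letter, optional width, optional ".decimals".
TOKEN = re.compile(r"^(\d*)([ifx])(\d*)(?:\.\d+)?$", re.IGNORECASE)


def parse_format(fmt):
    """Translate a Fortran-style format string into (colspecs, kinds).

    colspecs is a list of half-open (start, end) character intervals,
    kinds is a parallel list of "i" or "f" for each data column.
    """
    colspecs, kinds = [], []
    pos = 0
    for token in fmt.strip().strip("()").split(","):
        match = TOKEN.match(token.strip())
        if match is None:
            raise ValueError(f"unsupported descriptor: {token!r}")
        repeat, kind, width = match.group(1), match.group(2).lower(), match.group(3)
        if kind == "x":
            # "2x" skips two characters; a bare "x" is treated as one.
            pos += int(repeat or 1)
            continue
        if not width:
            raise ValueError(f"descriptor without a width: {token!r}")
        for _ in range(int(repeat or 1)):
            colspecs.append((pos, pos + int(width)))
            kinds.append(kind)
            pos += int(width)
    return colspecs, kinds


def read_formatted(source, fmt):
    """Read a headerless fixed-width file laid out according to fmt."""
    colspecs, kinds = parse_format(fmt)
    df = pd.read_fwf(source, colspecs=colspecs, header=None)
    # Force "f" columns to float; integer columns are left to pandas' inference.
    return df.astype({i: "float64" for i, k in enumerate(kinds) if k == "f"})


# Hypothetical format and rows, written with matching Python format specs.
fmt = "(i7,1x,i2,1x,f5.1,2x,i1)"
sample = (
    f"{1001:7d} {12:2d} {23.5:5.1f}  {1:1d}\n"
    f"{1234567:7d} {5:2d} {-10.0:5.1f}  {0:1d}\n"
)
df = read_formatted(io.StringIO(sample), fmt)
print(df.dtypes)
```

A descriptor without a width, such as the 3i in the posted format, is rejected with a ValueError rather than guessed at, so a change in the file's format string surfaces as an error instead of silently shifting columns.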


 