
Reading .dat files with pandas by format string

Reading a fixed-width .dat file in pandas is not very complicated using pd.read_csv('file.dat', sep='\\s+') or pd.read_fwf('file.dat', widths=[7, ..]). But the file also contains a format string like this:

Format = (i7,1x,i7,1x,i2,1x,i2,1x,i2,1x,f5.1,1x,i4,1x,3i,1x,f4.1,1x,i1,1x,f4.1,1x,i3,1x,i4,1x,i4,1x,i3,1x,i4,2x,i1)

Looking at the column contents, I assume the letter indicates the data type (i -> int, f -> float, x -> separator) and the number is the width of the column. Is this a standard notation? Is there a more Pythonic way to read data files by just passing this format string, so that scripts stay safe against format changes in the data file?

I noticed the colspecs argument of the read_fwf() function, but it takes a list of (int, int) pairs, not the kind of format string that is given in the file.

This is a standard notation: it follows the Fortran FORMAT edit-descriptor convention, where iW is an integer of width W, fW.D is a float of width W with D decimal places, and Nx skips N characters. The format only really matters if you need to write the file back out in an identical layout. For the purpose of reading it all into pandas you mostly don't need it. If you want control over the data type of each column as it is read in, use the dtype parameter. In the example below, column 'a' is read as a 64-bit float and 'b' as a 32-bit int.

import numpy as np
import pandas as pd
my_dtypes = {'a': np.float64, 'b': np.int32}  # column name -> dtype
df = pd.read_csv('file.dat', sep=r'\s+', dtype=my_dtypes)  # whitespace-delimited

You don't have to specify every column, only the ones you care about. pandas has likely inferred most of the types correctly by default anyway. After your call to read_csv(), try

df = pd.read_csv(....)
print(df.dtypes)

This shows the data type of each of your columns.
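If you do want the read to follow the format string itself, here is a minimal sketch of one way to do it. As far as I know pandas has no built-in parser for this notation, so parse_format below is a hypothetical helper, and the sample format and data rows are made up for illustration. It handles only flat lists of iW, fW.D and Nx descriptors (no nested groups), and it rejects a descriptor without a width, such as the 3i in the format quoted in the question, rather than guessing what was meant.

```python
import io
import re

import pandas as pd


def parse_format(fmt):
    """Hypothetical helper: map a flat Fortran-style format string to
    (colspecs, dtypes) suitable for pd.read_fwf."""
    colspecs, dtypes = [], {}
    pos = 0
    for token in fmt.strip().strip("()").split(","):
        m = re.fullmatch(r"(\d*)([ifx])(\d*)(?:\.\d+)?", token.strip().lower())
        if m is None:
            raise ValueError(f"unsupported descriptor: {token!r}")
        lead, kind, trail = m.groups()
        if kind == "x":
            # "Nx" skips N characters ("x" alone skips one)
            pos += int(lead or 1)
            continue
        if not trail:
            # e.g. "3i": no width given, so the column cannot be located
            raise ValueError(f"descriptor without width: {token!r}")
        for _ in range(int(lead or 1)):  # optional repeat count, e.g. "2i4"
            name = f"c{len(colspecs)}"
            colspecs.append((pos, pos + int(trail)))  # half-open [start, end)
            dtypes[name] = float if kind == "f" else int
            pos += int(trail)
    return colspecs, dtypes


# Made-up format and data, only to show the round trip.
colspecs, dtypes = parse_format("(i3,1x,f5.1,2x,i1)")
sample = io.StringIO("  1  12.5  3\n 42   7.0  1\n")
df = pd.read_fwf(sample, colspecs=colspecs, names=list(dtypes), dtype=dtypes)
print(df.dtypes)
```

Note that forcing int on a column will fail if that column has blank fields in the real file; in that case drop the dtype argument and let pandas infer the types.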
