
Read and save data file with variable number of columns in python

I have a space-separated data file that looks like this (just a slice):

Wavelength  Ele   Excit   loggf       D0        
11140.324   108.0 3.44     -7.945    4.395
11140.357    26.1 12.09    -2.247
11140.361   108.0 2.39     -8.119    4.395
11140.365    25.0 5.85    -9.734
11140.388    23.0 4.56    -4.573
11140.424   608.0 5.12    -10.419    11.09 
11140.452   606.0 2.12    -11.054     6.25 
11140.496   108.0 2.39    -8.119      4.395
11140.509   606.0 1.70    -7.824      6.25 

Part 1

First, I would like to read the file à la np.loadtxt. This does not work, so I tried with

d = np.genfromtxt('file.dat', skiprows=1, filling_value=0.0, missing_values=' ')

and different versions of that. All gave errors: Line #3 (got 4 columns instead of 5). I think I'm close to being able to read the file. Note that I prefer a solution with something like np.genfromtxt rather than opening the file and going through it line by line:

with open('test.dat', 'r') as lines:
    for line in lines:
        # put numbers in arrays/lists

Part 2

After reading the file successfully, I need to save it in a specific format. Very briefly, this file will be an input for a Fortran program, with 10 spaces per column for numbers. Without the last column (D0), I can use the following (there is a column I don't use, hence the '%27.1f'):

fmt_ = ('%9.2f', '%7.1f', '%11.2f','%10.3f', '%27.1f')
np.savetxt('output.dat', data, fmt=fmt_)

But I suspect this wouldn't work either. So something like np.genfromtxt for saving could be helpful.

Help with all of it, with one part, or just some guidance is appreciated.

Part 1:

Use pandas. It is designed specifically to handle this sort of scenario:

import pandas as pd
df = pd.read_csv('test.csv', sep=r'\s+')  # raw string: '\s' is not a valid escape in a normal string
print(df)

gives you:

   Wavelength    Ele  Excit   loggf      D0
0   11140.324  108.0   3.44  -7.945   4.395
1   11140.357   26.1  12.09  -2.247     NaN
2   11140.361  108.0   2.39  -8.119   4.395
3   11140.365   25.0   5.85  -9.734     NaN
4   11140.388   23.0   4.56  -4.573     NaN
5   11140.424  608.0   5.12 -10.419  11.090
6   11140.452  606.0   2.12 -11.054   6.250
7   11140.496  108.0   2.39  -8.119   4.395
8   11140.509  606.0   1.70  -7.824   6.250
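The same read can be tried without a file on disk. The following is a minimal sketch, assuming a recent pandas; the `sample` string holds three rows copied from the question, and the names `sample`, `n_missing` and `data` are only illustrative:

```python
import io
import pandas as pd

# A few rows copied from the question; the second data row lacks the D0 field.
sample = """Wavelength  Ele   Excit   loggf       D0
11140.324   108.0 3.44     -7.945    4.395
11140.357    26.1 12.09    -2.247
11140.424   608.0 5.12    -10.419    11.09
"""

# Any run of whitespace separates fields; a missing trailing field becomes NaN.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

n_missing = int(df["D0"].isna().sum())  # number of rows without D0
data = df.to_numpy()                    # plain float array for NumPy code
print(df.shape, n_missing)
```

If a NumPy array is what the rest of the code expects, `df.to_numpy()` gives one with NaN in the empty slots.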

Part 2

You can also use pandas for this, although it is a bit more complex to get the formatting correct:

import numpy as np

# one formatter per column; the last one writes blanks instead of 'nan'
formatters = ['{: >9.2f}'.format, '{: >7.1f}'.format,
              '{: >11.2f}'.format, '{: >10.3f}'.format,
              lambda x: ' '*27 if np.isnan(x) else '{: >27.1f}'.format(x)]

lines = df.to_string(index=False, header=False, formatters=formatters)

with open('out.dat', 'w') as outfile:
    outfile.write(lines)

Gives you:

 11140.32   108.0        3.44     -7.945                         4.4
 11140.36    26.1       12.09     -2.247                            
 11140.36   108.0        2.39     -8.119                         4.4
 11140.36    25.0        5.85     -9.734                            
 11140.39    23.0        4.56     -4.573                            
 11140.42   608.0        5.12    -10.419                        11.1
 11140.45   606.0        2.12    -11.054                         6.2
 11140.50   108.0        2.39     -8.119                         4.4
 11140.51   606.0        1.70     -7.824                         6.2

Here's part of a sample run with your data.

In [62]: txt=b"""Wavelength  Ele   Excit   loggf       D0        
11140.324   108.0 3.44     -7.945    4.395
...
11140.509   606.0 1.70    -7.824      6.25 """

In [63]: txt=txt.splitlines()

In [64]: def foo(astr):
    # add a 'NaN' field to the short lines
    if len(astr)<35:
        astr += b'  NaN'  # or filler of your choice
    return astr
   ....: 

In [65]: data=np.loadtxt([foo(t) for t in txt], skiprows=1)

In [66]: data
Out[66]: 
array([[  1.11403240e+04,   1.08000000e+02,   3.44000000e+00,
         -7.94500000e+00,   4.39500000e+00],
       [  1.11403570e+04,   2.61000000e+01,   1.20900000e+01,
         -2.24700000e+00,              nan],
        ...
       [  1.11405090e+04,   6.06000000e+02,   1.70000000e+00,
         -7.82400000e+00,   6.25000000e+00]])

In [67]: np.savetxt('test.dat',data,fmt=fmt_)

In [69]: cat test.dat
 11140.32   108.0        3.44     -7.945                         4.4
 11140.36    26.1       12.09     -2.247                         nan
 11140.36   108.0        2.39     -8.119                         4.4
 11140.36    25.0        5.85     -9.734                         nan
 ...
 11140.51   606.0        1.70     -7.824                         6.2

The file can be passed through foo like this:

with open('test.dat', 'rb') as f:  # binary mode, since foo appends bytes
     xx = np.loadtxt((foo(t.rstrip()) for t in f),skiprows=1)
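The length test in foo depends on short lines staying under 35 characters. A possible variant, sketched here under that caveat, counts whitespace-separated fields instead. The helper name `pad_fields` and its `ncols` and `filler` parameters are hypothetical, not part of NumPy:

```python
import io
import numpy as np

def pad_fields(lines, ncols=5, filler="nan"):
    """Yield each line with `filler` appended until it has `ncols` fields."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        fields += [filler] * (ncols - len(fields))
        yield " ".join(fields)

# A few rows copied from the question, held in memory instead of a file.
sample = io.StringIO(
    "Wavelength  Ele   Excit   loggf       D0\n"
    "11140.324   108.0 3.44     -7.945    4.395\n"
    "11140.357    26.1 12.09    -2.247\n"
    "11140.424   608.0 5.12    -10.419    11.09\n"
)

# loadtxt accepts a list of strings; skiprows=1 drops the header line.
data = np.loadtxt(list(pad_fields(sample)), skiprows=1)
```

An open text-mode file can be passed in place of `sample`, since the helper only iterates over lines.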

savetxt essentially does a row-by-row write, so it isn't hard to write your own version, e.g.

In [120]: asbytes=np.lib.npyio.asbytes

In [121]: fmt__='%9.2f  %7.1f  %11.2f  %10.3f  %10.1f'

In [122]: with open('test.dat','wb') as f: 
     for row in data:
        f.write(asbytes(fmt__%tuple(row)+'\n'))
   .....:         

In [123]: cat test.dat
 11140.32    108.0         3.44      -7.945         4.4
 11140.36     26.1        12.09      -2.247         nan
 11140.36    108.0         2.39      -8.119         4.4
 11140.36     25.0         5.85      -9.734         nan
 ...
 11140.51    606.0         1.70      -7.824         6.2

With this it wouldn't be hard to test each row and use a different format for rows with a nan.
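One way that per-row test might look is sketched below. It reuses the column widths from the question's fmt_, joined by single spaces as savetxt does by default. The names `FULL_FMT`, `SHORT_FMT` and `format_row` are made up for illustration, and dropping the last field (rather than padding it with blanks) is an assumption about what the Fortran reader accepts:

```python
import math

# Column widths taken from the question's fmt_, separated by single spaces.
FULL_FMT = "%9.2f %7.1f %11.2f %10.3f %27.1f"
SHORT_FMT = "%9.2f %7.1f %11.2f %10.3f"

def format_row(row):
    """Format one row, leaving out the last field when it is NaN."""
    if math.isnan(row[-1]):
        return SHORT_FMT % tuple(row[:-1])
    return FULL_FMT % tuple(row)

# Two rows from the question; NaN marks the missing D0 value.
rows = [
    (11140.324, 108.0, 3.44, -7.945, 4.395),
    (11140.357, 26.1, 12.09, -2.247, float("nan")),
]
lines = [format_row(r) for r in rows]

with open("out.dat", "w") as outfile:
    outfile.write("\n".join(lines) + "\n")
```

Rows of a NumPy array work the same way, because `tuple(row)` converts each row before formatting.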

