Reading a delimited file with wrapped lines

Question

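My apologies if there is an obvious answer to this.

I have a very large file that is proving challenging to parse. The files are delivered to me from outside my organization, so I have no chance of changing their format.

First, the file is space delimited, but the fields that make up the "columns" of a row can be spread over multiple lines. For example, a row that is supposed to hold 25 columns of data may be written to the file like this: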

1 2 3 4 5 6 7 8 9 10 11 12 13 14
   15 16 17 18 19 20 21 
  22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
   14 15 16 17 18
  19 20 21 22 23 24 25

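As you can see, I cannot rely on each set of data being on a single line, but I can rely on every set having the same number of columns.

To make matters worse, the file follows a definition:data format, where the first 3 or so lines describe the data (including a field that tells me how many rows there are) and the next N lines are the data. Then it goes back to the 3-line format to describe the next set of data. That means I can't just set up a reader for the N-column format and let it run to EOF.

I'm afraid the built-in Python file reading functionality will get ugly very quickly, but I can't find anything in csv or numpy that works.

Any suggestions?

EDIT: As an example of another solution:

We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns, so we do something like this: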

data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);

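This reads the data no matter how it is wrapped, while leaving the file handle open so the next section can be processed later. It is done this way because the files are so large that reading them in one go would use too much RAM.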

Answer 1

Here is a sketch of how you could proceed: (EDIT: with some modifications)

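def read_sections(path, parse_header):
    # parse_header(l1, l2, l3) is a placeholder you have to write yourself:
    # it must return (nrows, ncols) for the section that follows, and its
    # body depends on how your three description lines are laid out.

    # store data for the different sections here
    datasections = []

    with open(path, "r") as file:
        while True:
            # read three lines
            l1 = file.readline()
            if l1 == '':  # end of file, or use another end condition
                break
            l2 = file.readline()
            l3 = file.readline()

            # extract the number of rows and columns in the next section
            nrows, ncols = parse_header(l1, l2, l3)

            # read lines until the whole section has been collected
            current_row = []
            while len(current_row) < nrows * ncols:
                line = file.readline()
                if line == '':
                    raise ValueError("file ended in the middle of a section")
                # isolate the items using str.split() and append them
                current_row.extend(line.split())

            # break current_row into rows after each ncols-th item
            datasections.append(
                [current_row[i:i + ncols] for i in range(0, nrows * ncols, ncols)]
            )

    return datasections

# usage: datasections = read_sections("testfile.txt", parse_header)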
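
A variation that stays closer to the textscan call in the question is to treat the file as a stream of whitespace-separated tokens and pull exactly n_rows * n_cols of them per section. The following is only a minimal sketch: the helper names tokens and read_block are made up for illustration, the values are assumed to be numeric, and the wrapped sample from the question stands in for a real file.

```python
import io
from itertools import islice


def tokens(handle):
    """Yield whitespace-separated tokens from an open file, one line at a time."""
    for line in handle:
        yield from line.split()


def read_block(token_iter, n_rows, n_cols):
    """Consume exactly n_rows * n_cols tokens and return them as rows of floats."""
    flat = [float(t) for t in islice(token_iter, n_rows * n_cols)]
    if len(flat) != n_rows * n_cols:
        raise ValueError("file ended before the block was complete")
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


# The wrapped sample from the question: two records of 25 columns each.
sample = io.StringIO(
    "1 2 3 4 5 6 7 8 9 10 11 12 13 14\n"
    "   15 16 17 18 19 20 21 \n"
    "  22 23 24 25\n"
    "1 2 3 4 5 6 7 8 9 10 11 12 13\n"
    "   14 15 16 17 18\n"
    "  19 20 21 22 23 24 25\n"
)
rows = read_block(tokens(sample), n_rows=2, n_cols=25)
print(len(rows), len(rows[0]), rows[1][-1])  # prints: 2 25 25.0
```

Because the generator only reads a line when it needs more tokens, the handle is left positioned right after the last data line, so the next description lines can be read with readline() as in the sketch above. That relies on each data section ending at the end of a line, which seems to be the case for the format described but is worth checking on a real file. Only one section is held in memory at a time.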
