Parsing a txt file with multiple delimiters

Question

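I have a data.txt file that contains interactions between objects with respect to the distance between them. As a small example, suppose I have the objects a, b, c, A, B, C and I measured their interactions at only one distance value. The output then has this format: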

header
    Distance aA bA cA
             aB bB cB
             aC bC cC

Here is a small real example:

% rows 3 cols 10
    0.001000     0.270443    -0.276056     0.277961
                 0.241303     0.227167     0.227000
                -0.238565     0.257939     0.275644
    0.002000     0.126853     0.121890     0.115652
                 0.137218     0.136350     0.132567
                 0.116713     0.113115     0.111461
    0.003000     0.201059     0.184873    -0.170027
                 0.132424    -0.122704    -0.112826
                 0.089461     0.086023     0.084290

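I want to parse this data file and reshape it into a matrix that actually has the size given in the header (3x10 in this case), so that I can plot a particular interaction w.r.t. distance.

The first problem is of course the distance column, since there are gaps between its values. So, as a first attempt, I removed the distance column (since I already know that data) and tried to parse the interaction terms with the following code: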

     import numpy as np

     # Read the "% rows 3 cols 10" header to get the target shape
     with open('data.txt', 'r') as the_file:
         all_data = [line.strip() for line in the_file.readlines()]
         header = all_data[0].split()
         row = int(header[2])
         cols = int(header[4])

     # Attempt to split on a fixed run of five spaces
     lines = np.loadtxt("data.txt", delimiter="     ", skiprows=1)
     a = np.reshape(lines, (row, cols))

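But the negative values mess up the delimiter. So my question is: how can I parse this file (keeping the distance column, if possible)?

I know this is a very specific question, but I would appreciate even a small push in the right direction. I have already tried np.split and the pandas library, but could not get the result I want.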

Answer 1

Another approach could be to use a regular expression to parse the values in the file:

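import re

# Read every line after the "% rows ... cols ..." header
with open('data.txt', 'r') as f:
    lines = f.readlines()[1:]

# Optional minus sign, digits, a literal dot, then more digits
expr = r'-?\d+\.\d*'
expr_compiled = re.compile(expr)

data_values = [expr_compiled.findall(l) for l in lines]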

For the example, data_values will be a regular list of lists containing the following:

[['0.001000', '0.270443', '-0.276056', '0.277961'],
 ['0.241303', '0.227167', '0.227000'],
 ['-0.238565', '0.257939', '0.275644'],
 ['0.002000', '0.126853', '0.121890', '0.115652'],
 ['0.137218', '0.136350', '0.132567'],
 ['0.116713', '0.113115', '0.111461'],
 ['0.003000', '0.201059', '0.184873', '-0.170027'],
 ['0.132424', '-0.122704', '-0.112826'],
 ['0.089461', '0.086023', '0.084290']]

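Of course, this requires you to convert each value to a valid number before you can do math on it, and to pull the first value out of some of these lists, since those values represent something different (the distance).

Finally, you can use numpy arrays and reshape them as needed.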
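The conversion and reshaping step could be sketched roughly as follows. This is only a minimal sketch: it assumes data_values has the layout shown above (one four-item list followed by two three-item lists per distance, so 10 numbers per block), and the helper name to_matrix is made up for illustration.

```python
import numpy as np

def to_matrix(data_values):
    """Flatten the string lists and group every 10 numbers into one row.

    Hypothetical helper: assumes each distance block is one 4-value line
    followed by two 3-value lines.
    """
    flat = [float(v) for line in data_values for v in line]
    return np.array(flat).reshape(-1, 10)

# First distance block from the example above
data_values = [['0.001000', '0.270443', '-0.276056', '0.277961'],
               ['0.241303', '0.227167', '0.227000'],
               ['-0.238565', '0.257939', '0.275644']]

matrix = to_matrix(data_values)
distances = matrix[:, 0]      # first column holds the distance
interactions = matrix[:, 1:]  # remaining 9 columns hold aA ... cC
```

Each row of matrix then corresponds to one distance, which matches the 3x10 shape requested in the question when all three blocks are passed in.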

Answer 2

A rough "solution" (assuming the data file is perfectly formatted):

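import numpy as np

with open('matrix.dat', 'r') as data_file:
    # Header looks like "% rows 3 cols 10"; keep only the integer tokens
    rows, cols = [int(c) for c in data_file.readline().split() if c.isnumeric()]
    # A whitespace separator matches any run of whitespace, newlines included
    array = np.fromstring(data_file.read(), sep=' ').reshape(rows, cols)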

Here is a possibly unnecessary alternative that avoids reading the whole file as a single string:

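import itertools
import numpy as np

chainstar = itertools.chain.from_iterable
with open('matrix.dat', 'r') as data_file:
    rows, cols = [int(c)
                  for c in data_file.readline().split()
                  if c.isnumeric()]
    # Lazily split each line into tokens and chain them into one stream
    tokens = chainstar(map(str.split, data_file))
    # np.float is gone from recent NumPy releases; use the builtin float
    array = np.fromiter(map(float, tokens),
                        dtype=float,
                        count=rows*cols).reshape(rows, cols)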
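Since the question asks to keep the distance column, here is a hedged sketch of how the reshaped array might be split afterwards. It uses an in-memory copy of the question's sample instead of a file, parses the tokens with plain str.split (a swapped-in technique, not the np.fromstring call above), and the names distance and interactions are chosen for illustration only.

```python
import io
import numpy as np

# In-memory copy of the sample from the question, so no file is needed
sample = io.StringIO(
    "% rows 3 cols 10\n"
    "    0.001000     0.270443    -0.276056     0.277961\n"
    "                 0.241303     0.227167     0.227000\n"
    "                -0.238565     0.257939     0.275644\n"
    "    0.002000     0.126853     0.121890     0.115652\n"
    "                 0.137218     0.136350     0.132567\n"
    "                 0.116713     0.113115     0.111461\n"
    "    0.003000     0.201059     0.184873    -0.170027\n"
    "                 0.132424    -0.122704    -0.112826\n"
    "                 0.089461     0.086023     0.084290\n"
)

# Same header parsing as above: keep only the integer tokens
rows, cols = [int(c) for c in sample.readline().split() if c.isnumeric()]

# Split on any whitespace, convert to float, and reshape to rows x cols
array = np.array(sample.read().split(), dtype=float).reshape(rows, cols)

distance = array[:, 0]                           # one distance per row
interactions = array[:, 1:].reshape(rows, 3, 3)  # 3x3 block per distance
```

With this layout, interactions[k, i, j] is the entry in line i, column j of the block measured at distance[k], so a single interaction can be plotted against distance by fixing i and j.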

Answer 3

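If you are storing the values as floats, you only need to reduce the delimiter to a single space. The leading spaces in front of positive values do not affect the value conversion.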
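One possible reading of this suggestion, as a rough sketch: calling str.split() with no argument treats any run of whitespace as one separator, and float() ignores surrounding whitespace, so neither the varying gaps nor the minus signs get in the way. The sample line is copied from the question; this sketch does not use np.loadtxt.

```python
# One data line from the question, with its uneven spacing preserved
line = "    0.001000     0.270443    -0.276056     0.277961\n"

# split() with no argument collapses runs of whitespace into one separator
values = [float(tok) for tok in line.split()]

# float() itself tolerates leading and trailing whitespace
padded = float("   0.270443")
```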
