Reading a text file with a given fixed-width field structure using Python

Question

I am trying to read a text file whose lines contain several fields, each with a fixed number of characters. I know that the first field takes n1 characters, the second field n2 characters, and so on.

This is what I have so far, for one line:

# Line
line = 'AAABBCCCCDDDDDE'

# Field widths, in order
slice_structure = [3, 2, 4, 5, 1]

sliced_array = []
cursor = 0
for n in slice_structure:
    sliced_array.append(line[cursor:cursor + n])
    cursor += n

print(sliced_array)

The output is the following:

['AAA', 'BB', 'CCCC', 'DDDDD', 'E']

My intention is to turn this code into a function and call it for every line of the file. I am sure there must be a better way to do this.

Thanks in advance.

Answer 1

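You can use groupby on every line you read from the file. Note that groupby splits wherever the character changes and does not use the field widths, so this only works when each field is a run of one repeated character, as in the sample line: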

from itertools import groupby

line = 'AAABBCCCCDDDDDE'

result = ["".join(g) for _, g in groupby(line)]

print(result)

Result:

['AAA', 'BB', 'CCCC', 'DDDDD', 'E']
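
A small sketch of that limitation, using a made-up record (the value 'ABC12XXYZ' and the helper name split_by_runs are hypothetical, not from the question). It shows that groupby reproduces the intended fields only when each field is a run of a single character:

```python
from itertools import groupby

def split_by_runs(line):
    # Group consecutive identical characters; field widths are never consulted.
    return ["".join(g) for _, g in groupby(line)]

# Matches the intended fields only because each field is one repeated character.
sample = split_by_runs('AAABBCCCCDDDDDE')

# Hypothetical record meant to be split with widths [3, 2, 4]: 'ABC', '12', 'XXYZ'.
realistic = split_by_runs('ABC12XXYZ')

print(sample)
print(realistic)
```

For the second record the result is one group per run of characters rather than the three intended fields, so a width-based approach is needed for real data.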

Answer 2

Question: unpack record fields structured with a given number of characters each.

from struct import unpack

record = 'AAABBCCCCDDDDDE'

# struct works on bytes, so the widths in the format string are byte counts
fields = [item.decode() for item in
          unpack('3s2s4s5s1s', bytes(record, 'utf-8'))]

print(fields)
# Output: ['AAA', 'BB', 'CCCC', 'DDDDD', 'E']
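
To apply the same idea to every line of a file, one option is to compile the format once with struct.Struct and reuse it. This is a minimal sketch, assuming ASCII data so that one character is one byte; the names record_struct and parse_record are illustrative, and io.StringIO stands in for an open file:

```python
import io
from struct import Struct

# Compile the format once: byte strings of 3, 2, 4, 5 and 1 bytes.
record_struct = Struct('3s2s4s5s1s')

def parse_record(line):
    # Assumption: the data is ASCII, so character widths equal byte widths.
    raw = line.rstrip('\n').encode('ascii')
    # unpack raises struct.error unless raw is exactly record_struct.size bytes.
    return [field.decode('ascii') for field in record_struct.unpack(raw)]

# io.StringIO stands in for an open text file in this sketch.
fake_file = io.StringIO('AAABBCCCCDDDDDE\nabcdefghijklmno\n')
records = [parse_record(line) for line in fake_file]
print(records)
```

Because unpack insists on an exact length, short or overlong lines fail loudly instead of producing truncated fields, which may or may not be what you want.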

Answer 3

If your field values are actual text (rather than one repeated character) and you want to split the string by the widths in your slice list, here is a simple, readable approach:

# Line
line = 'AAABBCCCCDDDDDE'
# Array structure
slice_structure  = [3,2,4,5,1]
# Results list
result = []

for i in slice_structure:
    result.append(line[:i])
    line = line[i:]

print(result)

Output:

['AAA', 'BB', 'CCCC', 'DDDDD', 'E']
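
Since the question asks for a function to call on every line, the slicing loop can be wrapped as follows. This is a sketch only: the name split_fixed_width is made up here, it uses a cursor instead of shortening the line, and io.StringIO stands in for the real file:

```python
import io

def split_fixed_width(line, widths):
    """Return the fields of line, taking widths[i] characters for field i."""
    fields = []
    cursor = 0
    for width in widths:
        fields.append(line[cursor:cursor + width])
        cursor += width
    return fields

widths = [3, 2, 4, 5, 1]

# io.StringIO stands in for an open text file in this sketch.
fake_file = io.StringIO('AAABBCCCCDDDDDE\nabcdefghijklmno\n')
rows = [split_fixed_width(line.rstrip('\n'), widths) for line in fake_file]
print(rows)
```

Slicing never raises on a short line; the trailing fields simply come back shorter or empty, so add a length check if malformed lines should be rejected.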

Answer 4

You could do it using the following two methods.

Method-1:
Uses list.insert to place separators ('|') at the field boundaries and then splits the string on them. This assumes '|' never occurs in the data.

Method-2:
Uses a list comprehension.

import numpy as np

# Line
line = 'AAABBCCCCDDDDDE'
# Array structure
slice_structure  = [3,2,4,5,1]
ss = np.array(slice_structure).cumsum()

# Method-1
# >> Uses list.insert to place some separators ('|')
#    and then split the string using these separators.
l = list(line)
for p in np.flip(ss[:-1]):
    l.insert(p,'|')
final_1 = ''.join(l).split('|')
print('Method-1: {}'.format(final_1))

# Method-2
# >> Uses list comprehension
stop_pos = ss.tolist()
start_pos = [0] + ss[:-1].tolist()
final_2 = [line[start:stop] for start, stop in zip(start_pos, stop_pos)]
print('Method-2: {}'.format(final_2))

Output:

Method-1: ['AAA', 'BB', 'CCCC', 'DDDDD', 'E']
Method-2: ['AAA', 'BB', 'CCCC', 'DDDDD', 'E']
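
If pulling in numpy only for cumsum feels heavy, the same boundaries can be computed with itertools.accumulate from the standard library. A minimal sketch of Method-2 along those lines (the names field_slices and split_line are illustrative):

```python
from itertools import accumulate

slice_structure = [3, 2, 4, 5, 1]

# Running totals give the end offset of each field: [3, 5, 9, 14, 15].
stops = list(accumulate(slice_structure))
# Each field starts where the previous one ended.
starts = [0] + stops[:-1]
# Build the slice objects once so they can be reused for every line.
field_slices = [slice(a, b) for a, b in zip(starts, stops)]

def split_line(line):
    return [line[s] for s in field_slices]

print(split_line('AAABBCCCCDDDDDE'))
```

Precomputing the slices means the per-line work is just the list comprehension, which helps when the file has many lines.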

Answer 1: score 2, accepted, 2019-10-15 17:58:51
Answer 2: score 1, 2019-10-15 18:41:35
Answer 3: score 1, 2019-10-15 19:44:57
Answer 4: score 1, 2019-10-15 20:22:04
