Python 3: filter a file with multiline records using generators

Question

I need to read huge files structured as multiline records and write the records with certain indices to a file, say record numbers R = 1, 2 and 1093. If records are N = 3 lines each, this amounts to reading the file line by line and writing line numbers 1, 2, 3 and 4, 5, 6 and 3277, 3278, 3279 (since the first line of each record Ri is at line number (Ri - 1) * N + 1).
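The line arithmetic above can be checked with a small sketch. The helper name `record_line_numbers` is made up for illustration and is not part of the original question.

```python
def record_line_numbers(R, N):
    """Return the 1-based line numbers covered by the 1-based record numbers in R."""
    lines = []
    for r in R:
        first = (r - 1) * N + 1  # first line of record r
        lines.extend(range(first, first + N))
    return lines


print(record_line_numbers([1, 2, 1093], 3))
# [1, 2, 3, 4, 5, 6, 3277, 3278, 3279]
```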

I guess one could calculate the lines to write, go through the file line by line and write those lines. However, is it possible to "zip" consecutive lines 1, 2 and 3 into a generator object containing records and filter these somehow, or print them directly to file if they enumerate to R? Something along the lines of this pseudocode:

def subset(file_in, file_out, N, R):
    with open(file_in, "rt") as fin, open(file_out, "wt") as fout:
        line = (line.rstrip() for line in fin)
        record = enumerate(zip(line, line, line)) # What if records are of size N
        for i, r in record if i in R:
            fout.write(r)

What should I do if I want the record size N as a parameter?

UPDATE EXAMPLE

An example for file_in (4 records, 3 lines/record):

dslfkj
2
a
dflkj
3
g
fds
2
b
fsdlkj
1
n

Then subset(file_in, file_out, 3, [1,3]) would give (file_out):

dslfkj
2
a
fds
2
b

Answer 1

For this problem, it makes sense to tackle it directly, line by line, using floor division.

For example:

fin = '''
dslfkj
2
a
dflkj
3
g
fds
2
b
fsdlkj
1
'''

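# splitlines() keeps lines containing spaces intact (split() would break them apart)
line_gen = (line.rstrip() for line in fin.strip().splitlines())

R = [1, 3]
R = {val - 1 for val in R}  # convert to zero-based record indices
N = 3
for i, line in enumerate(line_gen):
    if i // N in R:  # i // N is the zero-based index of the record containing line i
        print(line)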

Output:

dslfkj
2
a
fds
2
b

Your function could look something like the following (you may want to check whether it works out of the box or needs tweaks; I did not test the file-opening portion):

def subset(file_in, file_out, N, R):
    R = {val - 1 for val in R}  # zero-based record indices; a set gives fast lookups
    with open(file_in, "rt") as fin, open(file_out, "wt") as fout:
        line_gen = (line.rstrip() for line in fin)
        for i, line in enumerate(line_gen):
            if i // N in R:  # line i belongs to zero-based record i // N
                fout.write(line)
                fout.write('\n')
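For huge files where the wanted records sit near the start, a variant could stop reading once the last wanted record has been written. This is only a sketch under the same assumptions as the function above (1-based record numbers in R, fixed record size N). The name `subset_early_stop` is made up here.

```python
def subset_early_stop(file_in, file_out, N, R):
    wanted = {val - 1 for val in R}  # zero-based record indices
    last = max(wanted, default=-1)   # -1 means nothing was requested
    with open(file_in, "rt") as fin, open(file_out, "wt") as fout:
        for i, line in enumerate(fin):
            record = i // N  # zero-based index of the record containing line i
            if record > last:
                break  # past the last wanted record, no need to read further
            if record in wanted:
                fout.write(line.rstrip("\n") + "\n")
```

The file is still read sequentially up to the last wanted record, so the saving depends on where that record sits in the file.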

Edit: The answer below pertains to how you could use generators to group the values together. Having said that, I do not think you need to use it. However, if you still wish to, you can build your function on top of it.

Old answer:

You can create n references to the same generator object using a list, and then unpack them into zip with the * (a.k.a. splat) operator.

For example:

from itertools import zip_longest
line = (x for x in range(100, 132))
n = 3
record = zip(*([line] * n)) #equivalent to *[line, line, line] which is unpacked into zip arguments
for i, r in enumerate(record):
    print(i, r)

0 (100, 101, 102)
1 (103, 104, 105)
2 (106, 107, 108)
3 (109, 110, 111)
4 (112, 113, 114)
5 (115, 116, 117)
6 (118, 119, 120)
7 (121, 122, 123)
8 (124, 125, 126)
9 (127, 128, 129)
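Applied to the original question, the same grouping trick could be used to build records of N lines and filter them by record number. This is a minimal sketch, not a tested drop-in. The name `subset_grouped` is made up, and R is assumed to hold 1-based record numbers.

```python
def subset_grouped(file_in, file_out, N, R):
    wanted = set(R)  # 1-based record numbers
    with open(file_in, "rt") as fin, open(file_out, "wt") as fout:
        lines = (line.rstrip("\n") for line in fin)
        # N references to the same generator, so zip pulls N consecutive lines per tuple
        records = zip(*[lines] * N)
        for i, record in enumerate(records, start=1):
            if i in wanted:
                fout.write("\n".join(record) + "\n")
```

Because plain zip stops at the shortest input, a trailing record with fewer than N lines is silently dropped.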

Also, depending on what you want to happen with "leftover" lines, you may want to use zip_longest instead.
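As a small illustration of the difference, here is the same grouping with zip_longest on an input whose length is not a multiple of n. The fill value None is an arbitrary choice for this sketch.

```python
from itertools import zip_longest

line = (x for x in range(100, 108))  # 8 values: two full records plus two leftovers
n = 3
records = list(zip_longest(*[line] * n, fillvalue=None))
print(records)
# [(100, 101, 102), (103, 104, 105), (106, 107, None)]
```

With plain zip the last, incomplete tuple would be dropped; zip_longest keeps it and pads it with the fill value.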

Answer 1
1 accepted 2019-05-29 11:54:04
