Reading a file with a specified delimiter for newline

Question

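I have a file in which the lines are separated by a delimiter, say '.'. I want to read this file line by line, where a line ends at '.' rather than at a newline.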

One way is:

with open('file', 'r') as f:
    for line in f.read().strip().split('.'):
        ...  # do some work

But this is not memory efficient if the file is too large. Rather than reading the whole file at once, I want to read it line by line.

open supports a 'newline' parameter, but it only accepts None, '', '\n', '\r', and '\r\n' as input, as the documentation notes.

Is there a way to read the lines of a file efficiently, but based on a pre-specified delimiter?

Answer 1

You could use a generator:

def myreadlines(f, newline):
    buf = ""
    while True:
        # hand out every complete line already in the buffer
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(4096)
        if not chunk:
            # end of file: whatever is left is the last line
            yield buf
            break
        buf += chunk

with open('file') as f:
    for line in myreadlines(f, "."):
        print(line)
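
A quick way to check this kind of generator without a large file is to feed it an in-memory stream. The sketch below is only an illustration: `read_records` and its `chunk_size` parameter are hypothetical names that do not appear in the answer, and `io.StringIO` stands in for the real file. A tiny chunk size forces the delimiter to straddle read boundaries, which is the case most likely to go wrong.

```python
import io


def read_records(f, delimiter, chunk_size=4096):
    """Yield the pieces of f separated by delimiter, reading chunk_size characters at a time."""
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            yield buf  # whatever follows the last delimiter (may be empty)
            return
        buf += chunk
        # Everything before the last delimiter is complete; keep the tail for the next read.
        *complete, buf = buf.split(delimiter)
        yield from complete


text = "first||second||third"
records = list(read_records(io.StringIO(text), "||", chunk_size=3))
print(records)  # ['first', 'second', 'third']
```

If the input ends with the delimiter, the last item yielded is an empty string, so a caller that does not want it has to skip it.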

Answer 2

The simplest way is to preprocess the file so that it has newlines where you want them.

Here is an example using perl (assuming you want the string 'abc' to act as the newline):

perl -pe 's/abc/\n/g' text.txt > processed_text.txt

If you also want to ignore the original newlines, use the following instead:

perl -ne 's/\n//; s/abc/\n/g; print' text.txt > processed_text.txt
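
For readers without Perl at hand, roughly the same preprocessing as the first command can be sketched in Python. This is an assumption-laden illustration, not part of the answer: `replace_delimiter` and `chunk_size` are hypothetical names, and in-memory streams stand in for text.txt and processed_text.txt. Like the first Perl command, it leaves any original newlines in place.

```python
import io


def replace_delimiter(src, dst, delimiter, chunk_size=4096):
    """Copy src to dst, writing a newline wherever delimiter occurs."""
    buf = ""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            dst.write(buf)  # text after the last delimiter
            break
        buf += chunk
        # The last piece may end in a partial delimiter, so hold it back.
        *complete, buf = buf.split(delimiter)
        for piece in complete:
            dst.write(piece + "\n")


src = io.StringIO("oneabctwoabcthree")
dst = io.StringIO()
replace_delimiter(src, dst, "abc", chunk_size=4)
result = dst.getvalue()
```

With real files, `src` and `dst` would be two handles from `open`. The buffer grows until a delimiter is seen, so this only saves memory when the delimiter occurs reasonably often.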

Answer 3

Here is a more efficient answer, using FileIO and bytearray, which I used for parsing a PDF file:

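import io
import re


# the end-of-line chars, separated by a `|` (logical OR)
EOL_REGEX = re.compile(b'\r\n|\r|\n')

# the end-of-file marker
EOF = b'%%EOF'


def readlines(fio):
    buf = bytearray(4096)
    pending = b''  # bytes after the last complete line seen so far
    while True:
        n = fio.readinto(buf)  # number of bytes actually read
        if not n:
            # real end of file, marker never seen: flush what is left
            yield from EOL_REGEX.split(pending)
            break
        data = pending + bytes(buf[:n])
        pos = data.find(EOF)
        if pos != -1:
            # stop at the marker and emit the lines before it
            yield from EOL_REGEX.split(data[:pos])
            break
        # hold back a trailing '\r': its '\n' may arrive with the next read
        cut = len(data) - 1 if data.endswith(b'\r') else len(data)
        lines = EOL_REGEX.split(data[:cut])
        pending = lines.pop() + data[cut:]
        yield from lines


with io.FileIO("test.pdf") as fio:
    for line in readlines(fio):
        ...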

The example above also handles a custom EOF. If you don't want that, use:

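import io
import re


# the end-of-line chars, separated by a `|` (logical OR)
EOL_REGEX = re.compile(b'\r\n|\r|\n')


def readlines(fio):
    buf = bytearray(4096)
    pending = b''  # bytes after the last complete line seen so far
    while True:
        n = fio.readinto(buf)  # 0 at end of file, so no size check is needed
        if not n:
            yield from EOL_REGEX.split(pending)
            break
        data = pending + bytes(buf[:n])
        # hold back a trailing '\r': its '\n' may arrive with the next read
        cut = len(data) - 1 if data.endswith(b'\r') else len(data)
        lines = EOL_REGEX.split(data[:cut])
        pending = lines.pop() + data[cut:]
        yield from lines


with io.FileIO("test.pdf") as fio:
    for line in readlines(fio):
        ...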
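
One detail worth keeping in mind when reusing a bytearray with readinto: the call overwrites only as many bytes as it actually read and returns that count, so after a short final read the rest of the buffer still holds bytes from the previous read. The small sketch below shows the effect, with `io.BytesIO` standing in for the file and a deliberately tiny buffer (both are assumptions made for illustration).

```python
import io

buf = bytearray(8)
src = io.BytesIO(b"0123456789")   # 10 bytes: one full read, then a short one

first = src.readinto(buf)          # fills all 8 bytes
second = src.readinto(buf)         # only 2 new bytes; the last 6 are left over

stale_view = bytes(buf)            # whole buffer, including leftover bytes
fresh_view = bytes(buf[:second])   # only the bytes from the latest read

print(first, second)               # 8 2
print(stale_view, fresh_view)      # b'89234567' b'89'
```

Slicing the buffer to the returned count before splitting it into lines keeps the leftover bytes from showing up as extra lines.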

3 answers

Answer 1
Score 20, accepted, 2013-04-28 06:10:07

Answer 2
Score 2, 2013-05-07 23:15:45

Answer 3
Score 1, 2018-12-10 10:56:34
