Alternative to `tell()` when iterating over the lines of a file in Python 3?

Question

How can I find out the position of the file cursor while iterating over a file in Python 3?

In Python 2.7 it is trivial: use tell(). In Python 3 the same call raises an OSError:

Traceback (most recent call last):
  File "foo.py", line 113, in check_file
    pos = infile.tell()
OSError: telling position disabled by next() call

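My use case is a progress bar for reading a large CSV file. Counting the total number of lines is too expensive and requires an extra pass. An approximate value is perfectly useful; I don't care about buffers or other sources of noise, I just want to know whether it will take 10 seconds or 10 minutes.

Simple code to reproduce the issue. It works as expected on Python 2.7 but throws on Python 3: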

file_size = os.stat(path).st_size
with open(path, "r") as infile:
    reader = csv.reader(infile)
    for row in reader:
        pos = infile.tell()  # OSError: telling position disabled by next() call
        print("At byte {} of {}".format(pos, file_size))

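This answer https://stackoverflow.com/a/29641787/321772 suggests that the problem is that the next() method disables tell() during iteration. The alternative is to read line by line manually, but that code is inside the csv module, so I can't get at it. I also can't fathom what Python 3 gains by disabling tell().

So what is the preferred way to find out your byte offset while iterating over the lines of a file in Python 3?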

Answer 1

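The csv module only expects the first argument of the reader call to be an iterator that returns one line on each next call. So you can simply use an iterator wrapper that counts the characters. If you want the count to be accurate, you have to open the file in binary mode. That is actually fine, because you then get no end-of-line conversion, which is what the csv module expects.

So a possible wrapper is: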

class SizedReader:
    def __init__(self, fd, encoding='utf-8'):
        self.fd = fd
        self.size = 0
        self.encoding = encoding   # specify encoding in constructor, with utf8 as default
    def __next__(self):
        line = next(self.fd)
        self.size += len(line)
        return line.decode(self.encoding)   # returns a decoded line (a true Python 3 string)
    def __iter__(self):
        return self

Your code would become:

file_size = os.stat(path).st_size
with open(path, "rb") as infile:
    szrdr = SizedReader(infile)
    reader = csv.reader(szrdr)
    for row in reader:
        pos = szrdr.size  # gives position at end of current line
        print("At byte {} of {}".format(pos, file_size))

The good news here is that you keep all the power of the csv module, including newlines inside quoted fields...
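To make the quoted-newline point concrete, here is a minimal self-contained sketch. It repeats the wrapper so it runs on its own; the helper name `read_with_progress` and the sample data are made up for illustration, and UTF-8 is an assumed encoding.

```python
import csv
import os
import tempfile


class SizedReader:
    """Counts raw bytes while handing decoded lines to the csv module."""

    def __init__(self, fd, encoding="utf-8"):
        self.fd = fd
        self.size = 0
        self.encoding = encoding

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.fd)          # bytes, line terminator included
        self.size += len(line)
        return line.decode(self.encoding)


def read_with_progress(path):
    """Hypothetical helper: return the rows and the byte position after each row."""
    rows, positions = [], []
    with open(path, "rb") as infile:
        counter = SizedReader(infile)
        for row in csv.reader(counter):
            rows.append(row)
            positions.append(counter.size)
    return rows, positions


# Made-up sample: the second record has a line break inside a quoted field.
sample = 'id,text\r\n1,"two\r\nlines"\r\n2,plain\r\n'.encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample)

rows, positions = read_with_progress(tmp.name)
file_size = os.path.getsize(tmp.name)
os.remove(tmp.name)

for row, pos in zip(rows, positions):
    print("{:5.1f}%  {}".format(100.0 * pos / file_size, row))
```

The csv module pulls two physical lines for the second record, so the counter jumps by both lines at once; the position reported is always the end of the last line consumed.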

Answer 2

If you are comfortable working without the csv module, you can do something like this:

import os

file_size = os.path.getsize('SampleCSV.csv')
pos = 0

with open('SampleCSV.csv', "rb") as infile:    # binary mode, so len() counts bytes
    for raw in infile:
        pos += len(raw)    # raw already includes its line terminator
        row = raw.decode('utf-8').rstrip('\r\n').split(',')    # assumes UTF-8 data
        print("At byte {} of {}".format(pos, file_size))

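But this may not work when the fields themselves contain quoted text (").

For example: 1,"Hey, you..",22:04. Such cases can also be handled with regular expressions, though.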
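A short sketch of why the example line is a problem for a plain split. For comparison it parses the same line with csv.reader rather than a regular expression; that is a swapped-in technique, not the one this answer proposes.

```python
import csv

line = '1,"Hey, you..",22:04'

# A plain split breaks the quoted field at its embedded comma.
naive = line.split(",")

# csv.reader accepts any iterable of lines, so a one-element list is enough.
parsed = next(csv.reader([line]))

print(naive)
print(parsed)
```

The split yields four pieces with stray quote characters, while csv.reader returns the three intended fields.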

Answer 3

Since your CSV file is so large, there is another solution, taken from the page you mentioned:

Use offset += len(line) instead of file.tell(). For example,

offset = 0
with open(path, mode) as file:
    for line in file:
        offset += len(line)

3 solutions

Solution 1
6 accepted 2017-09-25 15:09:15

Solution 2
0 2017-09-25 13:52:50

Solution 3
0 2021-01-14 07:36:25
