
How to efficiently read a large file with a custom newline character using Python?

We have a huge .csv file, but it doesn't really seem to be a CSV.

The line endings are \tl\n.
The text between these terminators sometimes contains "real" newline characters, and we don't want to split on those.
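For illustration only (the field contents here are invented), a single logical record might look like the following Python string literal, where the embedded \n must not end the record but the trailing \tl\n must:

    record = "some_id\tsome title\tan abstract that\ncontinues on a second line\tl\n"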

We currently do it with awk:

import subprocess

awk_code = r'BEGIN{ RS="""(\tl\n)"""; FS="\t"} { print "\42"$1"\42,\42"$2"\42,\42\42\42"$3"\42\42\42,\n";}'
bash_command_awk = f"awk '{awk_code}' {input_file_path} > {output_path}"
awk_command_output = subprocess.check_output(bash_command_awk, stderr=subprocess.STDOUT, shell=True)

I'm trying to find an efficient way of doing this directly in Python and tried passing a custom newline to the open() call:

import csv

def process_without_putting_file_in_RAM(file_to_process):
    with file_to_process.open(encoding="utf-8", newline="\tl\n") as csv_file:
        for line in csv.reader(csv_file):
            ...  # process each row

However, I quickly learned that the newline argument only accepts one of the standard values (None, '', '\n', '\r', or '\r\n').
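A minimal way to see the restriction (the filename here is hypothetical):

    open("data.csv", encoding="utf-8", newline="\tl\n")
    # raises ValueError: only None, '', '\n', '\r' and '\r\n' are accepted for newline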

How can I efficiently process this file with its unusual line ending?

Here's a function which correctly handles a multi-character newline that spans chunk boundaries:

def line_splitter(file, newline, chunk_size=4096):
    tail = ''  # partial line carried over from the previous chunk
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            if tail:
                yield tail
            break
        lines = (tail + chunk).split(newline)
        tail = lines.pop(0)
        if lines:
            # at least one full newline was found: emit the completed line and
            # keep the last (possibly partial) piece as the new tail
            yield tail
            tail = lines.pop()
            yield from lines
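A quick sanity check of the generator above, using io.StringIO as a stand-in for the real file (the sample values are invented); a small chunk_size forces the terminator to straddle chunk boundaries:

    import io

    sample = "id1\ttitle one\tshort abstract\tl\nid2\ttitle two\ttext with a\nreal newline\tl\n"
    for record in line_splitter(io.StringIO(sample), "\tl\n", chunk_size=8):
        print(repr(record))
    # 'id1\ttitle one\tshort abstract'
    # 'id2\ttitle two\ttext with a\nreal newline'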

Here is another version which, although it avoids copying whole chunks, did not prove faster; it is only marginally faster for large chunks. Do not use a chunk_size smaller than the newline length :)

def line_splitter(file, newline, chunk_size=4096):
    tail = ''  # partial line carried over from the previous chunk
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            if tail:
                yield tail
            break
        lines = chunk.split(newline)
        # the newline may straddle the chunk boundary, so re-split the
        # carried-over tail joined with the first piece of this chunk
        tail = (tail + lines[0]).split(newline)
        if len(tail) > 1:
            lines[0] = tail[1]
        else:
            del lines[0]
        tail = tail[0]
        if lines:
            yield tail
            tail = lines.pop()
            yield from lines

The caller should look like this:

with longabstract_file.open() as f:
    for line in line_splitter(f, "\tl\n"):
        if line: # ignore blank lines
            print(line)
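Since the awk program in the question uses FS="\t", each yielded record can then be split into fields the same way (a sketch, reusing the question's file handle name):

    with longabstract_file.open(encoding="utf-8") as f:
        for line in line_splitter(f, "\tl\n"):
            if not line:
                continue  # skip blank records
            fields = line.split("\t")  # tab-separated fields, as in the awk FS="\t"
            print(fields)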

Assuming your CSV is comma- or space-delimited rather than tab-delimited, what you were looking for is the lineterminator flag, but there's no need for it, since '\n' is automatically assumed to be a line break. From the docs:

Note: The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.

So what you can do is use the string method .replace() to get rid of '\tl', like this:

import csv

def process_without_putting_file_in_RAM(file_to_process):
    with file_to_process.open(encoding="utf-8") as csv_file:
        for line in csv.reader(csv_file, delimiter=","):
            print(line[-1].replace('\tl', ''))

Why not use pandas? Specifically, pandas.read_csv with the lineterminator and chunksize parameters:

import pandas as pd

batch_size = 10000
new_line_str = '\tl\n'

iterator_df = pd.read_csv(file_to_process, chunksize=batch_size, lineterminator=new_line_str)
for chunk in iterator_df:
    pass  # process chunk here
