
Most efficient way to convert large .txt files (size >30GB) into .csv after pre-processing using Python

I have data in a .txt file that looks like this (let's name it "myfile.txt"):

28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#29212330'~'0'~'Idaho First Bank'~'US#@#@#29278777'~'0'~'Republic Bank of Arizona'~'US#@#@#29633181'~'0'~'Friendly Hills Bank'~'US#@#@#29760145'~'0'~'The Freedom Bank of Virginia'~'US#@#@#100504846'~'0'~'Community First Fund Federal Credit Union'~'US#@#@# 28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#29212330'~'0'~'Idaho First Bank '~'US#@#@#29278777'~'0'~'Republic Bank of Arizona'~'US#@#@#29633181'~'0'~'Friendly Hills Bank'~'US#@#@# 29760145'~'0'~'The Freedom Bank of Virginia'~'US#@#@#100504846'~'0'~'Community First Fund Federal Credit Union'~'US#@#@#

I have tried a couple of ways to convert this .txt into a .csv; one of them was using the csv library, but since I like pandas a lot, I used the following:

import pandas as pd
import time
  
#time at the start of program is noted
start = time.time()

# We set the path where our file is located and read it
path = r'myfile.txt'
f =  open(path, 'r')
content = f.read()
# We replace undesired strings and introduce a breakline.
content_filtered = content.replace("#@#@#", "\n").replace("'", "")
# We read everything in columns with the separator "~" 
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')], columns = ['a', 'b', 'c', 'd'])
# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index = None)
end = time.time()
  
#total time taken to print the file
print("Execution time in seconds: ",(end - start))

This takes about 35 seconds to process a 300MB file, and I can accept that kind of performance, but when I try to do the same for a much larger file (35GB in size) it produces a MemoryError.

I tried using the csv library, but the results were similar. I attempted putting everything into a list and afterwards writing it out to a CSV:

import csv
# path and split_content come from the same preprocessing as above
# (split_content being the list of rows, i.e. [x.split('~') for x in content_filtered.split('\n')])
# We write to CSV
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)

Results were similar, not a huge improvement. Is there a way or methodology I can use to convert VERY large .txt files into .csv, likely above 35GB?

I'd be happy to read any suggestions you may have, thanks in advance!
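One bounded-memory route that stays close to the pandas workflow above is to split records incrementally and append to the CSV in batches. The following is only a sketch, assuming the record separator #@#@# and field separator '~' are fixed and that every record has exactly four fields; iter_records, convert, and the batch size are illustrative choices, not from the original post:

import pandas as pd

RECORD_SEP = "#@#@#"
COLUMNS = ["a", "b", "c", "d"]

def iter_records(path, chunk_size=1 << 20):
    # Yield one parsed record at a time; only ~chunk_size of text is buffered.
    buf = ""
    with open(path, "r") as f:
        while (chunk := f.read(chunk_size)):
            buf += chunk
            records = buf.split(RECORD_SEP)
            buf = records.pop()  # the last piece may be a truncated record
            for rec in records:
                rec = rec.replace("'", "").strip()
                if rec:
                    yield rec.split("~")
    rec = buf.replace("'", "").strip()
    if rec:
        yield rec.split("~")

def convert(path_in, path_out, batch_size=100_000):
    # Append to the CSV in batches so peak memory stays bounded by the batch size.
    first, batch = True, []
    for row in iter_records(path_in):
        batch.append(row)
        if len(batch) >= batch_size:
            pd.DataFrame(batch, columns=COLUMNS).to_csv(
                path_out, mode="w" if first else "a", header=first, index=False)
            first, batch = False, []
    if batch:
        pd.DataFrame(batch, columns=COLUMNS).to_csv(
            path_out, mode="w" if first else "a", header=first, index=False)

convert("myfile.txt", "myfile.csv")

Peak memory is then roughly one read chunk plus one batch of rows, rather than the whole file.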

Since your code just does straight-up replacement, you could read through all the data sequentially and detect the parts that need replacing as you go:

def process(fn_in, fn_out, columns):
    new_line = b'#@#@#'
    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns)+'\n').encode())
        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0
                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)


process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])

That does it pretty quickly. You may be able to improve performance by reading in chunks, but this is pretty quick all the same.

This approach has the advantage over yours that it holds almost nothing in memory, but it does very little to optimise reading the file quickly.

Edit: there was a big mistake in an edge case, which I realised after re-reading; it is fixed now.
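As a rough sketch of the chunked-read idea mentioned above: buffer the input in large blocks and hand the matching loop one byte at a time, so the state machine stays exactly the same while the per-byte read(1) calls go away. iter_bytes is a helper name introduced here purely for illustration, not part of the answer; output writes could be batched in the same spirit.

def iter_bytes(f, chunk_size=1024 * 1024):
    # Read large blocks from an already-open binary file, but yield single
    # bytes so the per-byte matching logic above can be reused unchanged.
    while (chunk := f.read(chunk_size)):
        for i in range(len(chunk)):
            yield chunk[i:i + 1]  # a 1-byte bytes object, same shape as f.read(1)

# Inside process(), the read loop would then become:
#     for b in iter_bytes(f_in):
#         ...
# instead of:
#     while (b := f_in.read(1)):
#         ...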

I took your sample string and made a sample file by multiplying that string by 100 million (something like your_string*1e8 ...) to get a test file that is 31GB.

Following @Grismar's suggestion of chunking, I made the following, which processes that 31GB file in ~2 minutes, with a peak RAM usage that depends on the chunk size.

The complicated part is keeping track of the field and record separators, which are multiple characters long and will certainly end up spanning a chunk boundary, where they get truncated.

My solution is to inspect the end of each chunk and see if it has a partial separator. If it does, that partial is removed from the end of the current chunk, the current chunk is written out, and the partial becomes the beginning of (and should be completed by) the next chunk:

CHUNK_SZ = 1024 * 1024

FS = "'~'"
RS = '#@#@#'

# With chars repeated in the separators, check most specific (least ambiguous)
# to least specific (most ambiguous) to definitively catch a partial with the
# fewest number of checks
PARTIAL_RSES = ['#@#@', '#@#', '#@', '#']
PARTIAL_FSES = ["'~", "'"]
ALL_PARTIALS =  PARTIAL_FSES + PARTIAL_RSES 

f_out = open('out.csv', 'w')
f_out.write('a,b,c,d\n')

f_in = open('my_file.txt')
line = ''
while True:
    # Read chunks till no more, then break out
    chunk = f_in.read(CHUNK_SZ)
    if not chunk:
        break

    # Any previous partial separator, plus new chunk
    line += chunk

    # Check end-of-line for a partial FS or RS; only when separators are more than one char
    final_partial = ''

    if line.endswith(FS) or line.endswith(RS):
        pass  # Write-out will replace complete FS or RS
    else:
        for partial in ALL_PARTIALS:
            if line.endswith(partial):
                final_partial = partial
                line = line[:-len(partial)]
                break

    # Process/write chunk
    f_out.write(line
                .replace(FS, ',')
                .replace(RS, '\n'))

    # Add partial back, to be completed next chunk
    line = final_partial


# Clean up
f_in.close()
f_out.close()

Just to share an alternative way, based on convtools ( table docs | github ). This solution is faster than the OP's, but ~7 times slower than Zach's (Zach works with str chunks, while this one works with row tuples, reading via csv.reader).

Still, this approach may be useful, as it allows you to tap into stream processing and work with columns: rearrange them, add new ones, and so on.

from convtools import conversion as c
from convtools.contrib.fs import split_buffer
from convtools.contrib.tables import Table

def get_rows(filename):
    with open(filename, "r") as f:
        for row in split_buffer(f, "#@#@#"):
            yield row.replace("'", "")

Table.from_csv(
    get_rows("tmp.csv"), dialect=Table.csv_dialect(delimiter="~")
).into_csv("tmp_out.csv", include_header=False)
