
Most efficient way to convert large .txt files (size >30GB) into .csv after pre-processing using Python

I have data in a .txt file that looks like this (let's name it "myfile.txt"):

28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#29212330'~'0'~'Idaho First Bank'~'US#@#@#29278777'~'0'~'Republic Bank of Arizona'~'US#@#@#29633181'~'0'~'Friendly Hills Bank'~'US#@#@#29760145'~'0'~'The Freedom Bank of Virginia'~'US#@#@#100504846'~'0'~'Community First Fund Federal Credit Union'~'US#@#@#

I have tried a couple of ways to convert this .txt into a .csv; one of them was using the csv library, but since I like pandas a lot, I used the following:

import pandas as pd
import time
  
#time at the start of program is noted
start = time.time()

# We set the path where our file is located and read it
path = r'myfile.txt'
f =  open(path, 'r')
content = f.read()
# We replace undesired strings and introduce a breakline.
content_filtered = content.replace("#@#@#", "\n").replace("'", "")
# We read everything in columns with the separator "~" 
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')], columns = ['a', 'b', 'c', 'd'])
# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index = None)
end = time.time()
  
#total time taken to print the file
print("Execution time in seconds: ",(end - start))

This takes about 35 seconds to process a 300MB file, and I can accept that kind of performance, but when I try the same approach on a much larger file (around 35GB), it produces a MemoryError.

I also tried the csv library: I put everything into a list and afterwards wrote it out to a CSV:

import csv
# We write to CSV; here split_content is the list of pre-processed rows
# produced by the replace/split steps above
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)

Results were similar, not a huge improvement. Is there a way or methodology I can use to convert very large .txt files (35GB and above) into .csv?

I'd be happy to read any suggestions you may have, thanks in advance!

Since your code just does straight-up replacement, you can read through all the data sequentially and detect the parts that need replacing as you go:

def process(fn_in, fn_out, columns):
    new_line = b'#@#@#'
    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns)+'\n').encode())
        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0
                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)


process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])

That does it pretty quickly. You may be able to improve performance by reading in chunks, but it is already reasonably fast as is.

This approach has the advantage over yours that it holds almost nothing in memory, but it does very little to optimise how fast the file is read.

Edit: there was a big mistake in an edge case, which I realised after re-reading, fixed now.
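
For reference, one way the chunked reading mentioned above could look is to read fixed-size blocks and cut each buffer at the last complete record separator, so a separator split across two reads is never processed half-way. This is a rough sketch of that idea (my own variant, not the answer's code), assuming the same separators and columns as in the question:

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice

with open('my_file.txt', 'rb') as f_in, open('out.csv', 'wb') as f_out:
    f_out.write(b'a,b,c,d\n')
    tail = b''
    while True:
        data = f_in.read(CHUNK)
        if not data:
            break
        buf = tail + data
        cut = buf.rfind(b'#@#@#')
        if cut == -1:
            # no complete record yet, keep accumulating
            tail = buf
            continue
        # process everything up to and including the last record separator
        body, tail = buf[:cut + 5], buf[cut + 5:]
        f_out.write(body.replace(b'#@#@#', b'\n').replace(b"'~'", b','))
    if tail:
        # leftover bytes after the last separator form a final, unterminated record
        f_out.write(tail.replace(b"'~'", b',') + b'\n')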

I took your sample string and made a sample file by repeating that string 100 million times (something like your_string * 100_000_000) to get a test file that is 31GB.
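
If you want to build such a test file without first holding a ~30GB string in memory, one option is to write it out in batches. A quick sketch, using just the first record from the question as the repeated sample and loop counts you would adjust to hit the target size:

sample = "28807644'~'0'~'Maun FCU'~'US#@#@#"  # one record from the question
batch = sample * 10_000
with open('my_file.txt', 'w') as f:
    for _ in range(100_000):  # adjust the counts to reach the desired file size
        f.write(batch)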

Following @Grismar's suggestion of chunking, I made the following, which processes that 31GB file in ~2 minutes, with peak RAM usage depending on the chunk size.

The complicated part is keeping track of the field and record separators, which are multiple characters long and, somewhere in a file this size, will certainly span a chunk boundary and thus be truncated.

My solution is to inspect the end of each chunk and see if it has a partial separator. If it does, that partial is removed from the end of the current chunk, the current chunk is written out, and the partial becomes the beginning of (and should be completed by) the next chunk:

CHUNK_SZ = 1024 * 1024

FS = "'~'"
RS = '#@#@#'

# With chars repeated in the separators, check most specific (least ambiguous)
# to least specific (most ambiguous) to definitively catch a partial with the
# fewest number of checks
PARTIAL_RSES = ['#@#@', '#@#', '#@', '#']
PARTIAL_FSES = ["'~", "'"]
ALL_PARTIALS =  PARTIAL_FSES + PARTIAL_RSES 

f_out = open('out.csv', 'w')
f_out.write('a,b,c,d\n')

f_in = open('my_file.txt')
line = ''
while True:
    # Read chunks till no more, then break out
    chunk = f_in.read(CHUNK_SZ)
    if not chunk:
        break

    # Any previous partial separator, plus new chunk
    line += chunk

    # Check end-of-line for a partial FS or RS; only when separators are more than one char
    final_partial = ''

    if line.endswith(FS) or line.endswith(RS):
        pass  # Write-out will replace complete FS or RS
    else:
        for partial in ALL_PARTIALS:
            if line.endswith(partial):
                final_partial = partial
                line = line[:-len(partial)]
                break

    # Process/write chunk
    f_out.write(line
                .replace(FS, ',')
                .replace(RS, '\n'))

    # Add partial back, to be completed next chunk
    line = final_partial


# Clean up
f_in.close()
f_out.close()

Just to share an alternative way, based on convtools (table docs | github). This solution is faster than the OP's, but ~7 times slower than Zach's (Zach's works with str chunks, while this one works with row tuples, reading via csv.reader).

Still, this approach may be useful, as it allows you to tap into stream processing and work with columns: rearrange them, add new ones, and so on.

from convtools import conversion as c
from convtools.contrib.fs import split_buffer
from convtools.contrib.tables import Table

def get_rows(filename):
    with open(filename, "r") as f:
        for row in split_buffer(f, "#@#@#"):
            yield row.replace("'", "")

Table.from_csv(
    get_rows("tmp.csv"), dialect=Table.csv_dialect(delimiter="~")
).into_csv("tmp_out.csv", include_header=False)
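
Purely as an illustration of that "work with columns while streaming" idea, here is what reordering columns and adding a derived one could look like with just the standard csv module (a sketch reusing the get_rows generator above, not the convtools API; the column order and the extra column are made up for the example):

import csv

with open("tmp_out2.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["a", "b", "d", "c", "c_upper"])  # reordered, plus a derived column
    for record in csv.reader(get_rows("tmp.csv"), delimiter="~"):
        if not record:
            # guard against a possible empty record from a trailing separator
            continue
        a, b, c_val, d = record  # c_val, to avoid shadowing convtools' c
        writer.writerow([a, b, d, c_val, c_val.upper()])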
