
Leaking memory parsing TSV and writing CSV in Python

I'm writing a simple script in Python as a learning exercise. I have a TSV file I've downloaded from the Ohio Board of Elections, and I want to manipulate some of the data and write out a CSV file for import into another system.

My issue is that it's leaking memory like a sieve. On a single run of a 154MB TSV file it consumes 2GB of memory before I stop it.

The code is below; can someone please help me identify what I'm missing with Python?

import csv
import datetime
import re

def formatAddress(row):
    address = ''
    if str(row['RES_HOUSE']).strip():
        address += str(row['RES_HOUSE']).strip()
    if str(row['RES_FRAC']).strip():
        address += '-' + str(row['RES_FRAC']).strip()
    if str(row['RES STREET']).strip():
        address += ' ' + str(row['RES STREET']).strip()
    if str(row['RES_APT']).strip():
        address += ' APT ' + str(row['RES_APT']).strip()
    return address

vote_type_map = {
    'G': 'General',
    'P': 'Primary',
    'L': 'Special'
}

def formatRow(row, fieldnames):
    basic_dict = {
        'Voter ID': str(row['VOTER ID']).strip(),
        'Date Registered': str(row['REGISTERED']).strip(),
        'First Name': str(row['FIRSTNAME']).strip(),
        'Last Name': str(row['LASTNAME']).strip(),
        'Middle Initial': str(row['MIDDLE']).strip(),
        'Name Suffix': str(row['SUFFIX']).strip(),
        'Voter Status': str(row['STATUS']).strip(),
        'Current Party Affiliation': str(row['PARTY']).strip(),
        'Year Born': str(row['DATE OF BIRTH']).strip(),
        #'Voter Address': formatAddress(row),
        'Voter Address': formatAddress({'RES_HOUSE': row['RES_HOUSE'], 'RES_FRAC': row['RES_FRAC'], 'RES STREET': row['RES STREET'], 'RES_APT': row['RES_APT']}),
        'City': str(row['RES_CITY']).strip(),
        'State': str(row['RES_STATE']).strip(),
        'Zip Code': str(row['RES_ZIP']).strip(),
        'Precinct': str(row['PRECINCT']).strip(),
        'Precinct Split': str(row['PRECINCT SPLIT']).strip(),
        'State House District': str(row['HOUSE']).strip(),
        'State Senate District': str(row['SENATE']).strip(),
        'Federal Congressional District': str(row['CONGRESSIONAL']).strip(),
        'City or Village Code': str(row['CITY OR VILLAGE']).strip(),
        'Township': str(row['TOWNSHIP']).strip(),
        'School District': str(row['SCHOOL']).strip(),
        'Fire': str(row['FIRE']).strip(),
        'Police': str(row['POLICE']).strip(),
        'Park': str(row['PARK']).strip(),
        'Road': str(row['ROAD']).strip()
    }

    for field in fieldnames:
        m = re.search('(\d{2})(\d{4})-([GPL])', field)
        if m:
            vote_type = vote_type_map[m.group(3)] or 'Other'
            #print { 'k1': m.group(1), 'k2': m.group(2), 'k3': m.group(3)}
            d = datetime.date(year=int(m.group(2)), month=int(m.group(1)), day=1)
            csv_label = d.strftime('%B %Y') + ' ' + vote_type + ' Ballot Requested'
            d = None
            basic_dict[csv_label] = row[field]
        m = None

    return basic_dict

output_rows = []
output_fields = []
with open('data.tsv', 'r') as f:
    r = csv.DictReader(f, delimiter='\t')
    #f.seek(0)
    fieldnames = r.fieldnames
    for row in r:
        output_rows.append(formatRow(row, fieldnames))
f.close()

if output_rows:
    output_fields = sorted(output_rows[0].keys())
    with open('data_out.csv', 'wb') as f:
        w = csv.DictWriter(f, output_fields, quotechar='"')
        w.writeheader()
        for row in output_rows:
            w.writerow(row)
    f.close()

You are accumulating all the data into a huge list, output_rows. You need to process each row as you read it, instead of saving all of them into a memory-expensive Python list.

with open('data.tsv', 'rb') as fin, open('data_out.csv', 'wb') as fout:
    reader = csv.DictReader(fin, delimiter='\t')
    fieldnames = reader.fieldnames
    # Use the first row to determine the output columns, then write
    # every row as soon as it is read instead of collecting them.
    firstrow = next(reader)
    basic_dict = formatRow(firstrow, fieldnames)
    output_fields = sorted(basic_dict.keys())
    writer = csv.DictWriter(fout, output_fields, quotechar='"')
    writer.writeheader()
    writer.writerow(basic_dict)
    for row in reader:
        basic_dict = formatRow(row, fieldnames)
        writer.writerow(basic_dict)

You're not leaking any memory, you're just using a ton of memory.

You're turning each line of text into a dict of Python strings, which takes considerably more memory than a single string. For full details, see Why does my 100MB file take 1GB of memory?
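To get a feel for the overhead, here is a minimal sketch (the field names and values are made up, and sys.getsizeof is only a rough measure) comparing one raw TSV line against the per-row dict that DictReader builds from it:

import sys

# One raw TSV line versus the dict of stripped strings built from it.
# Field names and values here are hypothetical, for illustration only.
line = '12345\tJANE\tDOE\tACTIVE\n'
row = {'VOTER ID': '12345', 'FIRSTNAME': 'JANE',
       'LASTNAME': 'DOE', 'STATUS': 'ACTIVE'}

# getsizeof on a dict excludes its keys and values, so add them in.
row_size = sys.getsizeof(row) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in row.items())

print(sys.getsizeof(line), row_size)  # the dict is several times larger

Multiply that per-row blowup by every row of a 154MB file, all held in one list, and you get the memory usage you're seeing.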

The solution is to do this iteratively. You don't actually need the whole list, because you never refer back to any previous values. So:

with open('data.tsv', 'r') as fin, open('data_out.csv', 'w') as fout:
    r = csv.DictReader(fin, delimiter='\t')
    fieldnames = r.fieldnames
    # The output columns come from formatRow, which renames fields,
    # so format the first row to discover them.
    first = formatRow(next(r), fieldnames)
    output_fields = sorted(first.keys())
    w = csv.DictWriter(fout, output_fields, quotechar='"')
    w.writeheader()
    w.writerow(first)
    for row in r:
        w.writerow(formatRow(row, fieldnames))

Or, even more simply:

    w.writerows(formatRow(row, fieldnames) for row in r)

Of course this is slightly different from your original code in that it creates the output file even if the input file is empty. You can fix that pretty easily if it's important:

with open('data.tsv', 'r') as fin:
    r = csv.DictReader(fin, delimiter='\t')
    fieldnames = r.fieldnames
    first_row = next(r, None)
    if first_row is not None:
        with open('data_out.csv', 'w') as fout:
            first = formatRow(first_row, fieldnames)
            output_fields = sorted(first.keys())
            w = csv.DictWriter(fout, output_fields, quotechar='"')
            w.writeheader()
            w.writerow(first)
            for row in r:
                w.writerow(formatRow(row, fieldnames))

Maybe this helps someone with a similar problem.

While reading a plain CSV file line by line and deciding, based on one field, whether each row should be saved to file A or file B, I hit a memory overflow and my kernel died. I therefore analyzed my memory usage, and this small change (1) roughly tripled the iteration speed and (2) fixed the memory problem.

This was my code, with the memory problem and long runtime:

with open('input_file.csv', 'r') as input_file, open('file_A.csv', 'w') as file_A, open('file_B.csv', 'w') as file_B:
    input_csv = csv.reader(input_file)
    file_A_csv = csv.writer(file_A)
    file_B_csv = csv.writer(file_B)
    for row in input_csv:
        condition_row = row[1]
        if condition_row == 'condition':
            file_A_csv.writerow(row)
        else:
            file_B_csv.writerow(row)

BUT if you don't assign the field to a variable first (or several variables from the row you're reading), like this:

with open('input_file.csv', 'r') as input_file, open('file_A.csv', 'w') as file_A, open('file_B.csv', 'w') as file_B:
    input_csv = csv.reader(input_file)
    file_A_csv = csv.writer(file_A)
    file_B_csv = csv.writer(file_B)
    for row in input_csv:
        if row[1] == 'condition':
            file_A_csv.writerow(row)
        else:
            file_B_csv.writerow(row)

I cannot explain why this is so, but after some tests I found that it runs on average about 3 times as fast and my RAM usage stays close to zero.
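If you want to check memory behavior like this yourself, here is a minimal sketch using Python 3's standard tracemalloc module (the file name and the per-row work are placeholders, not part of the posts above):

import csv
import tracemalloc

tracemalloc.start()

with open('input_file.csv', 'r') as f:  # placeholder file name
    for row in csv.reader(f):
        pass  # placeholder: do your per-row processing here

# Report how much memory the script currently holds, and its peak usage.
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print('current: %d bytes, peak: %d bytes' % (current, peak))

If the script really processes one row at a time, peak usage should stay small and roughly constant regardless of file size.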
