
Leaking memory parsing TSV and writing CSV in Python

I'm writing a simple script in Python as a learning exercise. I have a TSV file I've downloaded from the Ohio Board of Elections, and I want to manipulate some of the data and write out a CSV file for import into another system.

My issue is that it's leaking memory like a sieve. On a single run of a 154MB TSV file it consumes 2GB of memory before I stop it.

The code is below. Can someone please help me identify what I'm missing with Python?

import csv
import datetime
import re

def formatAddress(row):
    address = ''
    if str(row['RES_HOUSE']).strip():
        address += str(row['RES_HOUSE']).strip()
    if str(row['RES_FRAC']).strip():
        address += '-' + str(row['RES_FRAC']).strip()
    if str(row['RES STREET']).strip():
        address += ' ' + str(row['RES STREET']).strip()
    if str(row['RES_APT']).strip():
        address += ' APT ' + str(row['RES_APT']).strip()
    return address

vote_type_map = {
    'G': 'General',
    'P': 'Primary',
    'L': 'Special'
}

def formatRow(row, fieldnames):
    basic_dict = {
        'Voter ID': str(row['VOTER ID']).strip(),
        'Date Registered': str(row['REGISTERED']).strip(),
        'First Name': str(row['FIRSTNAME']).strip(),
        'Last Name': str(row['LASTNAME']).strip(),
        'Middle Initial': str(row['MIDDLE']).strip(),
        'Name Suffix': str(row['SUFFIX']).strip(),
        'Voter Status': str(row['STATUS']).strip(),
        'Current Party Affiliation': str(row['PARTY']).strip(),
        'Year Born': str(row['DATE OF BIRTH']).strip(),
        #'Voter Address': formatAddress(row),
        'Voter Address': formatAddress({'RES_HOUSE': row['RES_HOUSE'], 'RES_FRAC': row['RES_FRAC'], 'RES STREET': row['RES STREET'], 'RES_APT': row['RES_APT']}),
        'City': str(row['RES_CITY']).strip(),
        'State': str(row['RES_STATE']).strip(),
        'Zip Code': str(row['RES_ZIP']).strip(),
        'Precinct': str(row['PRECINCT']).strip(),
        'Precinct Split': str(row['PRECINCT SPLIT']).strip(),
        'State House District': str(row['HOUSE']).strip(),
        'State Senate District': str(row['SENATE']).strip(),
        'Federal Congressional District': str(row['CONGRESSIONAL']).strip(),
        'City or Village Code': str(row['CITY OR VILLAGE']).strip(),
        'Township': str(row['TOWNSHIP']).strip(),
        'School District': str(row['SCHOOL']).strip(),
        'Fire': str(row['FIRE']).strip(),
        'Police': str(row['POLICE']).strip(),
        'Park': str(row['PARK']).strip(),
        'Road': str(row['ROAD']).strip()
    }

    for field in fieldnames:
        m = re.search(r'(\d{2})(\d{4})-([GPL])', field)
        if m:
            vote_type = vote_type_map.get(m.group(3), 'Other')
            #print { 'k1': m.group(1), 'k2': m.group(2), 'k3': m.group(3)}
            d = datetime.date(year=int(m.group(2)), month=int(m.group(1)), day=1)
            csv_label = d.strftime('%B %Y') + ' ' + vote_type + ' Ballot Requested'
            d = None
            basic_dict[csv_label] = row[field]
        m = None

    return basic_dict

output_rows = []
output_fields = []
with open('data.tsv', 'r') as f:
    r = csv.DictReader(f, delimiter='\t')
    #f.seek(0)
    fieldnames = r.fieldnames
    for row in r:
        output_rows.append(formatRow(row, fieldnames))
f.close()

if output_rows:
    output_fields = sorted(output_rows[0].keys())
    with open('data_out.csv', 'wb') as f:
        w = csv.DictWriter(f, output_fields, quotechar='"')
        w.writeheader()
        for row in output_rows:
            w.writerow(row)
    f.close()

You are accumulating all the data into a huge list, output_rows. You need to process each row as you read it, instead of saving all of them into a memory-expensive Python list.

with open('data.tsv', 'r') as fin, open('data_out.csv', 'w', newline='') as fout:
    reader = csv.DictReader(fin, delimiter='\t')
    firstrow = next(reader)
    fieldnames = reader.fieldnames
    basic_dict = formatRow(firstrow, fieldnames)
    output_fields = sorted(basic_dict.keys())
    writer = csv.DictWriter(fout, output_fields, quotechar='"')
    writer.writeheader()
    writer.writerow(basic_dict)
    for row in reader:
        basic_dict = formatRow(row, fieldnames)
        writer.writerow(basic_dict)

You're not leaking any memory; you're just using a ton of memory.

You're turning each line of text into a dict of Python strings, which takes considerably more memory than a single string. For full details, see Why does my 100MB file take 1GB of memory?
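You can see the per-row overhead directly with sys.getsizeof. A minimal sketch, on invented field names rather than the real Ohio export:

```python
import csv
import io
import sys

# Made-up record, just to illustrate: compare the raw text of one line
# with the DictReader row built from it.
tsv_text = "VOTER ID\tFIRSTNAME\tLASTNAME\n12345\tJANE\tDOE\n"
reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
row = dict(next(reader))

raw_size = sys.getsizeof(tsv_text)
# Rough per-row cost: the dict itself plus every key and value string.
dict_size = sys.getsizeof(row) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in row.items()
)

print(raw_size, dict_size)  # the dict side is several times larger
```

Multiply that ratio by roughly a million voter rows, each held in a list, and 154MB of input ballooning past 2GB is not surprising.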

The solution is to do this iteratively. You don't actually need the whole list, because you never refer back to any previous values. So:

with open('data.tsv', 'r') as fin, open('data_out.csv', 'w', newline='') as fout:
    r = csv.DictReader(fin, delimiter='\t')
    output_fields = sorted(r.fieldnames)
    w = csv.DictWriter(fout, output_fields, quotechar='"')
    w.writeheader()
    for row in r:
        w.writerow(formatRow(row, r.fieldnames))

Or, even more simply:

    w.writerows(formatRow(row, r.fieldnames) for row in r)

Of course this is slightly different from your original code in that it creates the output file even if the input file is empty. You can fix that pretty easily if it's important:

with open('data.tsv', 'r') as fin:
    r = csv.DictReader(fin, delimiter='\t')
    first_row = next(r, None)
    if first_row:
        with open('data_out.csv', 'w', newline='') as fout:
            output_fields = sorted(r.fieldnames)
            w = csv.DictWriter(fout, output_fields, quotechar='"')
            w.writeheader()
            w.writerow(formatRow(first_row, r.fieldnames))
            for row in r:
                w.writerow(formatRow(row, r.fieldnames))

Maybe this helps someone with a similar problem.

While reading a plain CSV file line by line and deciding by a field whether each row should be saved to file A or file B, a memory overflow occurred and my kernel died. I therefore analyzed my memory usage, and this small change 1. made the iteration roughly three times faster and 2. fixed the problem with the memory leakage.

This was my code, with memory leakage and a long runtime:

with open('input_file.csv', 'r') as input_file, open('file_A.csv', 'w', newline='') as file_A, open('file_B.csv', 'w', newline='') as file_B:
    input_csv = csv.reader(input_file)
    file_A_csv = csv.writer(file_A)
    file_B_csv = csv.writer(file_B)
    for row in input_csv:
        condition_row = row[1]
        if condition_row == 'condition':
            file_A_csv.writerow(row)
        else:
            file_B_csv.writerow(row)

BUT if you don't assign the field (or other values from the row you are reading) to a variable beforehand, like this:

with open('input_file.csv', 'r') as input_file, open('file_A.csv', 'w', newline='') as file_A, open('file_B.csv', 'w', newline='') as file_B:
    input_csv = csv.reader(input_file)
    file_A_csv = csv.writer(file_A)
    file_B_csv = csv.writer(file_B)
    for row in input_csv:
        if row[1] == 'condition':
            file_A_csv.writerow(row)
        else:
            file_B_csv.writerow(row)

I cannot explain why this is so, but after some tests I could determine that I am on average 3 times as fast and my RAM usage stays close to zero.
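If you want to measure rather than guess, the standard tracemalloc module can show the peak-allocation difference between accumulating all rows and streaming them one at a time, which is the main theme of this thread. A rough sketch, on made-up data (the column names and row count here are invented, not from any of the files above):

```python
import csv
import io
import tracemalloc

# Made-up input, purely to make the comparison runnable.
tsv_data = "ID\tVAL\n" + "".join(f"{i}\tx\n" for i in range(50000))

def accumulate():
    # Build the whole list of row dicts, like the original script.
    return [dict(row) for row in csv.DictReader(io.StringIO(tsv_data), delimiter='\t')]

def stream():
    # Write each row out as soon as it is read; only one row is alive at a time.
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.DictReader(io.StringIO(tsv_data), delimiter='\t'):
        writer.writerow([row['ID'], row['VAL']])

tracemalloc.start()
accumulate()
peak_accumulate = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

tracemalloc.start()
stream()
peak_stream = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

# The accumulating version peaks far higher than the streaming one.
print(peak_accumulate, peak_stream)
```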

The technical posts on this site follow the CC BY-SA 4.0 license; please credit the original source when reposting.