
Python CSV parsing fills up memory

I have a CSV file which has over a million rows, and I am trying to parse this file and insert the rows into the DB.

    with open(file, "rb") as csvfile:
        re = csv.DictReader(csvfile)
        for row in re:
            # insert row['column_name'] into DB

For CSV files below 2 MB this works well, but anything larger ends up eating my memory. It is probably because I store the DictReader's contents in a list called "re" and it is not able to loop over such a huge list. I definitely need to access the CSV file by its column names, which is why I chose DictReader, since it easily provides column-level access to my CSV files. Can anyone tell me why this is happening and how it can be avoided?

The DictReader does not load the whole file into memory but reads it in chunks, as explained in this answer suggested by DhruvPathak.
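As a small illustration of that lazy behavior, the sketch below iterates a `DictReader` row by row (using an in-memory `io.StringIO` and made-up column names in place of the real file): only the current row is materialized at any time, unless you collect the rows into a list yourself.

```python
import csv
import io

# A tiny in-memory CSV stands in for the million-row file.
data = io.StringIO("id,column_name\n1,a\n2,b\n3,c\n")

reader = csv.DictReader(data)      # lazy: parses one row per iteration
values = []
for row in reader:                 # only the current row is in memory
    values.append(row["column_name"])

print(values)                      # ['a', 'b', 'c']
```

Collecting every row into `values` here is only for demonstration; in the real loop you would insert each row into the DB and let it go out of scope.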

But depending on your database engine, the actual write to disk may only happen at commit. That means the database (and not the CSV reader) keeps all the data in memory and eventually exhausts it.

So you should try to commit every n records, with n typically between 10 and 1000 depending on the size of your rows and the available memory.
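A minimal sketch of that batching, assuming an in-memory SQLite database and a hypothetical single-column table (swap in your own engine, schema, and a realistic batch size):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")             # stand-in for the real DB
conn.execute("CREATE TABLE items (name TEXT)")

data = io.StringIO("column_name\na\nb\nc\nd\ne\n")  # stand-in for the file
reader = csv.DictReader(data)

batch_size = 2                                 # use 10-1000 in practice
for i, row in enumerate(reader, start=1):
    conn.execute("INSERT INTO items (name) VALUES (?)",
                 (row["column_name"],))
    if i % batch_size == 0:                    # flush the transaction
        conn.commit()                          # every batch_size rows
conn.commit()                                  # commit the final partial batch

count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(count)                                   # 5
```

The final `commit()` after the loop matters: without it, the last partial batch would be lost (or held open) when the connection closes.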

If you don't need all the columns at once, you can simply read the file line by line, as you would with a text file, and parse each row yourself. The exact parsing depends on your data format, but you could do something like:

delimiter = ','
with open(filename, 'r') as fil:
    headers = next(fil)                 # read the header line first
    headers = headers.strip().split(delimiter)
    dic_headers = {hdr: headers.index(hdr) for hdr in headers}
    for line in fil:
        row = line.strip().split(delimiter)
        ## do something with row[dic_headers['column_name']]

This is a very simple example, but it can be made more elaborate. For example, it does not work if your data contains fields with an embedded `,`, since a plain split cannot tell a quoted delimiter from a real separator.
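The quoted-delimiter pitfall can be shown in a few lines: on a field like `"hello, world"`, a naive `str.split` breaks the field apart, while the `csv` module parses it correctly (which is exactly why `DictReader` is the safer choice for such data).

```python
import csv
import io

# One data line whose second field contains the delimiter, so it is quoted.
line = 'alice,"hello, world"'

naive = line.split(",")                    # splits inside the quotes
parsed = next(csv.reader(io.StringIO(line)))  # respects the quoting

print(naive)    # ['alice', '"hello', ' world"']
print(parsed)   # ['alice', 'hello, world']
```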
