
Loading a .csv file in Python consuming too much memory

I'm loading a 4 GB .csv file in Python. Since it's only 4 GB, I expected it would be fine to load it all at once, but after a while my 32 GB of RAM gets completely filled.

Am I doing something wrong? Why does a 4 GB file take up so much more space in RAM?

Is there a faster way of loading this data?

import csv
import numpy as np

fname = r"E:\Data\Data.csv"   # raw string so backslashes aren't treated as escapes
a = []
with open(fname, newline="") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    cont = 0
    for row in reader:
        cont = cont + 1
        print(cont)
        a.append(row)
b = np.asarray(a)

You copied the entire content of the csv at least twice.

Once in a and again in b.

Any additional work on that data consumes yet more memory to hold intermediate values and the like.
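To see why the blow-up happens, compare the on-disk size of a CSV line with the same line stored the way csv.reader stores it, as a list of Python string objects. This is a rough sketch with a hypothetical line; exact byte counts vary by Python version, but the per-object overhead is always several times the raw data:

```python
import sys

line = "1.5,2.5,3.5,4.5,5.5"        # 19 bytes of raw text on disk
row = line.split(",")                # how csv.reader hands you the row

# The list object plus five separate str objects each carry their own
# object header, so the in-memory footprint is a multiple of the file size.
total = sys.getsizeof(row) + sum(sys.getsizeof(s) for s in row)
print(total, "bytes in memory for", len(line), "bytes on disk")
```

Multiply that overhead by millions of rows and a 4 GB file can easily exceed 32 GB of RAM.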

You could del a once you have b, but note that the pandas library provides a read_csv function and a way to count the rows of the resulting DataFrame.
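A minimal sketch of that pandas route. The tiny in-memory CSV here is a hypothetical stand-in for the real 4 GB file; the point is that read_csv parses straight into typed columns, which is far more compact than nested lists of Python strings:

```python
import io
import pandas as pd

# Hypothetical stand-in for open(r"E:\Data\Data.csv").
csv_text = "x,y\n1,2\n3,4\n5,6\n"

# read_csv parses the file into typed numeric columns in one call.
df = pd.read_csv(io.StringIO(csv_text))

# Counting rows is just the length of the DataFrame.
print(len(df))
```

For files that still don't fit, read_csv also accepts a chunksize argument so you can iterate over the file one chunk at a time.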

You should be able to do

a = list(reader)

which might be a little better.

Because it's Python :-D One of the easiest approaches: create your own class for rows that stores its data in __slots__, which can save a couple of hundred bytes per row (the reader puts the fields into a dict, which is fairly large even when empty)... and if you go further, you could try storing a binary representation of the data.
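A sketch of the __slots__ idea with a hypothetical three-field row class. A slotted class drops the per-instance __dict__, which is where most of the per-row overhead lives:

```python
import sys

class PlainRow:
    """Ordinary class: every instance carries its own __dict__."""
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class SlotRow:
    """__slots__ replaces the per-instance dict with fixed storage."""
    __slots__ = ("x", "y", "z")
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

plain = PlainRow(1.0, 2.0, 3.0)
slim = SlotRow(1.0, 2.0, 3.0)

# The slotted instance has no __dict__ at all,
# and is smaller than the plain object plus its dict.
print(hasattr(slim, "__dict__"))
print(sys.getsizeof(slim) < sys.getsizeof(plain) + sys.getsizeof(plain.__dict__))
```

Assigning any attribute not listed in __slots__ raises AttributeError, so the class also guards against typos.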

But maybe you can process the data without storing the entire data array? That would consume significantly less memory.
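A minimal sketch of that streaming idea using only the standard library. The aggregation shown, summing the first column, is a hypothetical stand-in for whatever per-row work is actually needed, and the in-memory buffer stands in for the real file handle:

```python
import csv
import io

def stream_rows(csvfile):
    """Yield rows one at a time; only one row is in memory at once."""
    for row in csv.reader(csvfile):
        yield row

# Hypothetical stand-in for open(r"E:\Data\Data.csv", newline="").
data = io.StringIO("1,2\n3,4\n5,6\n")

total = 0
for row in stream_rows(data):
    total += int(row[0])   # do per-row work instead of storing everything

print(total)   # 1 + 3 + 5 = 9
```

Peak memory stays at a single row regardless of file size, so this scales to files far larger than RAM.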

