Loading large CSV files in pandas

Question

I am trying to load CSV files into a pandas DataFrame. However, Python uses a very large amount of memory while loading the files. For example, the CSV file is 289 MB, but memory usage climbs to around 1700 MB while I am loading it, at which point the system raises a memory error. I have also tried setting a chunk size, but the problem persists. Can anyone show me a way forward?

Answer 1

OK, first things first: do not confuse disk size and memory size. A CSV is, at its core, a plain text file, whereas a pandas DataFrame is a complex object loaded in memory. That said, I can't say anything about your particular case, since I don't know what is in your CSV. So instead I'll give you an example with a CSV on my computer that has a similar size:

-rw-rw-r--  1 alex users 341M Jan 12  2017 cpromo_2017_01_12_rec.csv

Now reading the CSV:

>>> import pandas as pd
>>> df = pd.read_csv('cpromo_2017_01_12_rec.csv')
sys:1: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
>>> df.memory_usage(deep=True).sum() / 1024**2
1474.4243307113647
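
To see the same effect without a large file, here is a minimal sketch that compares the size of some CSV text with the memory pandas reports for the resulting DataFrame. The data (columns `id` and `city`) is made up for illustration and is not from the file above; the exact numbers will depend on your pandas version and on how it stores text columns.

```python
import io

import pandas as pd

# Hypothetical sample data: an integer id and a short, repetitive text column.
csv_text = "id,city\n" + "\n".join(
    f"{i},{'London' if i % 2 else 'Paris'}" for i in range(10_000)
)

df = pd.read_csv(io.StringIO(csv_text))

# Size of the raw text versus the size of the parsed DataFrame in memory.
text_bytes = len(csv_text.encode("utf-8"))
mem_bytes = int(df.memory_usage(deep=True).sum())

print(f"CSV text size:  {text_bytes / 1024:.1f} KiB")
print(f"DataFrame size: {mem_bytes / 1024:.1f} KiB")

# Per-column breakdown; deep=True also counts the contents of text columns.
print(df.memory_usage(deep=True))
```

A short integer that takes a few characters on disk becomes a fixed 8-byte value by default, and each text value carries extra per-value overhead, so the in-memory size ends up larger than the text.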

Pandas will attempt to optimize it as much as it can, but it can't do the impossible. If you are low on memory, this answer is a good place to start. Alternatively, you could try dask, but I think that's too much work for a small CSV.
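
If memory is tight, one common first step is to tell `read_csv` which columns you need and which types to use, rather than letting it infer everything. The sketch below is only an illustration on made-up data; the column names `id`, `city` and `price` and the chosen dtypes are assumptions, and whether `float32` or `category` is appropriate depends on your actual data.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real file.
csv_text = "id,city,price\n" + "\n".join(
    f"{i},{'London' if i % 2 else 'Paris'},{i * 0.5}" for i in range(10_000)
)

# Default load: all columns, inferred dtypes.
baseline = pd.read_csv(io.StringIO(csv_text))

# Slimmer load: skip the unused 'id' column, store the repetitive text
# column as a categorical, and use a 32-bit float for the prices.
slim = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["city", "price"],
    dtype={"city": "category", "price": "float32"},
)

baseline_bytes = int(baseline.memory_usage(deep=True).sum())
slim_bytes = int(slim.memory_usage(deep=True).sum())

print(f"default dtypes: {baseline_bytes / 1024:.1f} KiB")
print(f"explicit dtypes: {slim_bytes / 1024:.1f} KiB")
```

Passing `dtype` explicitly should also avoid the mixed-types warning for the columns you specify, since pandas no longer has to guess their type.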

Answer 2

You can use the dask library, e.g.:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')

Answer 3

Try it like this: 1) load with dask, then 2) convert to pandas.

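import time

import dask.dataframe as dd

t = time.perf_counter()  # time.clock() was removed in Python 3.8
# 'col1' and 'col2' are placeholders: use the column names you actually need.
df_train = dd.read_csv('../data/train.csv', usecols=['col1', 'col2'])
df_train = df_train.compute()  # materialize the result as a pandas DataFrame
print("load train:", time.perf_counter() - t)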
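
Keep in mind that `compute()` returns an ordinary pandas DataFrame, so the selected columns still have to fit in memory. If they don't, a pandas-only alternative is to process the file in chunks and keep just a running result instead of the full table. This is a rough sketch on made-up data (the `price` column and the chunk size of 1,000 rows are assumptions); note that concatenating all the chunks back together would bring the memory problem right back.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large file on disk.
csv_text = "id,city,price\n" + "\n".join(
    f"{i},{'London' if i % 2 else 'Paris'},{i}" for i in range(10_000)
)

total = 0
rows = 0

# With chunksize set, read_csv yields DataFrames of at most 1,000 rows each,
# so only one chunk needs to be held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["price"], chunksize=1_000):
    total += int(chunk["price"].sum())
    rows += len(chunk)

print("rows:", rows)
print("sum of price:", total)
```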

3 answers

Answer 1
0 2018-03-19 09:46:21

Answer 2
0 2018-04-17 11:27:40

Answer 3
0 2018-05-31 12:47:25
