
Large file reading problems

I'm trying to read in a 13 GB CSV file using the following code:

import pandas as pd

chunks = pd.read_csv('filename.csv', chunksize=10000000)
df = pd.DataFrame()
%time df = pd.concat(chunks, ignore_index=True)

I have played with values of the chunksize parameter from 10 ** 3 to 10 ** 7, but every time I receive a MemoryError. The CSV file has about 3.3 million rows and 1900 columns.

I clearly see that I have 30+ GB of memory available before I start reading the file, but I'm still getting the MemoryError. How do I fix this?

Chunking does nothing if you want to read everything in the file into memory anyway. The whole purpose of chunking is to pre-process each chunk so that you then only work with the data you are interested in (possibly writing the processed chunk to disk), as in the sketch below. In addition, your chunk size of 10,000,000 is larger than the roughly 3.3 million rows in your data, meaning that you are reading the whole file in one go anyhow.
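
For illustration, here is a minimal sketch of that chunk-wise pattern, assuming you only need a subset of columns and rows; cols_of_interest, col_a and the filter condition are placeholders, not names from your file:

import pandas as pd

# Placeholder column names: adjust to your data.
cols_of_interest = ['col_a', 'col_b', 'col_c']

filtered_parts = []
for chunk in pd.read_csv('filename.csv', usecols=cols_of_interest, chunksize=100000):
    # Keep only the rows you actually need from each chunk...
    part = chunk[chunk['col_a'] > 0]
    filtered_parts.append(part)
    # ...or append the processed chunk to disk instead of holding it in memory:
    # part.to_csv('filtered.csv', mode='a', header=False, index=False)

df = pd.concat(filtered_parts, ignore_index=True)

That way only the reduced data ever has to fit in memory at once.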

As suggested by @MaxU, try sparse data frames, and also use a smaller chunk size (e.g. 100k):

chunks = pd.read_csv('filename.csv', chunksize=100000)  # add nrows=200000 to test on a subset first, given the file size
df = pd.concat([chunk.to_sparse(fill_value=0) for chunk in chunks])
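
Note that DataFrame.to_sparse was removed in pandas 1.0. On newer pandas versions, a rough equivalent, assuming the columns are numeric and mostly zeros, is to cast each chunk to a sparse dtype instead:

import pandas as pd

sparse_chunks = []
for chunk in pd.read_csv('filename.csv', chunksize=100000):
    # Zeros are stored implicitly; only non-fill values stay in memory.
    sparse_chunks.append(chunk.astype(pd.SparseDtype('float', fill_value=0)))

df = pd.concat(sparse_chunks, ignore_index=True)

Sparse storage only helps if a large share of the values actually equal the fill value.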

You may also want to consider something like GraphLab Create, which uses SFrames (not limited by RAM).
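
As a rough sketch of that route, assuming GraphLab Create is installed (the column name below is a placeholder):

import graphlab as gl

# SFrames are disk-backed, so the full 13 GB file does not have to fit in RAM.
sf = gl.SFrame.read_csv('filename.csv')

# Filter or aggregate out of core, e.g. on a hypothetical column...
subset = sf[sf['col_a'] > 0]

# ...and only convert the smaller result to a pandas DataFrame if needed.
df = subset.to_dataframe()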
