
Pandas read_csv() 1.2GB file out of memory on VM with 140GB RAM

I am trying to read a 1.2 GB CSV file which contains 25K records, each consisting of an id and a large string.

However, at around 10K rows, I get this error:

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This seems weird, since the VM has 140 GB of RAM and at 10K rows the memory usage is only at ~1%.

This is the command I use:

pd.read_csv('file.csv', header=None, names=['id', 'text', 'code'])

I also ran the following dummy program, which successfully filled up my memory to close to 100%.

strings = []                     # renamed from `list` to avoid shadowing the built-in
strings.append("hello")
while True:
    strings.append("hello" + strings[-1])   # each new string is 5 characters longer than the last

This sounds like a job for chunksize. It splits the input into chunks, so the parser only needs enough memory for one chunk at a time.

df = pd.DataFrame()
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)
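
If many chunks make the repeated concat slow, a slightly different sketch (assuming the same file and column names) collects the chunks in a list and concatenates once at the end:

import pandas as pd

chunks = []
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    chunks.append(chunk)                      # each chunk is an ordinary DataFrame
df = pd.concat(chunks, ignore_index=True)     # a single concat avoids recopying df on every iteration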

This error can also be caused by an invalid CSV file rather than an actual memory shortage.

I got this error with a file that was much smaller than my available RAM and it turned out that there was an opening double quote on one line without a closing double quote.

In this case, you can check the data, or you can change the quoting behavior of the parser, for example by passing quoting=3 (csv.QUOTE_NONE) to pd.read_csv.
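
A minimal sketch of that workaround, assuming the same file.csv and column names as in the question:

import csv
import pandas as pd

# QUOTE_NONE (the same as quoting=3) makes the parser treat quote characters
# as ordinary text, so an unbalanced double quote can no longer swallow the rest of the file.
df = pd.read_csv('file.csv', header=None, names=['id', 'text', 'code'],
                 quoting=csv.QUOTE_NONE)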

This is weird.

Actually I ran into the same situation.

df_train = pd.read_csv('./train_set.csv')

But then I tried a lot of stuff to solve this error, and it worked. Like this:

import numpy as np

# pd.np was removed in pandas 2.0, so use numpy (or plain Python types) directly
dtypes = {'id': np.int8,
          'article': str,
          'word_seg': str,
          'class': np.int8}
df_train = pd.read_csv('./train_set.csv', dtype=dtypes)
df_test = pd.read_csv('./test_set.csv', dtype=dtypes)

Or this:

ChunkSize = 10000
i = 1
for chunk in pd.read_csv('./train_set.csv', chunksize=ChunkSize):  # read and merge in chunks
    df_train = chunk if i == 1 else pd.concat([df_train, chunk])
    print('-->Read Chunk...', i)
    i += 1

BUT!!!!! Suddenly the original version works fine as well!

It feels like I did some useless work, and I still have no idea what really went wrong.

I don't know what to say.

You can use df.info(memory_usage="deep") to find out how much memory the data loaded into the DataFrame is actually using.
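
For example, a small sketch assuming a DataFrame loaded from a hypothetical file.csv:

import pandas as pd

df = pd.read_csv('file.csv')
df.info(memory_usage="deep")          # total memory, including the contents of object/string columns
print(df.memory_usage(deep=True))     # per-column memory in bytes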

A few things to reduce memory (a combined sketch follows this list):

  1. Only load the columns you need for the processing via the usecols parameter.
  2. Set dtypes for these columns.
  3. If the dtype of some columns is object/string, you can try dtype="category". In my experience it reduced the memory usage drastically.
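
A minimal sketch combining the three points above (the column names and dtypes are placeholders, not taken from the original post):

import pandas as pd

df = pd.read_csv(
    'file.csv',
    usecols=['id', 'text', 'code'],       # 1. load only the columns you need
    dtype={'id': 'int32',                 # 2. explicit dtype for the numeric column
           'code': 'category'},           # 3. category for repetitive string values
)
df.info(memory_usage="deep")              # check how much memory was actually saved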

I used the code below to load a CSV in chunks, deleting each intermediate chunk to manage memory, and printing the % loaded in real time. Note: 96817414 is the number of rows in my CSV.

import pandas as pd
import gc

cols = ['col_name_1', 'col_name_2', 'col_name_3']
df = pd.DataFrame()
i = 0
for chunk in pd.read_csv('file.csv', chunksize=100000, usecols=cols):
    df = pd.concat([df, chunk], ignore_index=True)
    del chunk; gc.collect()          # drop the intermediate chunk right away
    i += 1
    if i % 5 == 0:                   # report progress every 5 chunks
        print("% of read completed", 100 * (i * 100000 / 96817414))
