
pandas 0.18: out of memory error when reading CSV file with categoricals

I am trying to read a 3 GB file (2.5 million rows, mostly categorical (string) data) into a Pandas DataFrame with the read_csv function, and I get the error: out of memory

  • I am on a PC with pandas 0.18 and 16 GB of RAM, so 3 GB of data should easily fit in 16 GB. (Update: This is not a duplicate question.)
  • I know that I can provide dtype to improve reading the CSV, but there are too many columns in my data set and I want to load it first, then decide on the data types.

The Traceback is:

Traceback (most recent call last):
  File "/home/a/Dropbox/Programming/Python/C and d/main.com.py", line 9, in <module>
    preprocessing()
  File "/home/a/Dropbox/Programming/Python/C and d/main.com.py", line 5, in preprocessing
    df = pd.read_csv(filepath_or_buffer = file_path, sep ='\t', low_memory = False)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 285, in _read
    return parser.read()
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1197, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:8011)
  File "pandas/parser.pyx", line 857, in pandas.parser.TextReader._read_rows (pandas/parser.c:9140)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: out of memory

My code:

import pandas as pd
def preprocessing():
    file_path = r'/home/a/Downloads/main_query.txt'
    df = pd.read_csv(filepath_or_buffer = file_path, sep ='\t', low_memory = False)

The above code produced the error message I posted above.

I then removed low_memory=False, and everything worked; it only gave a warning:

sys:1: DtypeWarning: Columns (17,20,23,24,33,44,58,118,134,
135,137,142,145,146,147) have mixed types.
Specify dtype option on import or set low_memory=False.
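
One way to decide on dtypes without loading the whole 3 GB file is to read only a small sample first. A minimal sketch, assuming the same tab-separated file as above and an arbitrary sample size of 100,000 rows:

import pandas as pd

file_path = '/home/a/Downloads/main_query.txt'

# Read only the first 100,000 rows to see what dtypes pandas infers,
# without loading the whole 3 GB file.
sample = pd.read_csv(file_path, sep='\t', nrows=100000)
print(sample.dtypes)

# Columns inferred as 'object' hold strings (or mixed values) and are the
# candidates for an explicit dtype such as 'category' on the full read.
object_cols = sample.dtypes[sample.dtypes == 'object'].index.tolist()
print(object_cols)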

UPDATE: in pandas 0.19.0 it should be possible to specify a categorical dtype directly when using the read_csv() method:

pd.read_csv(filename, dtype={'col1': 'category'})

So you may try pandas 0.19.0 RC1.
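
For example, a minimal sketch, assuming hypothetical column names in cat_cols for the string-heavy columns and the same tab-separated file as in the question:

import pandas as pd

file_path = '/home/a/Downloads/main_query.txt'

# Hypothetical list of string-heavy columns; replace with your own names.
cat_cols = ['col_a', 'col_b']

# Parse those columns directly as 'category' instead of object/str.
df = pd.read_csv(file_path, sep='\t',
                 dtype={c: 'category' for c in cat_cols})

# memory_usage='deep' counts the actual string/category storage;
# the default under-reports object columns.
df.info(memory_usage='deep')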

OLD answer:

You can read your CSV in chunks and concatenate each chunk to the resulting DataFrame at each step:

import numpy as np
import pandas as pd

chunksize = 10**5
df = pd.DataFrame()

for chunk in pd.read_csv(filename,
                         dtype={'col1': np.int8, 'col2': np.int32},  # ..., and so on
                         chunksize=chunksize):
    df = pd.concat([df, chunk], ignore_index=True)

NOTE: parameter dtype is unsupported with engine='python'
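
Concatenating inside the loop re-copies the growing DataFrame on every iteration. A common variant, sketched here under the same assumptions about filename and the dtype placeholders, collects the chunks in a list and concatenates once at the end:

import numpy as np
import pandas as pd

chunksize = 10**5
chunks = []

for chunk in pd.read_csv(filename,
                         dtype={'col1': np.int8, 'col2': np.int32},  # ..., and so on
                         chunksize=chunksize):
    chunks.append(chunk)

# One concat at the end avoids re-copying the accumulated DataFrame
# on every iteration.
df = pd.concat(chunks, ignore_index=True)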

The question is a duplicate:

  1. Categoricals read in and stored as strings (as opposed to the categorical dtype) take tons of memory.
    • (pandas will under-report memory usage for DataFrames with strings unless you use df.info(memory_usage='deep') or df.memory_usage(deep=True))
  2. As of pandas 0.19, you no longer need to specify each categorical variable's levels. Just do pd.read_csv(..., dtype={'foo': 'category', 'bar': 'category', ...})
  3. That should solve everything. In the extremely unlikely event you still run out of memory, then also debug like this:
    • only read in a subset of columns, say usecols = ['foo', 'bar', 'baz']
    • only read in a subset of rows (say nrows=1e5 or see also skiprows=... )
    • and iteratively figure out each categorical's levels and how much memory it uses. You don't need to read in all rows or columns to figure out one categorical column's levels (see the sketch after this list).
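
A sketch of that last step, assuming a hypothetical column name 'foo' and the same tab-separated file as in the question: read just that one column and only part of the rows with usecols and nrows, then compare its storage as plain strings versus as a categorical:

import pandas as pd

file_path = '/home/a/Downloads/main_query.txt'

# Hypothetical column name; check one suspect column at a time.
col = 'foo'

# Only that column, and only part of the rows, is enough to see its
# distinct levels and compare object vs. category storage.
s = pd.read_csv(file_path, sep='\t', usecols=[col], nrows=100000)[col]

print(s.nunique())                                   # number of levels
print(s.memory_usage(deep=True))                     # stored as strings
print(s.astype('category').memory_usage(deep=True))  # stored as categorical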


 