
IPython in Jupyter notebooks: reading a large data file with pandas becomes very slow (high memory consumption?)

I have a huge data file I want to process in a Jupyter notebook. I use pandas in a for loop for that and specify which lines I'm reading from the file:

import numpy as np
import pandas as pd
import gc
from tqdm import tqdm


# Create a training file with simple derived features
rows_to_read = 150_000
n_chunks = 50

for chunk_idx in tqdm(range(n_chunks)):
    # Skip the rows already covered by previous chunks, keeping the header row
    rows_to_skip = range(1, chunk_idx * rows_to_read + 1) if chunk_idx > 0 else 0
    chunk = pd.read_csv("datafile.csv",
                        dtype={'attribute_1': np.int16, 'attribute_2': np.float64},
                        skiprows=rows_to_skip, nrows=rows_to_read)

    x = chunk['attribute_1'].values
    y = chunk['attribute_2'].values[-1]

    # process data here and try to get rid of the memory afterwards

    del chunk, x, y
    gc.collect()

Although I try to free the memory of the data afterwards, the import starts fast and becomes very slow as the chunk number increases.

Is there something I'm missing? Does anyone know the reason for this and how to fix it?

Thanks in advance, smaica

Edit: Thanks to @Wen-Ben I can circumvent this issue with the chunk method of pandas read_csv. Nevertheless, I'm wondering why this happens. A minimal sketch of that chunked read is shown below.
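(The sketch reuses the file name, dtypes and chunk size from the code above; the processing step is just a placeholder.)

import numpy as np
import pandas as pd
from tqdm import tqdm

# read_csv with chunksize returns an iterator that reads the file sequentially,
# so each chunk starts where the previous one ended instead of the skipped rows
# being scanned again from the beginning of the file on every call
reader = pd.read_csv("datafile.csv",
                     dtype={'attribute_1': np.int16, 'attribute_2': np.float64},
                     chunksize=150_000)

for chunk in tqdm(reader):
    x = chunk['attribute_1'].values
    y = chunk['attribute_2'].values[-1]
    # process data here
    del chunk, x, y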

In my experience, gc.collect() doesn't do much.

If you have a large file that fits on disk (but not comfortably in memory), you can use other libraries such as SFrame.

Here's an example of reading a CSV file:

from turicreate import SFrame  # or the standalone "sframe" package, depending on your installation

sf = SFrame(data='~/mydata/foo.csv')

The API is very similar to Pandas.
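As a rough illustration (the column name attribute_1 is just borrowed from the question as an assumption; to_dataframe() converts back to pandas once the data fits in memory):

# pandas-style operations on the SFrame (illustrative only)
print(sf.head())                    # first rows, as with df.head()
col = sf['attribute_1']             # column access, as with a DataFrame
subset = sf[sf['attribute_1'] > 0]  # boolean filtering
df = subset.to_dataframe()          # materialize as an in-memory pandas DataFrame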
