I have a huge data file that I want to process in a Jupyter notebook. I use pandas in a for loop for that and specify which lines I'm reading from the file:
import numpy as np
import pandas as pd
import gc
from tqdm import tqdm

# Create a training file with simple derived features
rowstoread = 150_000
n_chunks = 50
for chunk_no in tqdm(range(n_chunks)):
    # skip the header-adjacent rows already read in earlier iterations
    rowstoskip = range(1, chunk_no * rowstoread + 1) if chunk_no > 0 else 0
    chunk = pd.read_csv("datafile.csv",
                        dtype={'attribute_1': np.int16, 'attribute_2': np.float64},
                        skiprows=rowstoskip, nrows=rowstoread)
    x = chunk['attribute_1'].values
    y = chunk['attribute_2'].values[-1]
    # process data here and try to get rid of the memory afterwards
    del chunk, x, y
    gc.collect()
Although I try to free the memory of the data I read, the import starts fast and becomes slower and slower the higher the current chunk number gets.
Is there something I'm missing? Does anyone know the reason for this and how to fix it?
Thanks in advance, smaica
Edit: Thanks to @Wen-Ben I can circumvent this issue with the chunk method from pandas `read_csv`. Nevertheless, I'm wondering why this happens.
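For reference, a minimal sketch of that `chunksize` workaround. Passing `chunksize` makes `read_csv` return an iterator that streams forward through the file, so earlier rows are never re-parsed (with `skiprows`, every iteration has to scan the file from the top again, which is why it gets slower per chunk). The small generated CSV and the reduced chunk size are only there to make the sketch self-contained; the column names match the question.

```python
import numpy as np
import pandas as pd

# Write a small demo file so the sketch is runnable
# (in the real case, "datafile.csv" already exists and is huge).
with open("datafile.csv", "w") as f:
    f.write("attribute_1,attribute_2\n")
    for i in range(1000):
        f.write(f"{i % 100},{i * 0.5}\n")

rowstoread = 150  # 150_000 in the original question

# chunksize=N yields DataFrames of at most N rows, read sequentially
last_values = []
for chunk in pd.read_csv("datafile.csv",
                         dtype={'attribute_1': np.int16,
                                'attribute_2': np.float64},
                         chunksize=rowstoread):
    x = chunk['attribute_1'].values
    y = chunk['attribute_2'].values[-1]  # last value of this chunk
    last_values.append(y)
    # process data here; each chunk can be garbage-collected
    # once the loop moves on

print(len(last_values))  # 1000 rows / 150 per chunk -> 7 chunks
```

Each pass over the loop only reads the next `rowstoread` lines from the open file handle, so the cost per chunk stays constant instead of growing with the chunk number.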