
Reading Large CSV file using pandas

I have to read and analyse a CAN logging file in CSV format. It has 161180 rows and 566 columns separated by semicolons. This is the code I have:

import csv
import dtale
import pandas as pd
path = r'C:\Thesis\Log_Files\InputOutput\Input\Test_Log.csv'  # raw string so backslashes are not treated as escapes
raw_data = pd.read_csv(path,engine="python",chunksize = 1000000, sep=";")
df = pd.DataFrame(raw_data)
#df
dtale.show(df)

When I run the code in Jupyter Notebook, it fails with the error message below. Please help me with this. Thanks in advance!

MemoryError: Unable to allocate 348. MiB for an array with shape (161180, 566) and data type object

import time
import pandas as pd
import csv
import dtale
chunk_size = 1000000
batch_no=1
for chunk in pd.read_csv(r"C:\Thesis\Log_Files\InputOutput\Input\Book2.csv", chunksize=chunk_size, sep=";"):
    chunk.to_csv('chunk'+str(batch_no)+'.csv', index=False)
    batch_no+=1
df1 = pd.read_csv('chunk1.csv')    
df1
dtale.show(df1)

I used the above code with only 10 rows and 566 columns, and it works. When I use all 161180 rows, it does not. Could anyone help me with this? Thanks in advance!

I have attached the output here

You are running out of RAM when loading the data file. The best option is to split the reading into chunks and process one chunk at a time instead of loading everything at once.

To read the first 999,999 (non-header) rows:

read_csv(..., nrows=999999)

If you want to read rows 1,000,000 ... 1,999,999:

read_csv(..., skiprows=1000000, nrows=999999)
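Note that a plain skiprows=1000000 also skips the header line, so that slice will not carry the original column names. A minimal sketch of one way to keep them (the path is a placeholder, not your actual file): pass a range to skiprows so that row 0, the header, is still read.

# Sketch: skip the first 1,000,000 data rows but keep the header row.
import pandas as pd

df_slice = pd.read_csv(
    "Test_Log.csv",               # placeholder path
    sep=";",
    skiprows=range(1, 1000001),   # skip data rows 1..1,000,000, keep row 0 (header)
    nrows=999999,
)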

You'll probably also want to use chunksize, which returns a TextFileReader object that you can iterate over:

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
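For example, here is a minimal sketch of chunked processing applied to a file like yours (the path and column names are placeholders, and it assumes you only need a subset of the 566 columns): reduce each chunk as it is read, so the full 566-column object array never has to sit in memory at once.

import pandas as pd
import dtale

path = r"C:\Thesis\Log_Files\InputOutput\Input\Test_Log.csv"  # placeholder path
chunksize = 10 ** 5   # smaller chunks keep peak memory low

pieces = []
for chunk in pd.read_csv(path, sep=";", chunksize=chunksize):
    # Keep only the columns you actually need (placeholder names) from each chunk.
    pieces.append(chunk[["Timestamp", "Signal_1"]])

df = pd.concat(pieces, ignore_index=True)
dtale.show(df)  # inspect the reduced frame in D-Tale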
