
Is there a faster way to read/write a pandas DataFrame with about 1 million rows?

I am trying to be really specific about my issue. I have a DataFrame with 200+ columns and 1 million+ rows. Reading it from a CSV and writing it to an Excel file takes more than 45 minutes, if I timed it correctly.

import pandas as pd

# Read the CSV; header=0 uses the first row as column names.
df = pd.read_csv("data_file.csv", low_memory=False, header=0, delimiter=",", na_values=("", "nan"))
# Writing to Excel is the slow step. Note that header=0 is falsy here, so no header row is written.
df.to_excel("data_file.xlsx", header=0, index=False)

My question: is there any way to read or write files faster with a pandas DataFrame? This is just one example file; I have many more like it.

Two thoughts:

  • Investigate Dask, which provides a pandas-like DataFrame that can distribute the processing of large datasets across multiple CPUs or a cluster. It is hard to say how much of a speed-up you will get if your performance is purely I/O bound, but it is certainly worth investigating. Take a quick look at the Dask use cases to get an understanding of its capabilities (a minimal sketch follows this list).

  • If you are going to repeatedly read the same CSV input files, then I would suggest converting them to HDF, as reading HDF is orders of magnitude faster than reading the equivalent CSV file. It's as simple as reading the file into a DataFrame and then writing it back out using DataFrame.to_hdf() (see the second sketch below). Obviously this will only help if you can do the conversion as a one-off exercise and then use the HDF files from that point forward whenever you run your code.
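
A minimal sketch of the Dask approach from the first bullet, assuming the input files share a naming pattern such as "data_file*.csv" (the pattern and file names are placeholders, not from the question):

import dask.dataframe as dd

# Dask reads the matching CSV files in parallel, one partition per chunk/file.
ddf = dd.read_csv("data_file*.csv", na_values=("", "nan"))
# Work is lazy until you ask for a result; compute() returns a pandas DataFrame.
df = ddf.compute()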
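
A sketch of the one-off CSV-to-HDF conversion from the second bullet. The .h5 file name and the key "data" are arbitrary choices here, and to_hdf() requires the PyTables ("tables") package:

import pandas as pd

# One-off conversion: read the CSV once and store it as HDF5.
df = pd.read_csv("data_file.csv", low_memory=False, na_values=("", "nan"))
df.to_hdf("data_file.h5", key="data", mode="w")

# From then on, load the much faster HDF file instead of the CSV.
df = pd.read_hdf("data_file.h5", key="data")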

Regards, Ian

That is a big file you are working with. If you need to process the data, you can't really get around the long read and write times.

Do NOT write to xlsx; writing to xlsx is what takes so long. Write to CSV instead. Writing the same data to CSV takes about a minute on my cheap laptop with an SSD.
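
For example, assuming df is the DataFrame from the question (the output file name is a placeholder):

# Plain-text CSV output is far cheaper than building an xlsx workbook.
df.to_csv("data_file_out.csv", index=False)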
