
Fastest way to parse large CSV files in Pandas

I am using pandas to analyse the large data files here: http://www.nielda.co.uk/betfair/data/ They are around 100 megs in size.

Each load from csv takes a few seconds, and then more time to convert the dates.

I have tried loading the files, converting the dates from strings to datetimes, and then re-saving them as pickle files. But loading those takes a few seconds as well.

What fast methods could I use to load/save the data from disk?

As @chrisb said, pandas' read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don't think you will find anything better for parsing the CSV (and as a note, read_csv is not a 'pure python' solution, as the CSV parser is implemented in C).
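
As a side note, the date conversion mentioned in the question can be folded into the read_csv call itself via parse_dates. A minimal sketch, where the file name and the "date" column are placeholders for your own data:

import pandas as pd

# Parse the date column while reading, instead of converting it in a
# separate pass afterwards. File name and column name are placeholders.
df = pd.read_csv("bdaily.csv", parse_dates=["date"])
print(df.dtypes)  # the "date" column comes back as datetime64[ns]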

But if you have to load/query the data often, a solution would be to parse the CSV only once and then store it in another format, e.g. HDF5. You can use pandas (with PyTables in the background) to query that efficiently (docs).
See here for a comparison of the I/O performance of HDF5, CSV and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
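
A rough sketch of that approach, assuming PyTables is installed and using placeholder file and column names (parse the CSV once, then reload or query the HDF5 store in later sessions):

import pandas as pd

df = pd.read_csv("bdaily.csv", parse_dates=["date"])  # parse the text once

# Store in "table" format so the file can be queried on disk (needs PyTables)
df.to_hdf("bdaily.h5", key="bets", format="table", data_columns=["date"])

# Later sessions reload (or query) without re-parsing the CSV
subset = pd.read_hdf("bdaily.h5", key="bets", where="date >= '2013-01-01'")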

And a possibly relevant other question: "Large data" work flows using pandas

One thing to check is the actual performance of the disk system itself. Especially if you use spinning disks (not SSD), your practical disk read speed may be one of the explaining factors for the performance. So, before doing too much optimization, check if reading the same data into memory (by, e.g., mydata = open('myfile.txt').read()) takes an equivalent amount of time. (Just make sure you do not get bitten by disk caches; if you load the same data twice, the second time it will be much faster because the data is already in the RAM cache.)
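
For example, a crude timing of the raw read, with a placeholder file name (run it on data that is not already in the disk cache):

import time

start = time.time()
data = open("myfile.txt", "rb").read()   # raw read, no parsing at all
print(len(data), "bytes in", time.time() - start, "seconds")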

See the update below before believing what I write underneath

If your problem is really parsing of the files, then I am not sure if any pure Python solution will help you. As you know the actual structure of the files, you do not need to use a generic CSV parser.

There are three things to try, though:

  1. Python csv package and csv.reader
  2. NumPy genfromtxt
  3. NumPy loadtxt

The third one is probably fastest if you can use it with your data. At the same time it has the most limited set of features. (Which actually may make it fast.)
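
For reference, a minimal sketch of all three, assuming a comma-separated file with one header row and (for genfromtxt/loadtxt) purely numeric columns; the file name is a placeholder:

import csv
import numpy as np

# 1. csv package: row-by-row reading, each row is a list of strings
with open("data.csv", newline="") as f:
    rows = list(csv.reader(f))

# 2. NumPy genfromtxt: tolerates missing values and mixed types, but is slower
arr1 = np.genfromtxt("data.csv", delimiter=",", skip_header=1)

# 3. NumPy loadtxt: fewer features, typically faster on clean numeric data
arr2 = np.loadtxt("data.csv", delimiter=",", skiprows=1)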

Also, the suggestions given to you in the comments by crclayton, BKay, and EdChum are good ones.

Try the different alternatives! If they do not work, then you will have to write something in a compiled language (either compiled Python or, e.g., C).

Update: I do believe what chrisb says below, i.e. the pandas parser is fast.

Then the only way to make the parsing faster is to write an application-specific parser in C (or other compiled language). Generic parsing of CSV files is not straightforward, but if the exact structure of the file is known there may be shortcuts. In any case parsing text files is slow, so if you ever can translate it into something more palatable (HDF5, NumPy array), loading will be only limited by the I/O performance.
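
For instance, a numeric column parsed once can be dumped to NumPy's binary .npy format and reloaded at close to disk speed; the file and column names below are placeholders:

import numpy as np
import pandas as pd

df = pd.read_csv("bdaily.csv")                   # slow text parsing, done once
np.save("prices.npy", df["price"].to_numpy())    # binary dump, no parsing needed

prices = np.load("prices.npy")                   # reload is essentially I/O-bound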

Modin is an early-stage project at UC Berkeley's RISELab designed to facilitate the use of distributed computing for data science. It is a multiprocess DataFrame library with an API identical to pandas that allows users to speed up their pandas workflows. Modin accelerates pandas queries by 4x on an 8-core machine, requiring users to change only a single line of code in their notebooks.

pip install modin

If you are using Dask:

pip install modin[dask]

Import Modin by typing:

import modin.pandas as pd

It uses all CPU cores to import the CSV file, and its API is almost identical to pandas.
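
A minimal usage sketch (the file name is a placeholder); only the import line differs from plain pandas:

import modin.pandas as pd

# read_csv is distributed across all CPU cores; the call itself matches pandas
df = pd.read_csv("bdaily.csv", parse_dates=["date"])
print(df.head())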
