
How to convert a dbf file to a dask dataframe?

I have a big dbf file, and converting it to a pandas dataframe is taking a lot of time. Is there a way to convert the file into a dask dataframe?

Dask does not have a dbf loading method.

As far as I can tell, dbf files do not support random access to the data, so it is not possible to read sections of the file in separate workers in parallel. I may be wrong about this, but dbfreader certainly makes no mention of seeking to an arbitrary record.

Therefore, the only way you could read a dbf in parallel, and hope to see a speed increase, would be to split your original data into multiple dbf files and use dask.delayed to read each of them, as in the sketch below.
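Here is a minimal sketch of that approach, assuming the data has already been split into several files and that the dbfread package is used as the reader; the file names and the read_dbf helper are illustrative, not part of any library:

    import pandas as pd
    import dask.dataframe as dd
    from dask import delayed
    from dbfread import DBF

    def read_dbf(path):
        # DBF(path) is iterable and yields one dict per record;
        # pandas builds a frame directly from that iterator.
        return pd.DataFrame(iter(DBF(path)))

    paths = ["part1.dbf", "part2.dbf", "part3.dbf"]  # hypothetical split files
    parts = [delayed(read_dbf)(p) for p in paths]

    # An empty sample of the first partition tells dask the column names
    # and dtypes without reading everything up front.
    meta = read_dbf(paths[0]).head(0)
    ddf = dd.from_delayed(parts, meta=meta)

Each partition is then read lazily by whichever worker picks up the task, so the per-file parsing cost is at least spread across cores.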

It is worth mentioning that the reason dbfreader is slow (but please, do your own profiling!) is probably that it performs byte-by-byte manipulation and creates Python objects for every record before passing the records to pandas. If you really wanted to speed things up, that code would need to be converted to Cython or perhaps numba, with the parsed values assigned into a pre-allocated dataframe.
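For illustration only, here is a toy sketch of that pattern with numba. The fixed-width integer layout is a made-up stand-in for real dbf field descriptors, not the actual dbf format; the point is the compiled byte-level loop writing into a pre-allocated array instead of building a Python object per record:

    import numpy as np
    from numba import njit

    @njit
    def parse_int_column(buf, n_records, width):
        # Output is allocated once up front; the loop touches raw bytes only.
        out = np.empty(n_records, dtype=np.int64)
        for i in range(n_records):
            value = 0
            for j in range(i * width, (i + 1) * width):
                c = buf[j]
                if c >= 48 and c <= 57:  # ASCII '0'-'9'; blanks are padding
                    value = value * 10 + (c - 48)
            out[i] = value
        return out

    raw = b"   123    45  6789"               # three 6-byte integer fields
    buf = np.frombuffer(raw, dtype=np.uint8)
    print(parse_int_column(buf, 3, 6))        # [ 123   45 6789]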
