
How to convert a dbf file to a dask dataframe?

I have a big dbf file, and converting it to a pandas dataframe is taking a lot of time. Is there a way to convert the file into a dask dataframe instead?

Dask does not have a dbf loading method.

As far as I can tell, dbf files do not support random access to the data, so it is not possible to read sections of the file in separate workers in parallel. I may be wrong about this, but dbfreader certainly makes no mention of jumping to an arbitrary record.

Therefore, the only way you could read from dbf in parallel, and hope to see a speed increase, would be to split your original data into multiple dbf files, and use dask.delayed to read each of them.

It is worth mentioning that the reason dbfreader is slow (but please, do your own profiling!) is probably that it does byte-by-byte manipulation and builds python objects for every record before passing the records to pandas. If you really wanted to speed things up, this code should be converted to cython or maybe numba, with the results assigned into a pre-allocated dataframe.
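To illustrate why the per-record approach is slow: dbf records are fixed-width, so a whole block of them can in principle be reinterpreted at once with a numpy structured dtype instead of creating a python object per record. The field layout and byte string below are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical fixed-width record layout: a 4-byte id and an 8-byte value.
record_dtype = np.dtype([("id", "S4"), ("value", "S8")])

# Two fake 12-byte records, as they might sit in a dbf data section.
raw = b"0001    1.50" + b"0002    2.25"

# Reinterpret the raw bytes as an array of records in one shot --
# no per-record python loop.
records = np.frombuffer(raw, dtype=record_dtype)

# Vectorized conversion of the text fields into numeric columns.
df = pd.DataFrame({
    "id": records["id"].astype(int),
    "value": records["value"].astype(float),
})
```

This bulk decode is the kind of thing cython or numba code could do over the real dbf field descriptors, filling a pre-allocated frame column by column.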
