
Which is the best way to read Parquet files for processing as a Dask DataFrame?

I have a directory with 600 small Parquet files. I want to do ETL on them and merge them into files of roughly 128 MB each. What is the optimal way to process the data?

Should I read each file in the Parquet directory individually, concatenate them into a single DataFrame, and then do the groupby? Or should I pass the directory name to dd.read_parquet and process it that way?
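For reference, here is a minimal sketch of the second approach (passing the whole directory to dd.read_parquet and repartitioning before writing). The paths and column names are placeholders, and it assumes all files share the same schema:

```python
import dask.dataframe as dd

# Read every Parquet file in the directory as one lazy Dask DataFrame;
# Dask creates partitions from the files without loading them all into memory.
df = dd.read_parquet("input_parquet_dir/")  # placeholder path

# Example transformation: a groupby aggregation (placeholder column names).
result = df.groupby("key_column").agg({"value_column": "sum"}).reset_index()

# Repartition so each output file is roughly 128 MB, then write back out.
result = result.repartition(partition_size="128MB")
result.to_parquet("output_parquet_dir/", write_index=False)
```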

I have noticed that when I read the files one by one, it creates a very large Dask task graph that cannot even be rendered as an image. I assume it also spawns a correspondingly large number of tasks/threads, which leads to a memory error.

Which way is best to read the Parquet files for processing as a Dask DataFrame: file by file, or by providing the entire directory?

Unfortunately, there is no single best way to read a Parquet file that covers all situations. To answer the question properly, more details about your specific situation are needed.
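For example, the total data size, the memory available per worker, and how many partitions the reader produces all affect which approach is reasonable. A small diagnostic sketch (paths are placeholders) for inspecting those numbers before deciding:

```python
import dask.dataframe as dd

df = dd.read_parquet("input_parquet_dir/")  # placeholder path

# How many partitions (and therefore, roughly, how many tasks per operation)
# the reader created from the directory.
print("partitions:", df.npartitions)

# Approximate in-memory size of each partition, to judge whether
# repartitioning is needed before writing the merged files.
sizes = df.map_partitions(lambda part: part.memory_usage(deep=True).sum()).compute()
print("mean partition size (MB):", sizes.mean() / 1e6)
```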
