
Subsetting Dask DataFrames

Is this a valid way of loading subsets of a dask dataframe to memory:

i = 0
while i < len_df:
    j = i + batch_size
    if j > len_df:
        j = len_df
    # label-based slice; note .loc end bounds are inclusive in pandas/dask
    subset = df.loc[i:j, 'source_country_codes'].compute()
    i = j  # advance to the next batch

I read somewhere that this may not be correct because of how Dask assigns index values when it splits the larger dataframe into smaller pandas dataframes. Also, I don't think Dask dataframes have an `iloc` attribute. I am using version 0.15.2.

In terms of use cases, this would be a way of loading batches of data for deep learning (say, with Keras).

If your dataset has well-known divisions then this might work, but instead I recommend computing one partition at a time:

for part in df.to_delayed():
    subset = part.compute()

You can roughly control the size of each piece by repartitioning beforehand:

for part in df.repartition(npartitions=100).to_delayed():
    subset = part.compute()

This isn't exactly the same, because it doesn't guarantee a fixed number of rows in each partition, but that guarantee might be quite expensive, depending on how the data is obtained.
