
Dask: create strictly increasing index

As is well documented, Dask creates a strictly increasing index on a per-partition basis when reset_index is called, resulting in duplicate indices over the whole set. What is the best (e.g. computationally quickest) way to create a strictly increasing index, which doesn't have to be consecutive, over the whole set in Dask? I was hoping map_partitions would pass in the partition number, but I don't think it does. Thanks.
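For concreteness, here is a tiny example of the behaviour I mean (toy data, just to illustrate):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": range(6)}, index=list("abcdef"))
ddf = dd.from_pandas(pdf, npartitions=2)

# reset_index restarts the count inside every partition, so the
# computed result repeats the index values 0, 1, 2 in each partition.
print(ddf.reset_index(drop=True).compute().index.tolist())
# -> [0, 1, 2, 0, 1, 2]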

EDIT

Thanks @MRocklin, I've got this far, but I need a little assistance on how to recombine my series with the original dataframe.

import numpy as np
import dask.dataframe as dd

def create_increasing_index(ddf: dd.DataFrame):
    # Pad the per-partition offset so ranges from different partitions never overlap.
    mps = int(len(ddf) / ddf.npartitions + 1000)
    values = ddf.index.values  # one dask.array chunk per partition

    def do(x, max_partition_size, block_id=None):
        length = len(x)
        if length == 0:
            raise ValueError("Does not work with empty partitions. Consider using dask.repartition.")

        # block_id[0] is the chunk (partition) number, so each block gets its own range.
        start = block_id[0] * max_partition_size
        return np.arange(start, start + length, dtype=np.int64)

    series = values.map_blocks(do, max_partition_size=mps, dtype=np.int64)
    ddf2 = dd.concat([ddf, dd.from_array(series)], axis=1)
    return ddf2

Where I'm getting the error "ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1". Is there a better way than using dd.concat? Thanks.

EDIT AGAIN

Actually, for my purposes (and the amounts of data I was testing on, only a few GB) cumsum is fast enough. I'll revisit this when it becomes too slow!

ANSWER

A rather slow way of accomplishing this would be to create a new column of ones and then use cumsum:

ddf['x'] = 1
ddf['x'] = ddf.x.cumsum()
ddf = ddf.set_index('x', sorted=True)

This is neither very fast nor is it free.
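For completeness, a minimal runnable version of the snippet above (the toy data is mine; it assumes only pandas and dask are installed):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": list("abcdef")})
ddf = dd.from_pandas(pdf, npartitions=3)

# A column of ones turned into a cumulative sum yields 1, 2, 3, ...
# across the whole collection, not per partition.
ddf['x'] = 1
ddf['x'] = ddf.x.cumsum()

# sorted=True tells Dask the new index is already monotonically
# increasing, so no shuffle is required.
ddf = ddf.set_index('x', sorted=True)
print(ddf.compute())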

Given how your question is phrased, I suspect that you are looking to just create a separate range for each partition, with starting points spaced by a value that you know is larger than the number of rows in any partition. You're right that map_partitions doesn't provide the partition number. You could do one of the two solutions below instead.

  1. Convert to a dask.array (with .values), use the map_blocks method, which does provide a block index, and then convert back to a series with dd.from_array.
  2. Convert to a list of dask.delayed objects, create the delayed series yourself, and then convert back to a dask series with dd.from_delayed (see the sketch after this list).
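A rough sketch of the second option using dask.delayed; the helper name add_strict_index, the column name and the offset are my own choices rather than anything from the original answer, and the offset is assumed to exceed the size of any single partition:

import numpy as np
import dask.dataframe as dd
from dask import delayed

def add_strict_index(ddf, colname="strict_index", offset=10**9):
    # One delayed pandas DataFrame per partition.
    parts = ddf.to_delayed()

    @delayed
    def number_partition(df, partition_number):
        df = df.copy()
        start = partition_number * offset
        # Strictly increasing (but not consecutive) values for this partition.
        df[colname] = np.arange(start, start + len(df), dtype=np.int64)
        return df

    numbered = [number_partition(part, i) for i, part in enumerate(parts)]

    # Reuse the original metadata, extended with the new column, so Dask
    # does not need to infer the schema of the delayed partitions.
    meta = ddf._meta.assign(**{colname: np.array([], dtype=np.int64)})
    return dd.from_delayed(numbered, meta=meta)

If the new column should become the index, ddf.set_index("strict_index", sorted=True) can follow, since the values are already monotonically increasing across partitions.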

http://dask.pydata.org/en/latest/delayed-collections.html
