
Dask: create strictly increasing index

As is well documented, Dask creates a strictly increasing index on a per-partition basis when reset_index is called, resulting in duplicate indices over the whole set. What is the best way (e.g. computationally quickest) to create a strictly increasing index in Dask - which doesn't have to be consecutive - over the whole set? I was hoping map_partitions would pass in the partition number, but I don't think it does. Thanks.
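
For context, a minimal illustration of that per-partition behaviour (small synthetic frame, for demonstration only):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)

# reset_index restarts at 0 inside each partition, so the index
# values 0..4 show up twice across the whole collection.
print(ddf.reset_index(drop=True).compute().index.tolist())
# -> [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]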

EDIT

Thanks @MRocklin, I've got this far, but I need a little assistance on how to recombine my series with the original dataframe.

import numpy as np
import dask.dataframe as dd


def create_increasing_index(ddf: dd.DataFrame):
    # Rough upper bound on partition size: average length plus some slack.
    mps = int(len(ddf) / ddf.npartitions + 1000)
    values = ddf.index.values  # dask.array with one block per partition

    def do(x, max_partition_size, block_id=None):
        length = len(x)
        if length == 0:
            raise ValueError("Does not work with empty partitions. Consider using dask.repartition.")

        # block_id identifies the partition, so each block gets its own range.
        start = block_id[0] * max_partition_size
        return np.arange(start, start + length, dtype=np.int64)

    series = values.map_blocks(do, max_partition_size=mps, dtype=np.int64)
    ddf2 = dd.concat([ddf, dd.from_array(series)], axis=1)
    return ddf2

Where I'm getting the error "ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1". Is there a better way than using dd.concat? Thanks.

EDIT AGAIN

Actually, for my purposes (and the amounts of data I was testing on - only a few GB) cumsum is fast enough. I'll revisit when this becomes too slow!

A rather slow way of accomplishing this would be to create a new column and then use cumsum:

ddf['x'] = 1
ddf['x'] = ddf.x.cumsum()
ddf = ddf.set_index('x', sorted=True)

This is neither very fast nor is it free.
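
For reference, a quick check on a small synthetic frame (the data here is illustrative) that the cumsum-based index really is strictly increasing across partitions:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=3)
ddf['x'] = 1
ddf['x'] = ddf.x.cumsum()
ddf = ddf.set_index('x', sorted=True)

idx = ddf.compute().index
# strictly increasing: monotonic and no duplicates
assert idx.is_monotonic_increasing and idx.is_unique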

Given how your question is phrased I suspect that you are looking to just create a range for each partition that is separated by a very large value that you know to be larger than the number of rows in any single partition. You're right that map_partitions doesn't provide the partition number. You could do one of the two solutions below instead.

  1. Convert to a dask.array (with .values), use the map_blocks method, which does provide a block index, and then convert back to a series with dd.from_array (a minimal sketch follows this list).
  2. Convert to a list of dask.delayed objects, create the delayed series yourself, and then convert back to a dask series with dd.from_delayed (see the second sketch after the documentation link below).
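
A rough sketch of the first option (the function name and stride value are illustrative, and it assumes no partition ever exceeds stride rows):

import numpy as np
import dask.dataframe as dd

def increasing_index_series(ddf, stride=2**40):
    def block_range(block, block_id=None):
        # map_blocks passes block_id, e.g. (0,), (1,), ...; offset each
        # partition's range by its block number times the stride.
        return block_id[0] * stride + np.arange(len(block), dtype=np.int64)

    values = ddf.index.values  # dask.array with one block per partition
    return dd.from_array(values.map_blocks(block_range, dtype=np.int64))

Note that the resulting series has unknown divisions, which is exactly why dd.concat(..., axis=1) complains when you try to glue it back on; the delayed route below avoids that by renumbering each pandas partition in place.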

http://dask.pydata.org/en/latest/delayed-collections.html
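
And a rough sketch of the second option (again, the names and the stride bound are illustrative assumptions):

import dask
import pandas as pd
import dask.dataframe as dd

def reindex_with_offsets(ddf, stride=2**40):
    def renumber(pdf, start):
        # runs on an ordinary pandas partition
        pdf = pdf.copy()
        pdf.index = pd.RangeIndex(start, start + len(pdf))
        return pdf

    parts = [
        dask.delayed(renumber)(part, i * stride)
        for i, part in enumerate(ddf.to_delayed())
    ]
    # meta preserves the original column names and dtypes
    return dd.from_delayed(parts, meta=ddf._meta)

Because the data and the new index travel together inside each partition, there is no separate series to recombine afterwards.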
