How to get the first item of each group in a dask.DataFrame?

Question

I want to get the first item of each group of distinct entries in a column containing IDs. It works with pandas, but not in dask, as I cannot sort by multiple columns and the .head aggregation is not implemented. Is there another way of getting the desired result?

Here is the minimal example for pandas, where everything works fine:

import pandas as pd

t=pd.DataFrame([[1,2,"ij"],[1,2,"huHU"],[2,4],[2,9],[0,17],[0,2],[1,8],[1,-18]],columns=["particleID","distZ","someothercols"])
tz = ( 
        t
        .sort_values(["particleID","distZ"],axis=0)
        .groupby(["particleID"])
        .head(1)
    )
print(t)
print(tz)

But in dask, see below, I get a NotImplementedError.

import dask.dataframe as dd

t2=dd.from_pandas(t,npartitions=2)
tz2 = ( 
        t2
        .sort_values(["particleID","distZ"],axis=0)
        .groupby(["particleID"])
        .head(1)
    )

print(t2.compute())

I could get the pandas result with this code, but it seems quite inefficient, since I have a needless sort first. Also, in my real application, I need more than one row per group, and head does not work with dask.

tz2 = ( 
        t2
        .sort_values(["distZ"],axis=0)
        .sort_values(["particleID"],axis=0)
        .groupby(["particleID"])
        .first()
    )

print(t2.compute())
print(tz2.compute())

Background: I want to convince everyone to switch from SAS to Python and pandas. However, we have some very large datasets and this is a very common application. In SAS it is quite easy with if first.

Answer 1

It's likely that the NotImplementedError is raised by .sort_values, since right now dask.dataframe only implements sorting on a single column; see the docs.

Answer 2

The solution is dask's groupby.apply with a function that works on the DataFrame of each group.

import dask.dataframe as dd

t2=dd.from_pandas(t,npartitions=2)
tz2 = ( 
        t2
        .sort_values(["particleID"],axis=0)
        .groupby(["particleID"])
        .apply(lambda s: s.sort_values(["distZ"],axis=0).head(2),
            meta={"particleID":"int", "distZ":"int", "someothercols":"object"})
    )
print(t2.compute())
print(tz2.compute())
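To check that the per-group function really selects the same rows as the sort_values / groupby / head chain from the question, the logic can be run in plain pandas first. This is a minimal sketch: first_n_by_dist is a hypothetical helper name, the missing someothercols values are filled with None, and kind="stable" is added so that ties on distZ are broken by original row order in both variants.

```python
import pandas as pd

t = pd.DataFrame(
    [[1, 2, "ij"], [1, 2, "huHU"], [2, 4, None], [2, 9, None],
     [0, 17, None], [0, 2, None], [1, 8, None], [1, -18, None]],
    columns=["particleID", "distZ", "someothercols"],
)

def first_n_by_dist(group, n=2):
    # Hypothetical helper: the n rows with the smallest distZ in one group.
    return group.sort_values("distZ", kind="stable").head(n)

# Reference result: the multi-column sort from the question, two rows per group.
expected = t.sort_values(["particleID", "distZ"]).groupby("particleID").head(2)

# Same selection, applied group by group (groups come out in key order).
per_group = pd.concat([first_n_by_dist(g) for _, g in t.groupby("particleID")])

print(per_group)
print(per_group.index.tolist() == expected.index.tolist())
```

If the two index lists agree on a small sample like this, the same function can be passed to the dask apply call above, with n set to the number of rows needed per group.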

Answer 1
0 2022-11-12 12:46:31

Answer 2
0 2022-11-16 05:49:01
