
Find median value across multiple columns in a Dask dataframe

I have a Dask dataframe with three columns: width, height and length. I need to create a fourth column, which is the median of the three.

My code with a regular pandas df doesn't work, as median is not a function in Dask.

columns_to_sum = ['width', 'height', 'length']
df['median'] = df[columns_to_sum].median(axis=1)

Any help is appreciated!

Because the median is the middle value in the ordered set of all values, it is slow and difficult to compute for larger-than-memory data structures.

Dask's dask.DataFrame.quantile implements several algorithms for producing approximate quantiles:

df['median'] = df[columns_to_sum].quantile(0.5)

However, as @quasiben pointed out, df[columns_to_sum].mean() will be more efficient even than these approximate algorithms. Also, there are outstanding issues with some of the algorithms, suggesting that dask.DataFrame.quantile can do a very poor job of approximating the true quantiles in some edge cases. They're working on it.

While it's true that a parallel median is hard, in this case the asker wants the median across columns. This is easy, because for every row all of the needed values are already in memory together.
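To illustrate, a row-wise median depends only on the values within that row, so plain pandas handles it directly with `axis=1`; a minimal sketch using the question's column names:

```python
import pandas as pd

# Each row's median uses only that row's own three values,
# so no cross-row (or cross-partition) communication is needed.
df = pd.DataFrame({
    "width":  [1.0, 4.0, 7.0],
    "height": [2.0, 5.0, 8.0],
    "length": [3.0, 6.0, 9.0],
})
df["median"] = df[["width", "height", "length"]].median(axis=1)
print(df["median"].tolist())  # [2.0, 5.0, 8.0]
```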

If this doesn't already exist then it should be added to Dask DataFrame. If you want to raise an issue at https://github.com/dask/dask/issues/new, that would be welcome.

As a short-term workaround, you can always use pandas functions with map_partitions:

import pandas

def f(df: pandas.DataFrame, columns: list) -> pandas.DataFrame:
    df = df.copy()  # dask prefers that you not mutate inputs
    df["median"] = df[columns].median(axis=1)
    return df

ddf = ddf.map_partitions(f, columns=["a", "b", "c"])
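Because map_partitions applies f to each partition as an ordinary pandas DataFrame, the per-partition logic can be sanity-checked without Dask at all (the column names a, b, c follow the snippet above):

```python
import pandas as pd

def f(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    df = df.copy()  # dask prefers that you not mutate inputs
    df["median"] = df[columns].median(axis=1)
    return df

# Exercise the function on a plain pandas frame, exactly as
# map_partitions would call it on one partition.
part = pd.DataFrame({"a": [1.0, 10.0], "b": [2.0, 20.0], "c": [3.0, 30.0]})
out = f(part, columns=["a", "b", "c"])
print(out["median"].tolist())  # [2.0, 20.0]
```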
