Dividing elements by the groupby sum in dask without setting the index for each column

Question

I have code implemented in pandas, but I am having trouble converting it to dask because it relies on set_index(). What is the best workaround? I am using dask because I need to scale this to much larger dataframes.

I want to return a dataframe where each element is divided by the column-wise sum of its group. An example dataframe looks like this:

import pandas as pd

df = [
    [1,4,2,1],
    [4,4,0,-1],
    [2,3,1,6],
    [-2,1,0,-1],
    [6,-3,-2,-1],
    [1,0,5,5],
]
df = pd.DataFrame(df)
lab_id = ['a','b','a','b','a','c']
df['lab_id'] = lab_id
df

    0    1    2    3    lab_id 
0   1    4    2    1    a
1   4    4    0   -1    b
2   2    3    1    6    a
3  -2    1    0   -1    b
4   6   -3   -2   -1    a
5   1    0    5    5    c

Currently in pandas I do a groupby sum to return a dataframe:

sum_df = df.groupby('lab_id').sum()
sum_df

       0    1   2   3
lab_id              
a      9    4   1   6
b      2    5   0   -2
c      1    0   5   5

Then I set the index of the original dataframe and divide by the sum dataframe:

df.set_index('lab_id')/sum_df


           0    1        2      3
lab_id              
a   0.111111    1.00     2.0    0.166667
a   0.222222    0.75     1.0    1.000000
a   0.666667    -0.75    -2.0   -0.166667
b   2.000000    0.80     NaN    0.500000
b   -1.000000   0.20     NaN    0.500000
c   1.000000    NaN      1.0    1.000000

The main problem is that I am having a lot of trouble setting the index in dask, which explicitly recommends avoiding the set_index() and reset_index() methods. I simply can't find a way around doing so!

I have tried many arcane ways to set the index outside of dask, such as creating a new dataframe with the index already set and a row of dummy data, then iteratively assigning the columns from the old dataframe (this is some of the worst code I've written).

Answer 1

Try using transform:

cols = [0, 1, 2, 3]  # numeric columns only; lab_id is kept out of the division
df[cols] = df[cols] / df.groupby('lab_id')[cols].transform('sum')
df
Out[767]: 
          0     1    2         3 lab_id
0  0.111111  1.00  2.0  0.166667      a
1  2.000000  0.80  NaN  0.500000      b
2  0.222222  0.75  1.0  1.000000      a
3 -1.000000  0.20  NaN  0.500000      b
4  0.666667 -0.75 -2.0 -0.166667      a
5  1.000000   NaN  1.0  1.000000      c
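The answer above is written for pandas. Since the question is about avoiding set_index() on the large frame, here is a minimal sketch of another pattern that never indexes the row-level data: compute the small per-group sums, join them back on the lab_id column, and divide. It is shown in pandas so it can be run as-is. dask.dataframe also provides groupby(...).sum() and merge, so the same pattern is a candidate to port, but that port is an assumption and is not tested here. The column names c0..c3 and the _sum suffix are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Same data as in the question; string column names (c0..c3) are an
# illustrative assumption so that merge suffixes are unambiguous.
df = pd.DataFrame(
    [[1, 4, 2, 1], [4, 4, 0, -1], [2, 3, 1, 6],
     [-2, 1, 0, -1], [6, -3, -2, -1], [1, 0, 5, 5]],
    columns=["c0", "c1", "c2", "c3"],
)
df["lab_id"] = ["a", "b", "a", "b", "a", "c"]
value_cols = ["c0", "c1", "c2", "c3"]

# One row per group. reset_index() here only touches this small
# aggregated frame, not the row-level data.
sums = df.groupby("lab_id")[value_cols].sum().reset_index()

# Attach each row's group sums by joining on the lab_id column
# (no index is set on df). Sum columns get the "_sum" suffix.
merged = df.merge(sums, on="lab_id", how="left", suffixes=("", "_sum"))

# Divide every value column by its matching group-sum column.
result = merged[["lab_id"]].copy()
for col in value_cols:
    result[col] = merged[col] / merged[col + "_sum"]

print(result)
```

In pandas the left merge keeps the original row order, so the result lines up with the transform-based answer. A distributed merge may not preserve row order, so keep an explicit row identifier column if order matters.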
