Divide element by sum of groupby in dask without setting index for every column

I have code implemented in pandas, but I am having trouble converting it to dask because it relies on set_index(). What is the best workaround? I am using dask because I need to scale this to much larger dataframes.

I want to return a dataframe in which each element is divided by the column-wise sum of its group. An example dataframe looks like this:

import pandas as pd

data = [
    [1,4,2,1],
    [4,4,0,-1],
    [2,3,1,6],
    [-2,1,0,-1],
    [6,-3,-2,-1],
    [1,0,5,5],
]
df = pd.DataFrame(data)
lab_id = ['a','b','a','b','a','c']
df['lab_id'] = lab_id
df

    0    1    2    3    lab_id 
0   1    4    2    1    a
1   4    4    0   -1    b
2   2    3    1    6    a
3  -2    1    0   -1    b
4   6   -3   -2   -1    a
5   1    0    5    5    c

Currently, in pandas, I group by lab_id and sum to get a dataframe of per-group column sums:

sum_df = df.groupby('lab_id').sum()
sum_df

       0    1   2   3
lab_id              
a      9    4   1   6
b      2    5   0   -2
c      1    0   5   5

Then I set lab_id as the index of the original dataframe and divide by the sum dataframe, so the division aligns on lab_id:

df.set_index('lab_id')/sum_df


           0    1        2      3
lab_id              
a   0.111111    1.00     2.0    0.166667
a   0.222222    0.75     1.0    1.000000
a   0.666667    -0.75    -2.0   -0.166667
b   2.000000    0.80     NaN    0.500000
b   -1.000000   0.20     NaN    0.500000
c   1.000000    NaN      1.0    1.000000

The main problem is that I am having a huge issue setting the index in dask, which explicitly says to avoid the set_index() and reset_index() methods. I simply can't find a way around doing so!

I have tried many arcane ways to set the index outside of dask, such as creating a new dataframe with the index already set and a row of dummy data, then iteratively assigning the columns from the old dataframe (this is some of the worst code I've written).

Try with transform. It returns the group sums aligned to the original rows, so no set_index() is needed:

cols = [0, 1, 2, 3]
# Divide only the numeric columns; lab_id is left untouched.
df[cols] = df[cols] / df.groupby('lab_id')[cols].transform('sum')
df
Out[767]: 
          0     1    2         3 lab_id
0  0.111111  1.00  2.0  0.166667      a
1  2.000000  0.80  NaN  0.500000      b
2  0.222222  0.75  1.0  1.000000      a
3 -1.000000  0.20  NaN  0.500000      b
4  0.666667 -0.75 -2.0 -0.166667      a
5  1.000000   NaN  1.0  1.000000      c
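
If transform is not convenient, another way to avoid set_index() is to merge the per-group sums back onto the rows and divide column by column. Below is a minimal pandas sketch of that idea. The helper name divide_by_group_sum and the "_sum" column suffix are made up for illustration. dask.dataframe also provides groupby(...).sum() and merge, so the same pattern may carry over, but this sketch has only been written against pandas and is not tested on dask.

```python
import pandas as pd


def divide_by_group_sum(df, key):
    """Divide every non-key column by the sum of that column within its group.

    Hypothetical helper for illustration; row order and index of `df` are kept.
    """
    value_cols = [c for c in df.columns if c != key]

    # One row per group, with the key kept as an ordinary column.
    sums = df.groupby(key, as_index=False)[value_cols].sum()

    # Rename the sum columns so they cannot collide with the value columns.
    sums = sums.rename(columns={c: f"{c}_sum" for c in value_cols})

    # A left merge copies each group's sums onto every row of that group
    # and preserves the row order of the left frame.
    merged = df.merge(sums, on=key, how="left")
    merged.index = df.index

    out = df[[key]].copy()
    for c in value_cols:
        # 0 / 0 yields NaN, matching the output shown in the question.
        out[c] = merged[c] / merged[f"{c}_sum"]
    return out


data = [
    [1, 4, 2, 1],
    [4, 4, 0, -1],
    [2, 3, 1, 6],
    [-2, 1, 0, -1],
    [6, -3, -2, -1],
    [1, 0, 5, 5],
]
example = pd.DataFrame(data)
example["lab_id"] = ["a", "b", "a", "b", "a", "c"]

result = divide_by_group_sum(example, "lab_id")
print(result)
```

The merge keeps lab_id as a regular column throughout, so nothing has to be set as the index. The trade-off is a temporarily wider frame (one extra column per value column).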
