Divide element by sum of groupby in dask without setting index for every column
I have code implemented in pandas, but I am having trouble converting it to dask because I need to use set_index(). What is the best workaround? I am using dask because I need to scale this to much larger dataframes.
I am looking to return a dataframe where each element is divided by the column-wise sum of its group. Here is an example dataframe:
import pandas as pd

data = [
    [1, 4, 2, 1],
    [4, 4, 0, -1],
    [2, 3, 1, 6],
    [-2, 1, 0, -1],
    [6, -3, -2, -1],
    [1, 0, 5, 5],
]
df = pd.DataFrame(data)
df['lab_id'] = ['a', 'b', 'a', 'b', 'a', 'c']  # group label for each row
df
   0  1  2  3 lab_id
0  1  4  2  1      a
1  4  4  0 -1      b
2  2  3  1  6      a
3 -2  1  0 -1      b
4  6 -3 -2 -1      a
5  1  0  5  5      c
Currently in pandas I do a groupby followed by sum to get a dataframe of per-group totals:
sum_df = df.groupby('lab_id').sum()
sum_df
        0  1  2  3
lab_id
a       9  4  1  6
b       2  5  0 -2
c       1  0  5  5
Then I set the index of the original dataframe and divide by the sum dataframe:
df.set_index('lab_id')/sum_df
               0     1    2         3
lab_id
a       0.111111  1.00  2.0  0.166667
a       0.222222  0.75  1.0  1.000000
a       0.666667 -0.75 -2.0 -0.166667
b       2.000000  0.80  NaN  0.500000
b      -1.000000  0.20  NaN  0.500000
c       1.000000   NaN  1.0  1.000000
The main problem is that I am having a huge issue setting the index in dask, which explicitly advises against using the set_index() and reset_index() methods. I simply can't find a way around doing so!
I have tried many arcane ways to set the index outside of dask, such as creating a new dataframe with the index already set and a row of dummy data, then iteratively assigning the columns from the old dataframe (this is some of the worst code I've written).
Try with transform, which returns the group sums aligned to the original rows, so no index needs to be set:
cols = [0, 1, 2, 3]  # numeric columns only, so lab_id stays out of the division
df[cols] = df[cols] / df.groupby('lab_id')[cols].transform('sum')
df
Out[767]:
          0     1    2         3 lab_id
0  0.111111  1.00  2.0  0.166667      a
1  2.000000  0.80  NaN  0.500000      b
2  0.222222  0.75  1.0  1.000000      a
3 -1.000000  0.20  NaN  0.500000      b
4  0.666667 -0.75 -2.0 -0.166667      a
5  1.000000   NaN  1.0  1.000000      c
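Since the question is about dask, here is a rough sketch of another way to avoid set_index(): compute the group sums once (one row per lab_id, so the table should stay small), then divide each chunk of rows against that table by lookup. The helper name divide_by_group_sum and its key parameter are made up for illustration, and the two row slices below only stand in for dask partitions; this is plain pandas, not tested against a real dask dataframe.

```python
import pandas as pd


def divide_by_group_sum(part, sums, cols, key="lab_id"):
    """Divide `cols` of one chunk by the matching rows of `sums`.

    `sums` is a small pandas DataFrame indexed by the group key.
    The chunk itself never needs to be indexed by `key`.
    (Hypothetical helper, written for this example.)
    """
    # Look up each row's group total; reindex repeats rows of `sums` as needed
    # and yields NaN for any key that is missing from `sums`.
    denom = sums[cols].reindex(index=part[key]).to_numpy()
    out = part.copy()
    # Positional division: row i of the chunk by row i of `denom`.
    out[cols] = part[cols] / denom
    return out


df = pd.DataFrame(
    [[1, 4, 2, 1], [4, 4, 0, -1], [2, 3, 1, 6],
     [-2, 1, 0, -1], [6, -3, -2, -1], [1, 0, 5, 5]]
)
df["lab_id"] = ["a", "b", "a", "b", "a", "c"]
cols = [0, 1, 2, 3]

# Group totals over the whole dataset.
sums = df.groupby("lab_id")[cols].sum()

# Stand-in for partitions: two row slices processed independently.
chunks = [df.iloc[:3], df.iloc[3:]]
result = pd.concat([divide_by_group_sum(c, sums, cols) for c in chunks])
print(result)
```

With dask, the same idea would presumably be sums = ddf.groupby('lab_id')[cols].sum().compute() followed by ddf.map_partitions(divide_by_group_sum, sums, cols), where ddf is the dask version of df. That part is an assumption on my side, and it only makes sense when the number of distinct lab_id values is small enough for sums to fit in memory.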