I have a DataFrame as below:
source =

HM  IM  Ratio
A   B   50%
A   C   20%
A   D   30%
E   B   40%
E   C   20%
E   F   40%
H   C   50%
H   E   10%
H   G   40%
G   B   80%
G   D   10%
J   B   10%
J   H   80%
J   X   5%
J   E   5%
I want to know, for each item in the 'HM' column, its percentage of total "C". For instance:
total C% in 'H' = 50% (direct C) + 10% (E) * 20% (C within E) = 52%
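For reference, the sample above can be reproduced as a DataFrame. This is a sketch assuming the 'Ratio' column is parsed from percent strings into floats, which the multiplication in both approaches below relies on:

```python
import pandas as pd

# Sample data from the question; 'Ratio' is converted from
# percent strings ("50%") into floats (0.5) so it can be multiplied.
source = pd.DataFrame(
    {
        "HM": ["A", "A", "A", "E", "E", "E", "H", "H", "H",
               "G", "G", "J", "J", "J", "J"],
        "IM": ["B", "C", "D", "B", "C", "F", "C", "E", "G",
               "B", "D", "B", "H", "X", "E"],
        "Ratio": ["50%", "20%", "30%", "40%", "20%", "40%", "50%", "10%",
                  "40%", "80%", "10%", "10%", "80%", "5%", "5%"],
    }
)
source["Ratio"] = source["Ratio"].str.rstrip("%").astype(float) / 100
```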
I built a function using recursion, shown below:

root = ['C']
BPB = []
BPB_ratio = {}

def spB(mat, root, ratio, level, lay):
    items = source.loc[source['HM'] == mat, 'IM'].tolist()
    for item in items:
        items_item = source.loc[source['HM'] == item, 'IM'].tolist()
        item_ratio = source.loc[(source['HM'] == mat) & (source['IM'] == item), 'Ratio'].tolist()[0]
        BPB.append([level, item, ratio * item_ratio])
        if item in root:
            # accumulate the ratio reaching the root at this level
            BPB_ratio[level] = BPB_ratio.get(level, 0) + ratio * item_ratio
            continue
        if len(items_item) == 0:
            continue
        nlevel = level + 1
        spB(item, root, ratio * item_ratio, nlevel, lay)
    if lay == 0:
        return sum(BPB_ratio.values())
    else:
        return BPB_ratio[lay]

for ss in set(source['HM'].tolist()):
    BPB.clear()        # reset the accumulators for each new root
    BPB_ratio.clear()
    percent = spB(ss, root, 1, 0, 0)
    print(ss, percent)
It gives me correct results; however, it is far too slow. My source DataFrame has nearly 60,000 rows, and traversing the entire DataFrame this way takes an extremely long time. Is there a better solution than recursion?
I would try to use merge on the dataframe instead of using recursion.
First I would define a function that computes paths with one intermediate step from your dataframe:
def onestep(df):
    df2 = df.merge(df, left_on='IM', right_on='HM')
    df2['Ratio'] = df2['Ratio_x'] * df2['Ratio_y']  # compute resulting ratio
    # only keep relevant columns and rename them
    df2 = df2[['HM_x', 'IM_y', 'Ratio']].rename(
        columns={'HM_x': 'HM', 'IM_y': 'IM'})
    # sum up paths with same origin and destination
    return df2.groupby(['HM', 'IM']).sum().reset_index()
With your sample, we can see:
>>> onestep(df)
HM IM Ratio
0 H B 0.36
1 H C 0.02
2 H D 0.04
3 H F 0.04
4 J B 0.02
5 J C 0.41
6 J E 0.08
7 J F 0.02
8 J G 0.32
We correctly get H->C (through E) at 2%.
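That 2% is simply the product of the edge ratios along the path H->E->C, which is what the `Ratio_x * Ratio_y` line in `onestep` computes for every merged pair:

```python
# H -> C via E: multiply the ratios along the path
h_to_e = 0.10  # H -> E edge from the sample data
e_to_c = 0.20  # E -> C edge from the sample data
print(round(h_to_e * e_to_c, 4))  # 0.02
```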
Then I would try to iterate on onestep until the resulting dataframe is empty (or a maximum depth is reached), and finally combine everything:
dfs = [df]
temp = df
n = 10  # give up at depth 10 (adapt it to your actual use case)
for i in range(n):
    temp = onestep(temp)
    if len(temp) == 0:  # break when the df is empty
        break
    dfs.append(temp)
else:
    # we gave up before exploring all the paths: warn user
    print(f"BEWARE: exiting after {n} steps")
resul = pd.concat(dfs, ignore_index=True).groupby(
    ['HM', 'IM']).sum().reset_index()
With your sample data it gives (iteration at step 2 gave an empty dataframe):
HM IM Ratio
0 A B 0.50
1 A C 0.20
2 A D 0.30
3 E B 0.40
4 E C 0.20
5 E F 0.40
6 G B 0.80
7 G D 0.10
8 H B 0.36
9 H C 0.52
10 H D 0.04
11 H E 0.10
12 H F 0.04
13 H G 0.40
14 J B 0.12
15 J C 0.41
16 J E 0.13
17 J F 0.02
18 J G 0.32
19 J H 0.80
20 J X 0.05
And we correctly find H->C at 52%.
I cannot be sure of the real efficiency on a large dataframe, because it will depend on the actual graph complexity...
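Once `resul` is built, answering the original question (the percentage of total "C" for each item in 'HM') is a simple filter. A minimal sketch, using a hypothetical stand-in for `resul` with the C-rows from the combined output above:

```python
import pandas as pd

# Stand-in for the combined result `resul`; in practice it comes from
# concatenating the original df with the onestep() iterations.
resul = pd.DataFrame(
    {
        "HM": ["A", "E", "H", "J"],
        "IM": ["C", "C", "C", "C"],
        "Ratio": [0.20, 0.20, 0.52, 0.41],
    }
)

# Percentage of total "C" for each item in 'HM'
c_share = resul.loc[resul["IM"] == "C", ["HM", "Ratio"]].set_index("HM")["Ratio"]
print(c_share.to_dict())  # {'A': 0.2, 'E': 0.2, 'H': 0.52, 'J': 0.41}
```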