
How to build an efficient function to calculate a specific element's percentage in a multi-level nested structure?

I have a DataFrame as below:

source =

HM    IM    Ratio 
A     B     50%
A     C     20%
A     D     30% 
E     B     40%
E     C     20%
E     F     40%
H     C     50%
H     E     10%
H     G     40%
G     B     80% 
G     D     10%
J     B     10%
J     H     80%
J     X     5%
J     E     5%

For each item in the 'HM' column, I want to know its total percentage of "C", for instance:

total C% in 'H' = 50%(C) + 10%(E) * 20%(C) = 52%
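
For reference, the sample above can be rebuilt as a DataFrame like this (a sketch that assumes 'Ratio' is stored as a numeric fraction such as 0.5 rather than the string "50%", so it can be multiplied):

import pandas as pd

# Sample data from the table above; Ratio is assumed to be a numeric fraction
source = pd.DataFrame({
    'HM':    ['A', 'A', 'A', 'E', 'E', 'E', 'H', 'H', 'H', 'G', 'G', 'J', 'J', 'J', 'J'],
    'IM':    ['B', 'C', 'D', 'B', 'C', 'F', 'C', 'E', 'G', 'B', 'D', 'B', 'H', 'X', 'E'],
    'Ratio': [0.5, 0.2, 0.3, 0.4, 0.2, 0.4, 0.5, 0.1, 0.4, 0.8, 0.1, 0.1, 0.8, 0.05, 0.05],
})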

I built a function using recursion, shown below:

root = ['C']          # target item(s) whose share we want
BPB = []              # every path explored: [level, item, cumulative ratio]
BPB_ratio = {}        # contribution of the target, keyed by recursion level

def spB(mat, root, ratio, level, lay):
    items = source.loc[source['HM'] == mat, 'IM'].tolist()
    for item in items:
        items_item = source.loc[source['HM'] == item, 'IM'].tolist()
        item_ratio = source.loc[(source['HM'] == mat) & (source['IM'] == item), 'Ratio'].tolist()[0]
        BPB.append([level, item, ratio * item_ratio])
        if item in root:
            # accumulate this path's contribution (the original '=+' only re-assigned)
            BPB_ratio[level] = BPB_ratio.get(level, 0) + ratio * item_ratio
            continue
        if len(items_item) == 0:
            continue           # leaf node, nothing further to expand
        else:
            nlevel = level + 1
            spB(item, root, ratio * item_ratio, nlevel, lay)
    if lay == 0:
        return sum(BPB_ratio.values())
    else:
        return BPB_ratio[lay]

for ss in list(set(source['HM'].tolist())):
    BPB_ratio.clear()          # reset the accumulator for each 'HM' root
    percent = spB(ss, root, 1, 0, 0)
    print(ss, percent)

It gives me correct results; however, it is far too slow. My source DataFrame has nearly 60,000 rows, and traversing the entire DataFrame this way takes an extremely long time. Are there better solutions than recursion?

I would try using merge on the DataFrame instead of recursion.

First I would define a function that computes paths with one intermediate step from your dataframe:

def onestep(df):
    df2 = df.merge(df, left_on='IM', right_on='HM')
    df2['Ratio'] = df2['Ratio_x'] * df2['Ratio_y']    # compute resulting ratio
    # only keep relevant columns and rename them
    df2 = df2[['HM_x', 'IM_y', 'Ratio']].rename(
        columns={'HM_x': 'HM', 'IM_y': 'IM'})
    # sum up paths with same origin and destination
    return df2.groupby(['HM', 'IM']).sum().reset_index()

With your sample, we can see:

>>> onestep(df)
  HM IM  Ratio
0  H  B   0.36
1  H  C   0.02
2  H  D   0.04
3  H  F   0.04
4  J  B   0.02
5  J  C   0.41
6  J  E   0.08
7  J  F   0.02
8  J  G   0.32

We correctly get H->C (through E) at 2% (10% × 20%).

Then I would try to iterate on onestep until the resulting dataframe is empty (or a maximum depth is reached), and finally combine everything:

dfs = [df]
temp=df
n = 10               # give up at depth 10 (adapt it to your actual use case)
for i in range(n):
    temp = onestep(temp)
    if (len(temp) == 0):    # break when the df is empty
        break
    dfs.append(temp)
else:
    # we gave up before exploring all the paths: warn user
    print(f"BEWARE: exiting after {n} steps")

resul = pd.concat(dfs, ignore_index=True).groupby(
    ['HM', 'IM']).sum().reset_index()

With your sample data it gives (iteration at step 2 gave an empty dataframe):

   HM IM  Ratio
0   A  B   0.50
1   A  C   0.20
2   A  D   0.30
3   E  B   0.40
4   E  C   0.20
5   E  F   0.40
6   G  B   0.80
7   G  D   0.10
8   H  B   0.36
9   H  C   0.52
10  H  D   0.04
11  H  E   0.10
12  H  F   0.04
13  H  G   0.40
14  J  B   0.12
15  J  C   0.41
16  J  E   0.13
17  J  F   0.02
18  J  G   0.32
19  J  H   0.80
20  J  X   0.05

And we correctly find H->C at 52%
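
To answer the original question directly, you can then filter the combined result for the target item. A minimal sketch, assuming the resul DataFrame built above and that only the share of 'C' is needed:

# Share of 'C' reachable from each 'HM' item, taken from the combined result
c_share = resul.loc[resul['IM'] == 'C', ['HM', 'Ratio']].reset_index(drop=True)
print(c_share)
# Expected, from the table above:
#   HM  Ratio
# 0  A   0.20
# 1  E   0.20
# 2  H   0.52
# 3  J   0.41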


I cannot be sure of the real efficiency on a large DataFrame, because it will depend on the actual graph complexity...
