How to create multi-relational edge-list from pandas dataframe?

I have a pandas data frame like this:

 from itertools import combinations
 import pandas as pd
 d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
 df_rel = pd.DataFrame(data=d)
 df_rel
       col1 col2
    0   a   XX
    1   b   XX
    2   c   XY
    3   d   XX
    4   a   YY
    5   b   YY
    6   d   XY

The unique nodes are:

uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)

For each Relationship the source (Src) and destination (Dst) can be generated:

df1 = pd.DataFrame(
    data=list(combinations(uniq_nodes, 2)), 
    columns=['Src', 'Dst'])
df1
  Src   Dst
0   a   b
1   a   c
2   a   d
3   b   c
4   b   d
5   c   d

I need a new dataframe, newdf, built from the col2 values that the nodes in df_rel share. The Relationship column comes from col2. Thus the desired edge-list dataframe is:

newdf

   Src  Dst Relationship
0   a   b   XX
1   a   b   YY
2   a   d   XX
3   c   d   XY

Is there a fast way to achieve this? The original dataframe has 30,000 rows.

I took this approach. It works, but it is still not very fast for a large dataframe.

 from itertools import combinations
 import pandas as pd
 d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
 df_rel = pd.DataFrame(data=d)
 df_rel
       col1 col2
    0   a   XX
    1   b   XX
    2   c   XY
    3   d   XX
    4   a   YY
    5   b   YY
    6   d   XY   

uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
df1 = pd.DataFrame(
            data=list(combinations(uniq_nodes, 2)),
            columns=['Src', 'Dst'])
     
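# Rows whose node can act as a source; rename returns a new frame,
# which avoids modifying a slice of df_rel in place
filter1 = df_rel['col1'].isin(df1['Src'])
src_df = df_rel[filter1].rename(columns={'col1': 'Src'})
# Rows whose node can act as a destination
filter2 = df_rel['col1'].isin(df1['Dst'])
dst_df = df_rel[filter2].rename(columns={'col1': 'Dst'})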
new_df = pd.merge(src_df,dst_df, on = "col2",how="inner")
print ("after removing the duplicates")
new_df = new_df.drop_duplicates()
print(new_df.shape)
print ("after removing self loop")
new_df = new_df[new_df['Src'] != new_df['Dst']]
new_df.rename(columns={'col2':'Relationship'}, inplace=True)
print(new_df.shape)
print (new_df)
           Src Relationship Dst
        0   a           XX   b
        1   a           XX   d
        3   b           XX   d
        5   c           XY   d
        6   a           YY   b

You need to loop through the rows of df1 and, for each one, find the rows of df_rel whose col1 matches df1['Src'] and df1['Dst']. Once you have the df_rel['col2'] values for Src and for Dst, compare them, and for every value they share create a row in newdf. Try this, and check whether it performs well enough on large datasets.

Data setup (same as yours):

from itertools import combinations
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'], 'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)

uniq_nodes = df_rel['col1'].unique()

df1 = pd.DataFrame(data=list(combinations(uniq_nodes, 2)),  columns=['Src', 'Dst'])

Code:

rows = []  # collect matching edges here; DataFrame.append was removed in pandas 2.0
for _, row in df1.iterrows():
    # col2 values attached to the source node and to the destination node
    src = df_rel.loc[df_rel['col1'] == row['Src'], 'col2'].to_list()
    dst = df_rel.loc[df_rel['col1'] == row['Dst'], 'col2'].to_list()
    for x in src:
        if x in dst:
            rows.append({'Src': row['Src'], 'Dst': row['Dst'], 'Relationship': x})

# Build the dataframe once, after the loop
newdf = pd.DataFrame(rows, columns=['Src', 'Dst', 'Relationship'])

print(newdf)

Result:

  Src Dst Relationship
0   a   b           XX
1   a   b           YY
2   a   d           XX
3   b   d           XX
4   c   d           XY
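
Since the loop above scans df_rel twice for every pair, it may still be slow at 30,000 rows. A possible alternative, offered only as a sketch and not benchmarked here, is to join df_rel to itself on col2 so that pandas does the matching. The suffixes `_src` and `_dst` and the names `pairs`, `merged` and `edges` are just illustrative choices. The sketch assumes the node labels can be compared with `<`, which is how it keeps each unordered pair once and drops self-loops.

```python
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'],
     'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)

# Drop repeated (node, relationship) rows so each edge is produced only once.
pairs = df_rel.drop_duplicates()

# Self-join on col2: any two nodes that share a col2 value land on one row.
merged = pairs.merge(pairs, on='col2', suffixes=('_src', '_dst'))

# Keep each unordered pair once (Src < Dst), which also removes self-loops.
edges = merged[merged['col1_src'] < merged['col1_dst']]

# Rename, reorder and sort the columns into the edge-list layout.
edges = (edges
         .rename(columns={'col1_src': 'Src',
                          'col1_dst': 'Dst',
                          'col2': 'Relationship'})
         [['Src', 'Dst', 'Relationship']]
         .sort_values(['Src', 'Dst', 'Relationship'])
         .reset_index(drop=True))

print(edges)
```

On the sample data this gives the same five edges as the result above. Note that Src and Dst are ordered by label comparison rather than by first appearance in col1, and that a col2 value shared by many nodes still produces a number of rows quadratic in the size of that group, so memory use is worth checking on the real data.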
