
How to obtain symmetrical matrix from dictionary in Python

I have a basic question about data manipulation in Python.

I have the following dictionary:

mydict={('A', 'E'): 23972,
 ('A', 'D'): 10730,
 ('A', 'B'): 14748,
 ('A', 'C'): 3424,
 ('E', 'D'): 3294,
 ('E', 'B'): 16016,
 ('E', 'C'): 3373,
 ('D', 'B'): 69734,
 ('D', 'C'): 4662,
 ('B', 'C'): 159161}

If you look carefully, this is half of a symmetric matrix with a zero diagonal (the 0s are not included). My final goal is to build a pandas DataFrame containing the full matrix.

Tentative solution

I thought about "unpacking" the dictionary obtaining 5 lists, one per label, with all the values related to the other labels, adding a 0 on the self-position of the list. For label "A" and "B", the desired result would be:

A=[0,mydict[('A','B')],mydict[('A','C')],mydict[('A','D')],mydict[('A','E')]]
B=[mydict[('A','B')],0,mydict[('B','C')],mydict[('D','B')],mydict[('E','B')]]

and so on for C, D, E. Notice that, in B, the 4th and 5th elements are mydict[('D','B')] and mydict[('E','B')], because the keys ('B','D') and ('B','E') simply don't exist in mydict.

This way I could easily populate a dataframe from these lists:

import pandas as pd
df=pd.DataFrame(columns=['A','B','C','D','E'])
df['A']=A
df['B']=B

Question

I am not quite sure how to "unpack" mydict into those lists, or into any other container that could help me build the matrix. Any suggestions?
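For reference, the list-per-label idea described in the question can be sketched roughly as follows. This is only a minimal sketch: `lookup` and `labels` are hypothetical names introduced here, and the helper assumes every off-diagonal pair is stored in exactly one of its two possible orders.

```python
import pandas as pd

mydict = {('A', 'E'): 23972, ('A', 'D'): 10730, ('A', 'B'): 14748,
          ('A', 'C'): 3424, ('E', 'D'): 3294, ('E', 'B'): 16016,
          ('E', 'C'): 3373, ('D', 'B'): 69734, ('D', 'C'): 4662,
          ('B', 'C'): 159161}

# every label that appears in any key, in sorted order
labels = sorted({label for pair in mydict for label in pair})

def lookup(a, b):
    """Hypothetical helper: value for the pair in either order, 0 on the diagonal."""
    if a == b:
        return 0
    # assumption: exactly one of (a, b) and (b, a) is a key of mydict
    return mydict[(a, b)] if (a, b) in mydict else mydict[(b, a)]

# one list per label, as in the question, collected in a dict of columns
columns = {col: [lookup(col, row) for row in labels] for col in labels}
df = pd.DataFrame(columns, index=labels)
print(df)
```

Building all columns first and passing them to the DataFrame constructor avoids assigning to an empty frame one column at a time.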

One option is to reconstruct the dictionary in full matrix format and then pivot it with pandas:

import pandas as pd
mydict={('A', 'E'): 23972,
 ('A', 'D'): 10730,
 ('A', 'B'): 14748,
 ('A', 'C'): 3424,
 ('E', 'D'): 3294,
 ('E', 'B'): 16016,
 ('E', 'C'): 3373,
 ('D', 'B'): 69734,
 ('D', 'C'): 4662,
 ('B', 'C'): 159161}
 
 
# construct the full dictionary
newdict = {}

for (k1, k2), v in mydict.items():
    newdict[k1, k2] = v
    newdict[k2, k1] = v
    newdict[k1, k1] = 0
    newdict[k2, k2] = 0

# pivot the result from long to wide
pd.Series(newdict).reset_index().pivot(index='level_0', columns='level_1', values=0)

#level_1      A       B       C      D      E
#level_0                                     
#A            0   14748    3424  10730  23972
#B        14748       0  159161  69734  16016
#C         3424  159161       0   4662   3373
#D        10730   69734    4662      0   3294
#E        23972   16016    3373   3294      0

Or as commented by @Ch3steR, you can also just do pd.Series(newdict).unstack() for the pivot.
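A minimal sketch of that unstack variant, assuming newdict is built exactly as above (the name `wide` is introduced here only for illustration):

```python
import pandas as pd

mydict = {('A', 'E'): 23972, ('A', 'D'): 10730, ('A', 'B'): 14748,
          ('A', 'C'): 3424, ('E', 'D'): 3294, ('E', 'B'): 16016,
          ('E', 'C'): 3373, ('D', 'B'): 69734, ('D', 'C'): 4662,
          ('B', 'C'): 159161}

# mirror every pair and add the zero diagonal, as in the answer above
newdict = {}
for (k1, k2), v in mydict.items():
    newdict[k1, k2] = v
    newdict[k2, k1] = v
    newdict[k1, k1] = 0
    newdict[k2, k2] = 0

# tuple keys become a two-level index; unstack moves the second level to columns
wide = pd.Series(newdict).unstack()
print(wide)
```

Because newdict already holds every (row, column) combination, no fill value should be needed here.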


What I can think of is to populate an array with the dict values first and then construct the dataframe from it.

mydict={('A', 'E'): 23972,
 ('A', 'D'): 10730,
 ('A', 'B'): 14748,
 ('A', 'C'): 3424,
 ('E', 'D'): 3294,
 ('E', 'B'): 16016,
 ('E', 'C'): 3373,
 ('D', 'B'): 69734,
 ('D', 'C'): 4662,
 ('B', 'C'): 159161}
 
import numpy as np
import pandas as pd

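ss = 'ABCDE'  # label order for rows and columns
a = np.zeros((len(ss), len(ss)), dtype=int)

for (f, s), v in mydict.items():
    fi = ss.index(f)
    si = ss.index(s)
    # write the value on both sides of the diagonal
    a[fi, si] = v
    a[si, fi] = v

# if you want to keep the diagonal
df = pd.DataFrame(a, index=list(ss), columns=list(ss))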

# if you want to remove diagonal:
no_diag = np.delete(a,range(0,a.shape[0]**2,(a.shape[0]+1))).reshape(a.shape[0],(a.shape[1]-1))

df = pd.DataFrame(no_diag)

Here is a straightforward solution, which should not take too much time to run either:

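import numpy as np
import pandas as pd

# every label that appears in any key, sorted
cols = np.unique(list(mydict.keys()))

df = pd.DataFrame(0, columns=cols, index=cols)

# fill the half of the matrix that the dictionary provides
for (row, col), value in mydict.items():
    df.loc[row, col] = value

# adding the transpose mirrors the values across the zero diagonal
df = df + df.T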
print(df)
       A       B       C      D      E
A      0   14748    3424  10730  23972
B  14748       0  159161  69734  16016
C   3424  159161       0   4662   3373
D  10730   69734    4662      0   3294
E  23972   16016    3373   3294      0

Benchmarks

Adding benchmarks (input of length 303, 13-inch MacBook Pro):

import itertools
import numpy as np

kk = 'ABCDEFGHIJKLMNOPQURSUVWXYZ'
mydict = {i:np.random.randint(1,10000) for i in itertools.combinations(kk,2)}
len(mydict)
#303
  • fusion's approach - 392 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • Psidom's approach - 4.95 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Akshay Sehgal's approach - 34.8 ms ± 884 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • Ben.T's approach - 4.01 ms ± 282 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Fusion's approach is the fastest by a long shot.
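The timings above look like IPython %timeit output. Outside IPython, a comparable measurement could be sketched with the standard timeit module. This is only a rough sketch, not the setup used for the numbers above: `array_fill` and `series_unstack` are hypothetical wrappers around two of the approaches on this page, and the input uses the 26 distinct uppercase letters (325 pairs) rather than the 303-entry input above.

```python
import itertools
import timeit

import numpy as np
import pandas as pd

labels = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'  # assumption: 26 distinct labels
rng = np.random.default_rng(0)
mydict = {pair: int(rng.integers(1, 10000))
          for pair in itertools.combinations(labels, 2)}

def array_fill(d, labels):
    """Fill a NumPy array on both sides of the diagonal, then wrap it."""
    pos = {label: i for i, label in enumerate(labels)}
    a = np.zeros((len(labels), len(labels)), dtype=int)
    for (f, s), v in d.items():
        a[pos[f], pos[s]] = v
        a[pos[s], pos[f]] = v
    return pd.DataFrame(a, index=list(labels), columns=list(labels))

def series_unstack(d):
    """Unstack the half matrix, reindex to all labels, add the transpose."""
    half = pd.Series(d).unstack(fill_value=0)
    idx = half.index.union(half.columns)
    full = half.reindex(index=idx, columns=idx, fill_value=0)
    return full + full.T

runs = 50
candidates = [('array_fill', lambda: array_fill(mydict, labels)),
              ('series_unstack', lambda: series_unstack(mydict))]
for name, fn in candidates:
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f'{name}: {seconds * 1e3:.3f} ms per call')
```

Absolute numbers will differ by machine and library version, so only the relative ordering is worth comparing.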

First create a Series from the dictionary and then unstack it to get a dataframe. Take the union of the index and the columns so that both can be reindexed with all possible labels. Finally, add the transpose of this dataframe to itself to fill in the missing values.

df_ = pd.Series(mydict).unstack(fill_value=0)
idx = df_.index.union(df_.columns)
df_ = df_.reindex(index=idx, columns=idx, fill_value=0)
df_ += df_.T

print(df_)
       A       B       C      D      E
A      0   14748    3424  10730  23972
B  14748       0  159161  69734  16016
C   3424  159161       0   4662   3373
D  10730   69734    4662      0   3294
E  23972   16016    3373   3294      0
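Whichever approach is used, a quick sanity check on the result may be worthwhile. The sketch below rebuilds the frame with the unstack approach and then tests three properties; the names `is_symmetric`, `zero_diagonal` and `all_pairs_present` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

mydict = {('A', 'E'): 23972, ('A', 'D'): 10730, ('A', 'B'): 14748,
          ('A', 'C'): 3424, ('E', 'D'): 3294, ('E', 'B'): 16016,
          ('E', 'C'): 3373, ('D', 'B'): 69734, ('D', 'C'): 4662,
          ('B', 'C'): 159161}

# rebuild the full matrix as in the answer above
df_ = pd.Series(mydict).unstack(fill_value=0)
idx = df_.index.union(df_.columns)
df_ = df_.reindex(index=idx, columns=idx, fill_value=0)
df_ = df_ + df_.T

values = df_.to_numpy()
# the matrix should equal its own transpose
is_symmetric = np.array_equal(values, values.T)
# no label is paired with itself, so the diagonal should be all zeros
zero_diagonal = not np.diag(values).any()
# every original entry should be recoverable from the frame
all_pairs_present = all(df_.loc[a, b] == v for (a, b), v in mydict.items())
print(is_symmetric, zero_diagonal, all_pairs_present)
```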


 