How to obtain symmetrical matrix from dictionary in Python

I have a basic question about data manipulation in Python.

I have the following dictionary:

mydict={('A', 'E'): 23972,
 ('A', 'D'): 10730,
 ('A', 'B'): 14748,
 ('A', 'C'): 3424,
 ('E', 'D'): 3294,
 ('E', 'B'): 16016,
 ('E', 'C'): 3373,
 ('D', 'B'): 69734,
 ('D', 'C'): 4662,
 ('B', 'C'): 159161}

If you look carefully, this is half of a symmetrical matrix with a null diagonal (the 0s are not included). My final goal is to write a pandas DataFrame with the full matrix.

Tentative solution

I thought about "unpacking" the dictionary into 5 lists, one per label, each holding the values related to the other labels, with a 0 at the label's own position. For labels "A" and "B", the desired result would be:

A=[0,mydict['A','B'],mydict['A','C'],mydict['A','D'],mydict['A','E']]
B=[mydict['A','B'],0,mydict['B','C'],mydict['D','B'],mydict['E','B']]

and so on for C, D, E. Notice that, in B, the 4th and 5th elements are mydict['D','B'] and mydict['E','B'], because mydict['B','D'] and mydict['B','E'] simply don't exist in mydict.

This way I could easily populate a DataFrame from these lists:

import pandas as pd
df=pd.DataFrame(columns=['A','B','C','D','E'])
df['A']=A
df['B']=B

Question

I am not quite sure how to "unpack" mydict into those lists, or into any other container that could help me build the matrix. Any suggestions?

One option is to reconstruct the dictionary in full matrix format and then pivot it with pandas:

import pandas as pd
mydict={('A', 'E'): 23972,
 ('A', 'D'): 10730,
 ('A', 'B'): 14748,
 ('A', 'C'): 3424,
 ('E', 'D'): 3294,
 ('E', 'B'): 16016,
 ('E', 'C'): 3373,
 ('D', 'B'): 69734,
 ('D', 'C'): 4662,
 ('B', 'C'): 159161}
 
 
# construct the full dictionary
newdict = {}

for (k1, k2), v in mydict.items():
    newdict[k1, k2] = v
    newdict[k2, k1] = v
    newdict[k1, k1] = 0
    newdict[k2, k2] = 0

# pivot the result from long to wide
pd.Series(newdict).reset_index().pivot(index='level_0', columns='level_1', values=0)

#level_1      A       B       C      D      E
#level_0                                     
#A            0   14748    3424  10730  23972
#B        14748       0  159161  69734  16016
#C         3424  159161       0   4662   3373
#D        10730   69734    4662      0   3294
#E        23972   16016    3373   3294      0

Or, as commented by @Ch3steR, you can also just do pd.Series(newdict).unstack() for the pivot.
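For reference, the unstack() variant mentioned above can be sketched as follows. It reuses the same mirroring loop; the name `full` is only illustrative and does not come from the original answer.

```python
import pandas as pd

mydict = {('A', 'E'): 23972, ('A', 'D'): 10730, ('A', 'B'): 14748,
          ('A', 'C'): 3424, ('E', 'D'): 3294, ('E', 'B'): 16016,
          ('E', 'C'): 3373, ('D', 'B'): 69734, ('D', 'C'): 4662,
          ('B', 'C'): 159161}

# mirror every pair and zero the diagonal, as in the loop above
newdict = {}
for (k1, k2), v in mydict.items():
    newdict[k1, k2] = v
    newdict[k2, k1] = v
    newdict[k1, k1] = 0
    newdict[k2, k2] = 0

# a Series built from tuple keys gets a two-level MultiIndex;
# unstack() moves the second level into the columns
full = pd.Series(newdict).unstack()
print(full)
```

Because newdict already contains every (row, column) pair, the unstacked frame has no missing cells and needs no fill value.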


What I can think of is to populate an array with the dict values first, then construct the DataFrame.

mydict={('A', 'E'): 23972,
 ('A', 'D'): 10730,
 ('A', 'B'): 14748,
 ('A', 'C'): 3424,
 ('E', 'D'): 3294,
 ('E', 'B'): 16016,
 ('E', 'C'): 3373,
 ('D', 'B'): 69734,
 ('D', 'C'): 4662,
 ('B', 'C'): 159161}
 
import numpy as np
import pandas as pd

a = np.full((5,5),0)
ss = 'ABCDE'

for k, i in mydict.items():
    f,s = k 
    fi = ss.index(f)
    si = ss.index(s)
    a[fi,si] = i
    a[si,fi] = i

# if you want to keep the diagonal
df = pd.DataFrame(a)

# if you want to remove diagonal:
no_diag = np.delete(a,range(0,a.shape[0]**2,(a.shape[0]+1))).reshape(a.shape[0],(a.shape[1]-1))

df = pd.DataFrame(no_diag)
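The DataFrames above carry default integer labels. A minimal sketch of the same array-filling idea that keeps the letter labels, assuming the labels can be collected from the dictionary keys (the names `labels`, `pos` and `labeled` are illustrative, not part of the original answer):

```python
import numpy as np
import pandas as pd

mydict = {('A', 'E'): 23972, ('A', 'D'): 10730, ('A', 'B'): 14748,
          ('A', 'C'): 3424, ('E', 'D'): 3294, ('E', 'B'): 16016,
          ('E', 'C'): 3373, ('D', 'B'): 69734, ('D', 'C'): 4662,
          ('B', 'C'): 159161}

# collect the labels from the keys instead of hard-coding 'ABCDE'
labels = sorted({label for pair in mydict for label in pair})
pos = {label: i for i, label in enumerate(labels)}

# start from zeros so the diagonal is already 0
a = np.zeros((len(labels), len(labels)), dtype=int)
for (first, second), value in mydict.items():
    a[pos[first], pos[second]] = value
    a[pos[second], pos[first]] = value  # mirror into the other triangle

labeled = pd.DataFrame(a, index=labels, columns=labels)
print(labeled)
```

A dict lookup for the positions avoids the repeated linear scan that `ss.index(...)` performs, which may matter once there are many labels.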

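Here is a straightforward solution that should not take too much time to run either:

import numpy as np
import pandas as pd

# np.unique flattens the key tuples and returns the sorted unique labels
cols = np.unique(list(mydict.keys()))

df = pd.DataFrame(0, columns=cols, index=cols)

# fill one triangle: each key (row, col) addresses a single cell
for (row, col), value in mydict.items():
    df.loc[row, col] = value

# adding the transpose mirrors the values into the other triangle
df = df + df.T
print(df)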
       A       B       C      D      E
A      0   14748    3424  10730  23972
B  14748       0  159161  69734  16016
C   3424  159161       0   4662   3373
D  10730   69734    4662      0   3294
E  23972   16016    3373   3294      0

Benchmarks

Adding benchmarks (303-entry input, MacBook Pro 13):

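import itertools
import numpy as np
# note: this string has 'U' twice and no 'T', so duplicate key tuples
# collapse and the dict ends up with 303 entries instead of 325
kk = 'ABCDEFGHIJKLMNOPQURSUVWXYZ'
mydict = {i:np.random.randint(1,10000) for i in itertools.combinations(kk,2)}
len(mydict)
#303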
  • fusion's approach - 392 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • Psidom's approach - 4.95 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Akshay Sehgal's approach - 34.8 ms ± 884 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • Ben.T's approach - 4.01 ms ± 282 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Fusion's approach is the fastest by a long shot.
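For anyone who wants to repeat a comparison like this, here is a rough sketch using the standard-library timeit module. It is not the harness used for the numbers above: the function names `build_numpy` and `build_unstack` are hypothetical wrappers around two of the approaches, and the input uses a 26-letter alphabet with distinct labels (325 pairs), which is an assumption and differs from the 303-entry input above. Timings will vary by machine.

```python
import itertools
import timeit

import numpy as np
import pandas as pd


def build_numpy(d):
    """Array-filling approach: write each value at [i, j] and [j, i]."""
    labels = sorted({label for pair in d for label in pair})
    pos = {label: i for i, label in enumerate(labels)}
    a = np.zeros((len(labels), len(labels)), dtype=int)
    for (first, second), value in d.items():
        a[pos[first], pos[second]] = value
        a[pos[second], pos[first]] = value
    return pd.DataFrame(a, index=labels, columns=labels)


def build_unstack(d):
    """Dictionary-mirroring approach followed by Series.unstack()."""
    full = {}
    for (k1, k2), v in d.items():
        full[k1, k2] = v
        full[k2, k1] = v
        full[k1, k1] = 0
        full[k2, k2] = 0
    return pd.Series(full).unstack()


# assumed benchmark input: 26 distinct labels, one random value per pair
rng = np.random.default_rng(0)
keys = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
bench = {pair: int(rng.integers(1, 10000))
         for pair in itertools.combinations(keys, 2)}

runs = 20
for fn in (build_numpy, build_unstack):
    seconds = timeit.timeit(lambda: fn(bench), number=runs) / runs
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```

Checking that both functions return the same matrix before timing them is a cheap way to make sure the comparison is fair.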

First create a Series from the dictionary, then unstack it to get a DataFrame. Take the union of the index and the columns so that both can be reindexed with all possible values. Finally, add the transpose of this DataFrame to itself to fill in the missing values.

df_ = pd.Series(mydict).unstack(fill_value=0)
idx = df_.index.union(df_.columns)
df_ = df_.reindex(index=idx, columns=idx, fill_value=0)
df_ += df_.T

print(df_)
       A       B       C      D      E
A      0   14748    3424  10730  23972
B  14748       0  159161  69734  16016
C   3424  159161       0   4662   3373
D  10730   69734    4662      0   3294
E  23972   16016    3373   3294      0
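To see why the reindex step is needed, the three steps can be run separately. This is only an illustrative breakdown of the code above; the intermediate names `half`, `labels`, `square` and `sym` are not from the original answer.

```python
import pandas as pd

mydict = {('A', 'E'): 23972, ('A', 'D'): 10730, ('A', 'B'): 14748,
          ('A', 'C'): 3424, ('E', 'D'): 3294, ('E', 'B'): 16016,
          ('E', 'C'): 3373, ('D', 'B'): 69734, ('D', 'C'): 4662,
          ('B', 'C'): 159161}

# step 1: rows come from the first key element, columns from the second
half = pd.Series(mydict).unstack(fill_value=0)
# (4, 4): 'C' is never a first element and 'A' is never a second one
print(half.shape)

# step 2: give rows and columns the same full set of labels
labels = half.index.union(half.columns)
square = half.reindex(index=labels, columns=labels, fill_value=0)

# step 3: each pair is stored once, so adding the transpose
# fills the mirrored cell without double counting
sym = square + square.T
print(sym)
```

This relies on every pair appearing in mydict in only one order; if both ('A', 'B') and ('B', 'A') were present, the sum would add the two values together.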
