Pythonically create adjacency matrix from spreadsheet

I have a spreadsheet with lists of names of people that a particular person reported working with on a number of projects. If I import it into pandas as a DataFrame, it looks like this:

       1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']

I want to create an adjacency matrix that looks like this:

      Jane  Fred  Terry  Bob  Alex  Joe
Jane  0     2     0      1    0     2
Fred  1     0     0      0    1     0
Terry 0     0     0      1    0     0
Bob   1     0     1      0    0     1
Alex  0     1     0      0    0     0
Joe   2     0     0      0    0     0

In general, this matrix will NOT be symmetric, because people's reports are inconsistent with one another. So far I have built the adjacency matrix by looping through the DataFrame and incrementing the matrix elements accordingly. Apparently, looping through DataFrames is inefficient and NOT recommended, so does anyone have a suggestion on how this could be done more pythonically?
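For reference, a loop-based baseline of the kind described above might look roughly like this. It is only a sketch under assumptions: the names are used as the DataFrame index, and `adj`, `reporter` and `colleague` are illustrative names, not taken from the original code.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the names as the index (an assumption).
df = pd.DataFrame(
    {
        '1': [['Fred', 'Joe'], ['Alex'], np.nan, ['Joe'], ['Fred'], ['Jane']],
        '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']],
    },
    index=['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
)

# Square matrix of zeros; rows and columns follow the order of the reporters.
adj = pd.DataFrame(0, index=df.index, columns=df.index)

for reporter, row in df.iterrows():
    for cell in row:
        if not isinstance(cell, list):  # skip NaN cells (missing reports)
            continue
        for colleague in cell:
            # Assumes every mentioned colleague also appears as a reporter.
            adj.loc[reporter, colleague] += 1

print(adj)
```

Each row counts how often that person named each colleague, so the result need not be symmetric.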

This is the sample data I worked with.

df = pd.DataFrame({
    'Name': ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
    '1':[['Fred', 'Joe'], ['Alex'], np.nan,['Joe'], ['Fred'], ['Jane']],
    '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']]
})

df.head()
    Name            1                 2
0   Jane  [Fred, Joe]  [Joe, Fred, Bob]
1   Fred       [Alex]            [Jane]
2  Terry          NaN             [Bob]
3    Bob        [Joe]     [Jane, Terry]
4   Alex       [Fred]               NaN

I created the adjacency matrix using pandas in three simple steps.

First, I melted the data so that a single column holds all the connections between the different names, and dropped the variable column.

dff = df.melt(id_vars=['Name']).drop('variable', axis=1)
     Name             value
0    Jane       [Fred, Joe]
1    Fred            [Alex]
2   Terry               NaN
3     Bob             [Joe]
4    Alex            [Fred]
5     Joe            [Jane]
6    Jane  [Joe, Fred, Bob]
7    Fred            [Jane]
8   Terry             [Bob]
9     Bob     [Jane, Terry]
10   Alex               NaN
11    Joe            [Jane]

Second, I used the explode method to split each list into separate rows, one row per name.

dff = dff.explode('value')
     Name  value
0    Jane   Fred
0    Jane    Joe
1    Fred   Alex
2   Terry    NaN
3     Bob    Joe
4    Alex   Fred
5     Joe   Jane
6    Jane    Joe
6    Jane   Fred
6    Jane    Bob
7    Fred   Jane
8   Terry    Bob
9     Bob   Jane
9     Bob  Terry
10   Alex    NaN
11    Joe   Jane

Finally, to create the adjacency matrix I used pandas' crosstab, which counts how often each pair of values occurs in the two specified columns.

pd.crosstab(dff['Name'], dff['value'])
value  Alex  Bob  Fred  Jane  Joe  Terry
Name                                    
Alex      0    0     1     0    0      0
Bob       0    0     0     1    1      1
Fred      1    0     0     1    0      0
Jane      0    1     2     0    2      0
Joe       0    0     0     2    0      0
Terry     0    1     0     0    0      0
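Chained together, the three steps can be sketched as below. Note that crosstab sorts the labels alphabetically and only includes names that actually occur in the data. The final reindex is an addition here, not part of the steps above: it restores the order of the Name column and guarantees a square matrix. The variable names `names`, `edges` and `adj` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
    '1': [['Fred', 'Joe'], ['Alex'], np.nan, ['Joe'], ['Fred'], ['Jane']],
    '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']],
})

names = df['Name'].tolist()  # desired row/column order

# Steps 1 and 2: melt to one (Name, value) pair per report, then one row per colleague.
edges = (
    df.melt(id_vars='Name')
      .drop(columns='variable')
      .explode('value')
      .dropna(subset=['value'])  # drop the rows that came from NaN reports
)

# Step 3: count the pairs, then put rows and columns in the original order.
adj = (
    pd.crosstab(edges['Name'], edges['value'])
      .reindex(index=names, columns=names, fill_value=0)
)

print(adj)
```

With the sample data this should reproduce the matrix requested in the question, including its row and column order.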

Here is one approach:

import io
import ast
import pandas as pd

data = '''       1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']'''

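# parse the text table; NaN cells become '[]' so that literal_eval turns every cell into a list
# (if your columns already hold lists rather than string representations, skip the applymap;
# note that fillna does not accept a list, so NaN must be replaced with [] some other way)
df = pd.read_csv(io.StringIO(data), sep=r'\s\s+', engine='python').fillna('[]').applymap(ast.literal_eval)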
df['all'] = df['1']+df['2'] #merge lists of columns 1 and 2

df_edges = df[['all']].explode('all').reset_index() #create new df by exploding the combined list
df_edges = df_edges.groupby(['index', 'all'])['all'].count().reset_index(name="count") #groupby and count the pairs

df_edges.pivot(index='index', columns='all', values='count').fillna(0) #create adjacency matrix with pivot

Output:

index  Alex  Bob  Fred  Jane  Joe  Terry
Alex      0    0     1     0    0      0
Bob       0    0     0     1    1      1
Fred      1    0     0     1    0      0
Jane      0    1     2     0    2      0
Joe       0    0     0     2    0      0
Terry     0    1     0     0    0      0
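If the columns already hold Python lists (with NaN for missing reports) rather than strings, a variant along these lines may be easier. It is a sketch, not the answer's original code: the helper `as_list` and the column names `reporter` and `colleague` are made up for illustration. It uses groupby with size and unstack(fill_value=0), which keeps integer counts, whereas pivot followed by fillna(0) typically yields floats.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        '1': [['Fred', 'Joe'], ['Alex'], np.nan, ['Joe'], ['Fred'], ['Jane']],
        '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']],
    },
    index=['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
)


def as_list(cell):
    """Return the cell unchanged if it is a list, otherwise an empty list (covers NaN)."""
    return cell if isinstance(cell, list) else []


# Concatenate the two lists of every row into a single list per person.
combined = df['1'].map(as_list) + df['2'].map(as_list)

# One row per (reporter, colleague) mention; empty lists explode to NaN and are dropped.
edges = (
    combined.explode()
            .dropna()
            .rename('colleague')
            .rename_axis('reporter')
            .reset_index()
)

# Count each pair and spread the colleagues across the columns.
adj = edges.groupby(['reporter', 'colleague']).size().unstack(fill_value=0)

print(adj)
```

As with crosstab, rows and columns come out sorted alphabetically and only cover names that occur in the data.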
