
Error in creating graph and performing link prediction using networkx in python

I am trying to build a graph from a CSV file that contains the edges, plus the profession and age of each node. I then assign a community to each node and perform link prediction.

import networkx as nx
import csv
engineers1 = []
engineers2 = []
engineers3 = []
engineers4 = []
engineers5 = []
actors1= []
actors2= []
actors3= []
actors4= []
actors5= []
writers1 = []
writers2= []
writers3= []
writers4 = []
writers5 = []
doctors1= []
doctors2= []
doctors3= []
doctors4= []
doctors5= []
drivers1=[]
drivers2=[]
drivers3=[]
drivers4=[]
drivers5=[]
teachers1=[]
teachers2=[]
teachers3=[]
teachers4=[]
teachers5=[]
nodes=[]
g=nx.Graph()

for i in range(0,4038):
    g.add_node(i)

with open("asd1.csv",'r') as csv_file:
    csv_reader=csv.DictReader(csv_file)

    for line in csv_reader:
        g.add_edge(line['first'],line['second'])

csv_file.close()

with open("asd1.csv",'r') as csv_file:
    csv_reader=csv.DictReader(csv_file)
    for line in csv_reader:
        if (line['profession'] == 'actor' and line['age'] >= '13' and line['age'] <= '17'):
            actors1.append(line['name'])
        if (line['profession'] == 'actor' and line['age'] >= '18' and line['age'] <= '29'):
            actors2.append(line['name'])
        if (line['profession'] == 'actor' and line['age'] >= '30' and line['age'] <= '49'):
            actors3.append(line['name'])
        if (line['profession'] == 'actor' and line['age'] >= '50' and line['age'] <= '64'):
            actors4.append(line['name'])
        if (line['profession'] == 'actor' and line['age'] >= '65'):
            actors5.append(line['name'])

        if (line['profession'] == 'eng' and line['age'] >= '13' and line['age'] <= '17'):
            engineers1.append(line['name'])
        if (line['profession'] == 'eng' and line['age'] >= '18' and line['age'] <= '29'):
            engineers2.append(line['name'])
        if (line['profession'] == 'eng' and line['age'] >= '30' and line['age'] <= '49'):
            engineers3.append(line['name'])
        if (line['profession'] == 'eng' and line['age'] >= '50' and line['age'] <= '64'):
            engineers4.append(line['name'])
        if (line['profession'] == 'eng' and line['age'] >= '65'):
            engineers5.append(line['name'])

        if (line['profession'] == 'teacher' and line['age'] >= '13' and line['age'] <= '17'):
            teachers1.append(line['name'])
        if (line['profession'] == 'teacher' and line['age'] >= '18' and line['age'] <= '29'):
            teachers2.append(line['name'])
        if (line['profession'] == 'teacher' and line['age'] >= '30' and line['age'] <= '49'):
            teachers3.append(line['name'])
        if (line['profession'] == 'teacher' and line['age'] >= '50' and line['age'] <= '64'):
            teachers4.append(line['name'])
        if (line['profession'] == 'teacher' and line['age'] >= '65'):
            teachers5.append(line['name'])

        if (line['profession'] == 'driver' and line['age'] >= '13' and line['age'] <= '17'):
            drivers1.append(line['name'])
        if (line['profession'] == 'driver' and line['age'] >= '18' and line['age'] <= '29'):
            drivers2.append(line['name'])
        if (line['profession'] == 'driver' and line['age'] >= '30' and line['age'] <= '49'):
            drivers3.append(line['name'])
        if (line['profession'] == 'driver' and line['age'] >= '50' and line['age'] <= '64'):
            doctors4.append(line['name'])
        if (line['profession'] == 'driver' and line['age'] >= '65'):
            drivers5.append(line['name'])

        if (line['profession'] == 'doctor' and line['age'] >= '13' and line['age'] <= '17'):
            doctors1.append(line['name'])
        if (line['profession'] == 'doctor' and line['age'] >= '18' and line['age'] <= '29'):
            doctors2.append(line['name'])
        if (line['profession'] == 'doctor' and line['age'] >= '30' and line['age'] <= '49'):
            doctors3.append(line['name'])
        if (line['profession'] == 'doctor' and line['age'] >= '50' and line['age'] <= '64'):
            drivers4.append(line['name'])
        if (line['profession'] == 'doctor' and line['age'] >= '65'):
            doctors5.append(line['name'])

csv_file.close()

print("actors having age between 13 and 17: ",actors1) 
print("actors having age between 18 and 29: ",actors2)
print("actors having age between 30 and 49: ",actors3) 
print("actors having age between 50 and 64: ",actors4)
print("actors having age 65 and above: ",actors5)
print('\n')

print("engineers having age between 13 and 17: ",engineers1)
print("engineers having age between 18 and 29: ",engineers2)
print("engineers having age between 30 and 49: ",engineers3)
print("engineers having age between 50 and 64: ",engineers4)
print("engineers having age 65 and above: ",engineers5)
print('\n')

print("teachers having age between 13 and 17: ",teachers1)
print("teachers having age between 18 and 29: ",teachers2)
print("teachers having age between 30 and 49: ",teachers3)
print("teachers having age between 50 and 64: ",teachers4)
print("teachers having age 65 and above: ",teachers5)
print('\n')

print("drivers having age between 13 and 17: ",drivers1)
print("drivers having age between 18 and 29: ",drivers2)
print("drivers having age between 30 and 49: ",drivers3)
print("drivers having age between 50 and 64: ",drivers4)
print("drivers having age 65 and above: ",drivers5)
print('\n')

print("doctors having age between 13 and 17: ",doctors1)
print("doctors having age between 18 and 29: ",doctors2)
print("doctors having age between 30 and 49: ",doctors3)
print("doctors having age between 50 and 64: ",doctors4)
print("doctors having age 65 and above: ",doctors5)
print('\n')

for i in range(0,4038):
    g.node[i]['community']=0

for x1 in actors1:
    g.node[x1]['community']=0
for x2 in actors2:
    g.node[x2]['community']=1 
for x3 in actors3:
    g.node[x3]['community']=2
for x4 in actors4:
    g.node[x4]['community']=3
for x5 in actors5:
    g.node[x5]['community']=4
for x6 in engineers1:
    g.node[x6]['community']=5
for x7 in engineers2:
    g.node[x7]['community']=6
for x8 in engineers3:
    g.node[x8]['community']=7
for x9 in engineers4:
    g.node[x9]['community']=8
for x10 in engineers5:
    g.node[x10]['community']=9
for x11 in teachers1:
    g.node[x11]['community']=10
for x12 in teachers2:
    g.node[x12]['community']=11
for x13 in teachers3:
    g.node[x13]['community']=12
for x14 in teachers4:
    g.node[x14]['community']=13
for x15 in teachers5:
    g.node[x15]['community']=14
for x16 in drivers1:
    g.node[x16]['community']=15
for x17 in drivers2:
    g.node[x17]['community']=16
for x18 in drivers3:
    g.node[x18]['community']=17
for x19 in drivers4:
    g.node[x19]['community']=18
for x20 in drivers5:
    g.node[x20]['community']=19
for x21 in doctors1:
    g.node[x21]['community']=20
for x22 in doctors2:
   g.node[x22]['community']=21
for x23 in doctors3:
    g.node[x23]['community']=22
for x24 in doctors4:
    g.node[x24]['community']=23
for x25 in doctors5:
    g.node[x25]['community']=24

print(g.nodes())
l=list(nx.cn_soundarajan_hopcroft(g))
print(l)

Prologue

I highly recommend that you read any good programming book that explains algorithms. Your problem can be solved in literally a few lines of code.

Act 1

Look at your problem. You have several professions, several age clusters, and names as unique identifiers, and you want to tell the groups apart. Now look at your code. To solve the problem, you create a separate list for every age-profession combination. That is about the least maintainable structure possible. If you have to add another five professions (and there are thousands of professions), you will literally have to double your code. Moreover, it is easy to make an error while copy-pasting: a stray merchandiser3 in place of merchandiser4 can turn your next hour or two into red-eyed hell. Look, you already have an error in your code!

if (line['profession'] == 'doctor' and line['age'] >= '13' and line['age'] <= '17'):
    doctors1.append(line['name'])
if (line['profession'] == 'doctor' and line['age'] >= '18' and line['age'] <= '29'):
    doctors2.append(line['name'])
if (line['profession'] == 'doctor' and line['age'] >= '30' and line['age'] <= '49'):
    doctors3.append(line['name'])
if (line['profession'] == 'doctor' and line['age'] >= '50' and line['age'] <= '64'):
    # Hello, guys! I am ready to torture his brain and eyes for hours!!
    drivers4.append(line['name'])
if (line['profession'] == 'doctor' and line['age'] >= '65'):
    doctors5.append(line['name'])

And, as a final shot in the head, you don't really need all these lists. You could create, for example, a dict for each profession. Or something else. But notice that your data follows the same pattern for every person: name, age, profession... Wait, where did we get the data from? A CSV file? And what is a CSV file?

Yes.

A table.
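Before reaching for a table library, here is one possible shape for the "dict instead of twenty-five lists" idea, as a minimal sketch using only the standard library. The sample rows, the BUCKETS table and the helper names age_bucket and group_names are invented for illustration; only the column names name, profession and age come from the question's CSV.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the real CSV file.
SAMPLE = """name,profession,age
Ann,doctor,55
Bob,driver,52
Cid,doctor,16
Dee,actor,70
"""

# Lower bound of each bucket, checked from oldest to youngest.
BUCKETS = [(65, '>64'), (50, '50-64'), (30, '30-49'),
           (18, '18-29'), (13, '13-17'), (0, '<13')]

def age_bucket(age):
    """Return the bucket label for a numeric (or numeric-string) age."""
    a = int(age)
    for lower, label in BUCKETS:
        if a >= lower:
            return label
    raise ValueError(f"negative age: {age!r}")

def group_names(csv_text):
    """Map (profession, age bucket) -> list of names."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row['profession'], age_bucket(row['age']))
        groups[key].append(row['name'])
    return dict(groups)

groups = group_names(SAMPLE)
for key, names in sorted(groups.items()):
    print(key, names)
```

With one dict keyed by (profession, bucket), adding a profession needs no new code, and there is no drivers4/doctors4 pair to mix up.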

Act 2

If you read data from a table, it is a good idea to store that data in a table! (Well, most of the time...) Python has an amazing library for tables: Pandas. All your hundreds of lines can be reduced to one or two dozen! Now watch my hands closely, the magic begins...

Zero. We import Pandas:

import pandas as pd

First. We create a separate function for age clustering. If our Big Boss tells us to handle 11-year-old neuroscientists, we will be fully prepared:

def get_age_cluster(age):
    a = int(age)
    if 0 <= a <= 12:
        return '<13'
    if 13 <= a <= 17:
        return '13-17'
    if 18 <= a <= 29:
        return '18-29'
    if 30 <= a <= 49:
        return '30-49'
    if 50 <= a <= 64:
        return '50-64'
    if a >= 65:
        return '>64'
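The int(age) conversion in that function matters. csv.DictReader, as used in the question, yields every field as a string, and Python compares strings character by character, so a check like line['age'] >= '13' is not a numeric comparison. A small illustration in plain Python (the two helper names are made up for this sketch):

```python
def in_range_as_text(age, low, high):
    """The question's style of check: all three values stay strings."""
    return low <= age <= high

def in_range_as_number(age, low, high):
    """The same check after converting everything to int."""
    return int(low) <= int(age) <= int(high)

# A 2-year-old lands in the 18-29 bucket when compared as text,
# because '18' <= '2' (since '1' < '2') and '2' <= '29' (prefix).
print(in_range_as_text('2', '18', '29'))    # True
print(in_range_as_number('2', '18', '29'))  # False

# The open-ended "65 and above" check goes wrong in both directions.
print('7' >= '65')     # True:  '7' > '6'
print('100' >= '65')   # False: '1' < '6'
```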

Second. We read the CSV. You are doing it manually, line by line, handling each possible combination... Why?! It is a common operation! People wrote it a long time ago! Be lazy!

(This is advice from my old sensei that I have kept in my heart for years! Just kidding. I have no heart.)

df=pd.read_csv('TF.csv')

Yes, that is all. Yes. Really. One line. Twenty-four characters (remember this number!). Now let's make friends with our ten little cuties:

==============================

We just loaded the CSV, but we didn't transform the age column. It contains ages, but it should contain clusters. Not a problem!

df['age'] = df['age'].apply(get_age_cluster)

Done! You can apply any transform function to the rows or columns of the table. So we don't need to sort ages and sort ages and sotr ages and sort aegs and... We can just write a beautiful one-liner. Here is the result:

================================
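As an aside, pandas also ships a binning helper, pd.cut, that could stand in for the hand-written function. The sketch below assumes the same bucket boundaries as get_age_cluster; the sample ages are invented, and the result is a categorical column, so it is converted to plain strings before printing.

```python
import pandas as pd

ages = pd.Series([5, 13, 17, 18, 29, 30, 49, 50, 64, 65, 90])

# Left-closed bins: [0, 13), [13, 18), [18, 30), [30, 50), [50, 65), [65, inf)
bins = [0, 13, 18, 30, 50, 65, float('inf')]
labels = ['<13', '13-17', '18-29', '30-49', '50-64', '>64']

# right=False makes each bin include its left edge and exclude its right edge.
clusters = pd.cut(ages, bins=bins, labels=labels, right=False)
print(clusters.astype(str).tolist())
```

Whether this is nicer than apply with a small function is a matter of taste; the function is easier to extend with special cases, while the bins list keeps all boundaries in one place.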

Notice that we have some garbage columns. Not a problem!

df = df.drop('waka', axis=1)
df = df.drop('we_dont_need_this_column', axis=1)
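If you prefer a single call, drop also accepts a list through its columns= keyword. A tiny sketch with an invented one-row frame that reuses the two throwaway column names:

```python
import pandas as pd

# Invented frame containing the two columns we want to get rid of.
df = pd.DataFrame({
    'name': ['Ann'],
    'waka': [1],
    'we_dont_need_this_column': [2],
})

# columns= takes a list, so both columns are removed in one call.
df = df.drop(columns=['waka', 'we_dont_need_this_column'])
print(list(df.columns))
```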

And we have a small, beautiful table:

========================

Now to the main task: get all names for every profession and age. Pandas has many grouping features. Let's use the simplest:

grouped = df.groupby(['profession', 'age'])
for group in grouped.groups:
    print(group, list(grouped.get_group(group)['name']))

We build a grouped structure keyed by profession and age, grouped = df.groupby(['profession', 'age']), and for every group in that structure, for group in grouped.groups:, we print the list of the 'name' column in that group, list(grouped.get_group(group)['name']). And here is the result:

('eng', '30-49') ['Cthulhu']
('driver', '18-29') ['John Doe 3']
('actor', '13-17') ['John Doe 4']
('actor', '18-29') ['Yog-Sothoth']
('teacher', '18-29') ['John Doe 2', 'Shub-Niggurath']
('eng', '>64') ['Fblthp the Lost']
('driver', '<13') ['Azathoth']
('doctor', '18-29') ['Nyarlathotep']
('doctor', '30-49') ['John Doe 1']
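The same kind of listing can also be produced by iterating over the GroupBy object directly, which yields (key, sub-frame) pairs and avoids the get_group lookup. A minimal sketch on an invented four-row frame (the names and rows are placeholders, not the data behind the output above):

```python
import pandas as pd

# Invented rows using the column names from the answer.
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'profession': ['doctor', 'doctor', 'driver', 'doctor'],
    'age': ['50-64', '50-64', '18-29', '13-17'],
})

# Grouping by two columns gives tuple keys: (profession, age).
names_by_group = {
    key: frame['name'].tolist()
    for key, frame in df.groupby(['profession', 'age'])
}
for key, names in names_by_group.items():
    print(key, names)
```

Keeping the result in a dict is handy later, when each group has to become a community in the graph.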

And here is the whole code:

import pandas as pd

def get_age_cluster(age):
    a = int(age)
    if 0 <= a <= 12:
        return '<13'
    if 13 <= a <= 17:
        return '13-17'
    if 18 <= a <= 29:
        return '18-29'
    if 30 <= a <= 49:
        return '30-49'
    if 50 <= a <= 64:
        return '50-64'
    if a >= 65:
        return '>64'

df=pd.read_csv('TF.csv')
df['age'] = df['age'].apply(get_age_cluster)
df = df.drop('waka', axis=1)
df = df.drop('we_dont_need_this_column', axis=1)
grouped = df.groupby(['profession', 'age'])
for group in grouped.groups:
    print(group, list(grouped.get_group(group)['name']))

Twenty-four lines. I think we can call ourselves the Fantastic Twenty-Four now. It is like the Fantastic Four, but Fantastic Twenty-Four. But our Graph Doom is still alive...

Act 3

We created the table, transformed it, and grouped and filtered it. But you have another problem: The Graph. And this problem is harder than the first.

You are reading nodes (humans) and edges (I don't know what exactly. Relations?) from the same file. That forces a strong limitation on your graph: each row describes one person and one edge, so the number of nodes equals the number of edges. That is a very rare case. I think something went wrong before you even started writing this script. I recommend keeping nodes and edges in different files (or, at least, in different sections of one file). But! Let's suppose you are doing exactly what you want and every human (and Cthulhu too, of course!) has only one edge. In that case, after import networkx as nx, we can construct our graph with only two lines of code:

G = nx.Graph()
G.add_edges_from(df[['first', 'second']].values)

Bingo! We are done. Now let's get to this strange, complicated thing.

Set the community of each node (NOTE THAT THE ALGORITHM NEEDS IT):

for n in G.nodes:
    G.nodes[n]['community'] = 0
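Setting every community to 0 is enough to make the algorithm run, but the question wanted one community per profession and age group. The sketch below is one way that could look, assuming a separate node table and edge list; all names, rows and the community_ids mapping are invented. It also uses the same key (the name) everywhere, whereas the question adds integers 0..4037 as nodes, takes edge endpoints as strings from csv.DictReader, and then indexes by name, so those three sets of keys need not match, which may be behind the reported error.

```python
import networkx as nx

# Invented node table: name -> (profession, age cluster).
people = {
    'Ann': ('doctor', '50-64'),
    'Bob': ('doctor', '50-64'),
    'Cid': ('driver', '18-29'),
    'Dee': ('actor', '>64'),
}
# Invented edge list, kept separate from the node table.
edges = [('Ann', 'Bob'), ('Bob', 'Cid'), ('Cid', 'Dee')]

G = nx.Graph()
G.add_nodes_from(people)   # iterating a dict yields its keys: the names
G.add_edges_from(edges)

# Give every distinct (profession, age cluster) pair its own integer id.
community_ids = {
    group: i for i, group in enumerate(sorted(set(people.values())))
}
for name, group in people.items():
    G.nodes[name]['community'] = community_ids[group]

print(nx.get_node_attributes(G, 'community'))
```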

And calculate THIS:

csh = nx.cn_soundarajan_hopcroft(G)

And we get an iterator. Convert it to a list and get the result:

[(1, 8, 2),
 (1, 9, 0),
 (1, 2, 4),
 (1, 4, 0),
 (1, 6, 2),
 (2, 8, 2),
 (2, 9, 2),
 (2, 5, 0),
 (2, 6, 2),
 (3, 9, 0),
 (3, 4, 2),
 (3, 5, 2),
 (3, 6, 0),
 (3, 7, 4),
 (4, 8, 0),
 (4, 5, 2),
 (4, 7, 2),
 (5, 8, 0),
 (5, 9, 0),
 (5, 7, 2),
 (6, 8, 0),
 (6, 9, 2),
 (6, 7, 0),
 (7, 8, 0),
 (7, 9, 0),
 (8, 9, 0)]
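Note that every score in that list is even. As far as the networkx documentation describes it, cn_soundarajan_hopcroft scores a non-adjacent pair as the number of common neighbours plus the number of those common neighbours that share a community with both endpoints. With every node in community 0 the bonus always applies, so each score is simply twice the common-neighbour count. The toy graph and the scores helper below are invented to show what changes once communities differ:

```python
import networkx as nx

# Star graph: node 1 is the only neighbour of nodes 0, 2 and 3.
G = nx.Graph([(0, 1), (1, 2), (1, 3)])

def scores(graph, communities):
    """Return {(u, v): score} for all non-adjacent pairs of `graph`."""
    nx.set_node_attributes(graph, values=communities, name='community')
    return {
        tuple(sorted((u, v))): s
        for u, v, s in nx.cn_soundarajan_hopcroft(graph)
    }

# Everyone in one community: 1 common neighbour + 1 bonus = 2 for each pair.
same = scores(G, {0: 0, 1: 0, 2: 0, 3: 0})
# Node 3 moved to its own community: pairs involving it lose the bonus.
mixed = scores(G, {0: 0, 1: 0, 2: 0, 3: 1})

print(same)
print(mixed)
```

So the community attribute only starts to influence the ranking when it actually differs between nodes.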

Grand Finale

I hope you like the little musical piece I wrote for you :) I recommend that you read a good Python programming book and a good algorithms book. Good luck!
