
Error in creating graph and performing link prediction using networkx in python

I am trying to build a graph from a CSV file that contains the edges along with the profession and age of each node. I assign a community to each node and then perform link prediction.

import networkx as nx
import csv
engineers1 = []
engineers2 = []
engineers3 = []
engineers4 = []
engineers5 = []
actors1= []
actors2= []
actors3= []
actors4= []
actors5= []
writers1 = []
writers2= []
writers3= []
writers4 = []
writers5 = []
doctors1= []
doctors2= []
doctors3= []
doctors4= []
doctors5= []
drivers1=[]
drivers2=[]
drivers3=[]
drivers4=[]
drivers5=[]
teachers1=[]
teachers2=[]
teachers3=[]
teachers4=[]
teachers5=[]
nodes=[]
g=nx.Graph()

for i in range(0,4038):
    g.add_node(i)

with open("asd1.csv",'r') as csv_file:
    csv_reader=csv.DictReader(csv_file)

    for line in csv_reader:
        g.add_edge(line['first'],line['second'])

csv_file.close()

with open("asd1.csv",'r') as csv_file:
    csv_reader=csv.DictReader(csv_file)
    for line in csv_reader:
         if (line['profession'] == 'actor' and line['age'] >= '13' and 
line['age'] <= '17'):
            actors1.append(line['name'])
        if (line['profession'] == 'actor' and line['age'] >= '18' and 
line['age'] <= '29'):
          actors2.append(line['name'])
        if (line['profession'] == 'actor' and line['age'] >= '30' and 
line['age'] <= '49'):
        actors3.append(line['name'])
    if (line['profession'] == 'actor' and line['age'] >= '50' and line['age'] <= '64'):
        actors4.append(line['name'])
    if (line['profession'] == 'actor' and line['age'] >= '65'):
        actors5.append(line['name'])

    if (line['profession'] == 'eng' and line['age'] >= '13' and line['age'] <= '17'):
        engineers1.append(line['name'])
    if (line['profession'] == 'eng' and line['age'] >= '18' and line['age'] <= '29'):
        engineers2.append(line['name'])
    if (line['profession'] == 'eng' and line['age'] >= '30' and line['age'] <= '49'):
        engineers3.append(line['name'])
    if (line['profession'] == 'eng' and line['age'] >= '50' and line['age'] <= '64'):
        engineers4.append(line['name'])
    if (line['profession'] == 'eng' and line['age'] >= '65'):
        engineers5.append(line['name'])

    if (line['profession'] == 'teacher' and line['age'] >= '13' and line['age'] <= '17'):
        teachers1.append(line['name'])
    if (line['profession'] == 'teacher' and line['age'] >= '18' and line['age'] <= '29'):
        teachers2.append(line['name'])
    if (line['profession'] == 'teacher' and line['age'] >= '30' and line['age'] <= '49'):
        teachers3.append(line['name'])
    if (line['profession'] == 'teacher' and line['age'] >= '50' and line['age'] <= '64'):
        teachers4.append(line['name'])
    if (line['profession'] == 'teacher' and line['age'] >= '65'):
        teachers5.append(line['name'])

    if (line['profession'] == 'driver' and line['age'] >= '13' and line['age'] <= '17'):
        drivers1.append(line['name'])
    if (line['profession'] == 'driver' and line['age'] >= '18' and line['age'] <= '29'):
        drivers2.append(line['name'])
    if (line['profession'] == 'driver' and line['age'] >= '30' and line['age'] <= '49'):
        drivers3.append(line['name'])
    if (line['profession'] == 'driver' and line['age'] >= '50' and line['age'] <= '64'):
        doctors4.append(line['name'])
    if (line['profession'] == 'driver' and line['age'] >= '65'):
        drivers5.append(line['name'])

    if (line['profession'] == 'doctor' and line['age'] >= '13' and line['age'] <= '17'):
        doctors1.append(line['name'])
    if (line['profession'] == 'doctor' and line['age'] >= '18' and line['age'] <= '29'):
        doctors2.append(line['name'])
    if (line['profession'] == 'doctor' and line['age'] >= '30' and line['age'] <= '49'):
        doctors3.append(line['name'])
    if (line['profession'] == 'doctor' and line['age'] >= '50' and line['age'] <= '64'):
        drivers4.append(line['name'])
    if (line['profession'] == 'doctor' and line['age'] >= '65'):
        doctors5.append(line['name'])

csv_file.close()

print("actors having age between 13 and 17: ",actors1) 
print("actors having age between 18 and 29: ",actors2)
print("actors having age between 30 and 49: ",actors3) 
print("actors having age between 50 and 64: ",actors4)
print("actors having age 65 and above: ",actors5)
print('\n')

print("engineers having age between 13 and 17: ",engineers1)
print("engineers having age between 18 and 29: ",engineers2)
print("engineers having age between 30 and 49: ",engineers3)
print("engineers having age between 50 and 64: ",engineers4)
print("engineers having age 65 and above: ",engineers5)
print('\n')

print("teachers having age between 13 and 17: ",teachers1)
print("teachers having age between 18 and 29: ",teachers2)
print("teachers having age between 30 and 49: ",teachers3)
print("teachers having age between 50 and 64: ",teachers4)
print("teachers having age 65 and above: ",teachers5)
print('\n')

print("drivers having age between 13 and 17: ",drivers1)
print("drivers having age between 18 and 29: ",drivers2)
print("drivers having age between 30 and 49: ",drivers3)
print("drivers having age between 50 and 64: ",drivers4)
print("drivers having age 65 and above: ",drivers5)
print('\n')

print("doctors having age between 13 and 17: ",doctors1)
print("doctors having age between 18 and 29: ",doctors2)
print("doctors having age between 30 and 49: ",doctors3)
print("doctors having age between 50 and 64: ",doctors4)
print("doctors having age 65 and above: ",doctors5)
print('\n')

for i in range(0,4038):
    g.node[i]['community']=0

for x1 in actors1:
    g.node[x1]['community']=0
for x2 in actors2:
    g.node[x2]['community']=1 
for x3 in actors3:
    g.node[x3]['community']=2
for x4 in actors4:
    g.node[x4]['community']=3
for x5 in actors5:
    g.node[x5]['community']=4
for x6 in engineers1:
    g.node[x6]['community']=5
for x7 in engineers2:
    g.node[x7]['community']=6
for x8 in engineers3:
    g.node[x8]['community']=7
for x9 in engineers4:
    g.node[x9]['community']=8
for x10 in engineers5:
    g.node[x10]['community']=9
for x11 in teachers1:
    g.node[x11]['community']=10
for x12 in teachers2:
    g.node[x12]['community']=11
for x13 in teachers3:
    g.node[x13]['community']=12
for x14 in teachers4:
    g.node[x14]['community']=13
for x15 in teachers5:
    g.node[x15]['community']=14
for x16 in drivers1:
    g.node[x16]['community']=15
for x17 in drivers2:
    g.node[x17]['community']=16
for x18 in drivers3:
    g.node[x18]['community']=17
for x19 in drivers4:
    g.node[x19]['community']=18
for x20 in drivers5:
    g.node[x20]['community']=19
for x21 in doctors1:
    g.node[x21]['community']=20
for x22 in doctors2:
   g.node[x22]['community']=21
for x23 in doctors3:
    g.node[x23]['community']=22
for x24 in doctors4:
    g.node[x24]['community']=23
for x25 in doctors5:
    g.node[x25]['community']=24

print(g.nodes())
l=list(nx.cn_soundarajan_hopcroft(g))
print(l)

Prologue

I highly recommend that you read ANY good programming book that explains algorithms. Your problem can be solved with literally a few lines of code.

Act 1

Look at your problem. You have several professions, several age clusters, and names as unique identifiers, and you want to tell the groups apart. Now look at your code. To solve the problem, you create a separate list for every age-profession combination. It is the least modifiable structure you could have created. If you have to add another five professions (there are thousands of professions), you will literally have to double your code. Moreover, it is easy to make an error while copy-pasting. A stray merchandiser3 in place of merchandiser4 can turn your next hour or two into red-eyed hell. Look, you already have an error in your code!

if (line['profession'] == 'doctor' and line['age'] >= '13' and line['age'] <= '17'):
    doctors1.append(line['name'])
if (line['profession'] == 'doctor' and line['age'] >= '18' and line['age'] <= '29'):
    doctors2.append(line['name'])
if (line['profession'] == 'doctor' and line['age'] >= '30' and line['age'] <= '49'):
    doctors3.append(line['name'])
if (line['profession'] == 'doctor' and line['age'] >= '50' and line['age'] <= '64'):
    # Hello, guys! I am ready to torture his brain and eyes for hours!!
    drivers4.append(line['name'])
if (line['profession'] == 'doctor' and line['age'] >= '65'):
    doctors5.append(line['name'])

And, as a final shot in the head, you don't really need all these lists. You could create, for example, a dict for each profession. Or something else. But note that your data has a very regular pattern for every human. Name, age, profession... Wait, where did we get the data from? A CSV file? And what is a CSV file?

Yes.

Table.

Act 2

If you read data from a table, it is a good idea to store this data in a table! (Well, most of the time...) Python has an amazing library for tables: Pandas. All your hundreds of lines can be reduced to one or two dozen! Now watch my hands closely, the magic begins...

Zero. We import Pandas:

import pandas as pd

First. We create a separate function for age clustering. If our Big Boss tells us to handle 11-year-old neuroscientists, we will be fully prepared:

def get_age_cluster(age):
    a = int(age)
    if a >= 0 and a <= 12:
        return '<13'
    if a >= 13 and a <= 17:
        return '13-17'
    if a >= 18 and a <= 29:
        return '18-29'
    if a >= 30 and a <= 49:
        return '30-49'
    if a >= 50 and a <= 64:
        return '50-64'
    elif a >= 65:
        return '>64'
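The int(age) call in this function matters. csv.DictReader returns every field as a string, so a condition such as line['age'] >= '65' in the question compares text character by character, not numbers. A minimal sketch of the difference (the sample ages are made up for illustration):

```python
# csv.DictReader yields strings, so these stand in for values read from the file.
ages = ['9', '70', '100']

# Text comparison: '9' sorts after '65', while '100' sorts before it.
as_text = [a for a in ages if a >= '65']

# Numeric comparison after conversion, which is what get_age_cluster does.
as_numbers = [a for a in ages if int(a) >= 65]

print(as_text)      # ['9', '70']
print(as_numbers)   # ['70', '100']
```

With string comparison a 9-year-old lands in the 65+ list and a 100-year-old does not, so converting before comparing is the safer habit.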

Second. We read the CSV. You are doing it manually, line by line, processing each possible combination... Why?! It is a common operation! People wrote it a long time ago! Be lazy!

(It is the advice from my old sensei that I have kept in my heart for years! Joke. I have no heart.)

df=pd.read_csv('TF.csv')

Yes, that is all. Yes. Really. One line. Twenty-four characters (remember this number!!). Now let's become friends with our ten little cuties:

==============================

We just loaded the CSV, but we didn't transform the age column. It contains ages, but should contain clusters. Not a problem!

df['age'] = df['age'].apply(get_age_cluster)
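As a possible alternative to a hand-written function, pandas can also bin a numeric column directly with pd.cut. The sketch below assumes whole-number ages and uses a small made-up DataFrame in place of the real file. The column name age_cluster is only illustrative.

```python
import pandas as pd

# Hypothetical sample; the real data would come from the CSV file.
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'age': [12, 13, 64, 65]})

# Right-closed bins: (-1, 12], (12, 17], (17, 29], (29, 49], (49, 64], (64, inf)
bins = [-1, 12, 17, 29, 49, 64, float('inf')]
labels = ['<13', '13-17', '18-29', '30-49', '50-64', '>64']

# pd.cut returns a categorical column; astype(str) turns it into plain strings.
df['age_cluster'] = pd.cut(df['age'], bins=bins, labels=labels).astype(str)
print(df)
```

The boundaries live in one list, so adding or moving an age range means editing a single line.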

Done! You can apply any transform functions to the rows or columns in the table. So we don't need to sort ages and sort ages and sotr ages and sort aegs and... We can just write a beautiful one-liner. Here is the result:

================================

You can note that we have some garbage columns. Not a problem!

df = df.drop('waka', axis=1)
df = df.drop('we_dont_need_this_column', axis=1)
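Two other ways to get rid of unwanted columns, sketched under the assumption that the file has the columns named here (the CSV text is a made-up stand-in for the real file): drop several columns in one call, or skip them at load time with usecols.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file.
csv_text = (
    "name,age,profession,waka,we_dont_need_this_column\n"
    "Ann,25,teacher,x,y\n"
)

# Option 1: drop several columns in a single call.
df = pd.read_csv(io.StringIO(csv_text))
df = df.drop(columns=['waka', 'we_dont_need_this_column'])

# Option 2: never load them in the first place.
df2 = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'age', 'profession'])

print(list(df.columns))
print(list(df2.columns))
```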

And we have a small beautiful table:

========================

Now to the main task: get all names for every profession and age. Pandas has many grouping features. Let's use the simplest:

grouped = df.groupby(['profession', 'age'])
for group in grouped.groups:
    print(group, list(grouped.get_group(group)['name']))

We build a grouped structure with one group per profession-age pair (grouped = df.groupby(['profession', 'age'])). Then, for every group in this structure (for group in grouped.groups:), we print the group key together with the list of values in the 'name' column of that group (list(grouped.get_group(group)['name'])). And here is the result:

('eng', '30-49') ['Cthulhu']
('driver', '18-29') ['John Doe 3']
('actor', '13-17') ['John Doe 4']
('actor', '18-29') ['Yog-Sothoth']
('teacher', '18-29') ['John Doe 2', 'Shub-Niggurath']
('eng', '>64') ['Fblthp the Lost']
('driver', '<13') ['Azathoth']
('doctor', '18-29') ['Nyarlathotep']
('doctor', '30-49') ['John Doe 1']
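If the names are needed later and not only printed, the same grouping can be collected into a plain dict keyed by (profession, age) tuples. This is a sketch on made-up rows. The variable name names_by_group is illustrative only.

```python
import pandas as pd

# Hypothetical rows with the age column already converted to clusters.
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'profession': ['teacher', 'teacher', 'eng', 'doctor'],
    'age': ['18-29', '18-29', '30-49', '18-29'],
})

# One list of names per (profession, age) pair, collected into a dict.
names_by_group = df.groupby(['profession', 'age'])['name'].apply(list).to_dict()

for key, names in names_by_group.items():
    print(key, names)
```

A lookup such as names_by_group[('teacher', '18-29')] then replaces a variable like teachers2 from the question.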

And here is the whole code:

import pandas as pd

def get_age_cluster(age):
    a = int(age)
    if a >= 0 and a <= 12:
        return '<13'
    if a >= 13 and a <= 17:
        return '13-17'
    if a >= 18 and a <= 29:
        return '18-29'
    if a >= 30 and a <= 49:
        return '30-49'
    if a >= 50 and a <= 64:
        return '50-64'
    elif a >= 65:
        return '>64'

df=pd.read_csv('TF.csv')
df['age'] = df['age'].apply(get_age_cluster)
df = df.drop('waka', axis=1)
df = df.drop('we_dont_need_this_column', axis=1)
grouped = df.groupby(['profession', 'age'])
for group in grouped.groups:
    print(group, list(grouped.get_group(group)['name']))

Twenty-four lines. I think we can call ourselves the Fantastic Twenty-Four now. It is like the Fantastic Four, but Fantastic Twenty-Four. But our Graph Doom is still alive...

Act 3

We created the table, did some transformations, sorted and filtered it. But you have another problem - The Graph. And this problem is harder than the first.

You are reading nodes (humans) and edges (I don't know what exactly. Relations?) from the same file. This forces a strong limitation on your graph: the number of nodes is equal to the number of edges. That is a very rare case. I think something went wrong before you even started writing this script. I recommend having different files (or, at least, different sections in one file) for nodes and edges. But! Let's suppose that you are doing exactly what you want and every human (and Cthulhu too, of course!) has only one edge. In this case we can construct our graph with only two lines of code:

G = nx.Graph()
G.add_edges_from(df[['first', 'second']].values)

Bingo! We are done. Now let's get this strange complicated thing:

Set the community of each node (NOTE THAT YOU NEED IT FOR THE ALGORITHM):

for n in G.nodes:
    G.nodes[n]['community'] = 0
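One thing worth checking in the question's code, as a possible source of the error: the nodes are first added as the integers 0 to 4037, but csv.DictReader returns strings, so g.add_edge(line['first'], line['second']) creates separate string nodes. Any node left without a 'community' attribute makes the algorithm fail. A minimal sketch of that situation, on a made-up three-node graph:

```python
import networkx as nx

G = nx.Graph()
G.add_node(1)           # integer node, as added by range(0, 4038)
G.add_edge('1', '2')    # string nodes, as read by csv.DictReader

# 1 and '1' are different nodes, so the graph now holds three of them.
print(sorted(G.nodes, key=str))

# Only the integer node gets a community here.
G.nodes[1]['community'] = 0

try:
    scores = list(nx.cn_soundarajan_hopcroft(G))
    failed = False
except nx.NetworkXAlgorithmError as err:
    failed = True
    print('link prediction failed:', err)
```

If the IDs in the file are numeric (an assumption about the data), converting them with int() before calling add_edge keeps both sets of nodes identical. Also note that the question writes g.node[...]. As far as I know, recent NetworkX releases no longer provide that attribute, and G.nodes[...] as used above is the accessor to rely on.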

And calculate THIS:

csh = nx.cn_soundarajan_hopcroft(G)

And we get an iterator of (u, v, score) tuples. Convert it to a list and get the result:

[(1, 8, 2),
 (1, 9, 0),
 (1, 2, 4),
 (1, 4, 0),
 (1, 6, 2),
 (2, 8, 2),
 (2, 9, 2),
 (2, 5, 0),
 (2, 6, 2),
 (3, 9, 0),
 (3, 4, 2),
 (3, 5, 2),
 (3, 6, 0),
 (3, 7, 4),
 (4, 8, 0),
 (4, 5, 2),
 (4, 7, 2),
 (5, 8, 0),
 (5, 9, 0),
 (5, 7, 2),
 (6, 8, 0),
 (6, 9, 2),
 (6, 7, 0),
 (7, 8, 0),
 (7, 9, 0),
 (8, 9, 0)]
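With every node in community 0, as above, each score is simply twice the number of common neighbors. To get closer to what the question attempts, each (profession, age cluster) pair can be mapped to its own integer community. The sketch below keeps the node table and the edge list separate, as recommended earlier. All names, groups, and edges in it are made up for illustration.

```python
import networkx as nx

# Hypothetical node table: name -> (profession, age cluster).
people = {
    'Ann': ('teacher', '18-29'),
    'Bob': ('teacher', '18-29'),
    'Cid': ('eng', '30-49'),
    'Dee': ('teacher', '18-29'),
}
# Hypothetical edge list, kept separate from the node table.
edges = [('Ann', 'Bob'), ('Ann', 'Cid'), ('Bob', 'Dee'), ('Cid', 'Dee')]

G = nx.Graph()
G.add_nodes_from(people)
G.add_edges_from(edges)

# Give every distinct (profession, age cluster) pair its own integer id.
community_ids = {grp: i for i, grp in enumerate(sorted(set(people.values())))}
for name, grp in people.items():
    G.nodes[name]['community'] = community_ids[grp]

# Score every non-adjacent pair; sort each pair so lookups are order-independent.
scores = {tuple(sorted((u, v))): p for u, v, p in nx.cn_soundarajan_hopcroft(G)}
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, score)
```

Here Ann and Dee share two neighbors, one of which (Bob) is in their own community, so they score 3. Bob and Cid are in different communities and score only their two common neighbors.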

Grand Finale

I hope you like the little music piece that I wrote for you :) I recommend that you read a good Python programming book and a good book on algorithms. Good luck!
