Test/Train data set for Graph Network

Question

I have a graphical network that I am creating as follows:

g=nx.read_edgelist(data, create_using=nx.Graph())

I am trying to create a test and train set for the data. I tried using the below command:

train, test = train_test_split(g, test_size=0.2)

but this did not work. Can you please advise how I am supposed to create a test and train set when I have a graph network?

Answer 1

Depending on your task, you can use StellarGraph's EdgeSplitter class (docs) or scikit-learn's train_test_split function (docs) to do this.

Node classification

If your task is node classification, the StellarGraph example "Node classification with Graph Convolutional Network (GCN)" shows how to load the data and do the train-test split. It uses the Cora dataset. The most important steps are the following:

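import stellargraph as sg
from IPython.display import display, HTML
from sklearn import model_selection, preprocessing
from stellargraph.mapper import FullBatchNodeGenerator

dataset = sg.datasets.Cora()
display(HTML(dataset.description))
G, node_subjects = dataset.load()

# Split the labelled nodes: 140 for training, then 500 of the rest for validation.
train_subjects, test_subjects = model_selection.train_test_split(
    node_subjects, train_size=140, test_size=None, stratify=node_subjects
)
val_subjects, test_subjects = model_selection.train_test_split(
    test_subjects, train_size=500, test_size=None, stratify=test_subjects
)

# One-hot encode the node labels, as in the linked example.
target_encoding = preprocessing.LabelBinarizer()
train_targets = target_encoding.fit_transform(train_subjects)
val_targets = target_encoding.transform(val_subjects)
test_targets = target_encoding.transform(test_subjects)

# The generator feeds the whole graph to the model; flow() selects the nodes to use.
generator = FullBatchNodeGenerator(G, method="gcn")
train_gen = generator.flow(train_subjects.index, train_targets)
val_gen = generator.flow(val_subjects.index, val_targets)
test_gen = generator.flow(test_subjects.index, test_targets)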

Basically, it's the same as a train-test split for a normal classification task, except that what we split here is the set of nodes (their IDs and labels), not the graph object itself.
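To make the idea concrete without StellarGraph, here is a minimal, library-free sketch that splits the node IDs found in an edge list. The function name split_nodes and the toy edge list are made up for illustration. With NetworkX and scikit-learn, passing list(g.nodes()) to train_test_split instead of g should give you the same kind of split.

```python
import random

def split_nodes(edges, test_size=0.2, seed=0):
    """Split the node IDs found in an edge list into train and test lists.

    Hypothetical helper for illustration only; it is not part of
    NetworkX, scikit-learn or StellarGraph.
    """
    # Collect the unique node IDs; sorting makes the shuffle reproducible.
    nodes = sorted({node for edge in edges for node in edge})
    rng = random.Random(seed)
    rng.shuffle(nodes)
    # Hold out at least one node for testing.
    n_test = max(1, round(len(nodes) * test_size))
    return nodes[n_test:], nodes[:n_test]

# Toy edge list with 10 distinct nodes (assumed data, not from the question).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (2, 5),
         (6, 1), (7, 3), (8, 4), (9, 10), (10, 1)]
train_nodes, test_nodes = split_nodes(edges, test_size=0.2, seed=42)
print(len(train_nodes), "train nodes,", len(test_nodes), "test nodes")
```

The graph itself stays intact here; only the list of nodes used for training and evaluation is partitioned.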

Edge classification

If your task is edge classification (link prediction), have a look at the StellarGraph example "Link prediction example: GCN on the Cora citation dataset". The most relevant code for the train-test split is:

from stellargraph.data import EdgeSplitter
from stellargraph.mapper import FullBatchLinkGenerator

# Define an edge splitter on the original graph G:
edge_splitter_test = EdgeSplitter(G)

# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G, and obtain the
# reduced graph G_test with the sampled links removed:
G_test, edge_ids_test, edge_labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global", keep_connected=True
)

# Define an edge splitter on the reduced graph G_test:
edge_splitter_train = EdgeSplitter(G_test)

# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G_test, and obtain the
# reduced graph G_train with the sampled links removed:
G_train, edge_ids_train, edge_labels_train = edge_splitter_train.train_test_split(
    p=0.1, method="global", keep_connected=True
)

# For training we create a generator on the G_train graph, and make an
# iterator over the training links using the generator’s flow() method:
train_gen = FullBatchLinkGenerator(G_train, method="gcn")
train_flow = train_gen.flow(edge_ids_train, edge_labels_train)

# For testing we do the same with a generator on the G_test graph:
test_gen = FullBatchLinkGenerator(G_test, method="gcn")
test_flow = test_gen.flow(edge_ids_test, edge_labels_test)

The splitting algorithm behind the EdgeSplitter class (docs) is more complex because it has to preserve the graph structure while splitting, for example by keeping the graph connected. For more details, see the source code for EdgeSplitter.
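To illustrate what "keeping the graph connected" can mean, here is a rough, library-free sketch. It is one possible approach and not necessarily how EdgeSplitter implements it: build a random spanning forest with union-find, never remove its edges, and hold out only the remaining edges, together with an equal number of sampled non-edges as negative examples. The name split_edges and the toy graph are assumptions for illustration.

```python
import random

def split_edges(edges, p=0.1, seed=0):
    """Hold out about a fraction p of edges without disconnecting the graph.

    Hypothetical helper; returns (kept_edges, held_out_edges, negative_pairs).
    Assumes an undirected graph without self-loops or duplicate edges.
    """
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)

    parent = {}  # union-find forest over node IDs

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Edges that join two components form a spanning forest and are never
    # removed; all other edges are safe to hold out.
    tree_edges, spare_edges = [], []
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree_edges.append((u, v))
        else:
            spare_edges.append((u, v))

    n_holdout = min(len(spare_edges), round(p * len(edges)))
    held_out = spare_edges[:n_holdout]
    kept = tree_edges + spare_edges[n_holdout:]

    # Sample the same number of node pairs that are NOT edges (negatives).
    nodes = sorted(parent)
    existing = {frozenset(edge) for edge in edges}
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    negatives = set()
    while len(negatives) < min(n_holdout, max_negatives):
        pair = frozenset(rng.sample(nodes, 2))
        if pair not in existing:
            negatives.add(pair)
    return kept, held_out, [tuple(sorted(pair)) for pair in negatives]

# Toy graph: a 6-node ring plus four chords (assumed data).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (0, 2), (1, 3), (2, 4), (3, 5)]
kept, held_out, negatives = split_edges(edges, p=0.2, seed=1)
print(len(kept), "kept,", len(held_out), "held out,", len(negatives), "negatives")
```

The held-out edges (label 1) and the sampled non-edges (label 0) play the role of edge_ids_test and edge_labels_test above, while the kept edges form the reduced graph.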
