
How to select a good sample size of nodes from a graph

I have a network with a node attribute labeled as 0 or 1. I want to find out how the distance between nodes with the same attribute differs from the distance between nodes with different attributes. As it is computationally difficult to find the distance between all combinations of nodes, I want to select a sample of nodes instead. How should I select the sample size? I am working in Python with networkx.

You've not given many details, so I'll invent some data and make assumptions in the hope it's useful.

Start by importing packages and sampling a dataset:

import random
import networkx as nx

# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)

# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

Next, calculate the shortest paths between random pairs of nodes:

results = []

# I had to use 100,000 pairs to get the CI small enough below
for _ in range(100000):
    a, b = random.sample(list(G.nodes), 2)
    try:
        n = nx.algorithms.shortest_path_length(G, a, b)
    except nx.NetworkXNoPath:
        # no path between nodes found
        n = -1
    results.append((a, b, n))

Finally, here is some code to summarise the results and print them out:

from collections import Counter
from scipy import stats

# counters for pairs where both labels are 0, both are 1, or the labels differ
c_0 = Counter()
c_1 = Counter()
c_d = Counter()

# accumulate distances into the above counters
node_data = {i: a['label'] for i, a in G.nodes.data()}
cc = { (0,0): c_0, (0,1): c_d, (1,0): c_d, (1,1): c_1 }
for a, b, n in results:
    cc[node_data[a], node_data[b]][n] += 1

# code to display the results nicely
def show(c, title):
    s = sum(c.values())
    print(f'{title},  n={s}')
    for k, n in sorted(c.items()):
        # 95% interval for the Monte Carlo error (quantiles of a Beta(1 + n, 1 + s - n) posterior)
        lo, hi = stats.beta.ppf([0.025, 0.975], 1 + n, 1 + s - n)
        print(f'{k:5}: {n:5} = {n/s:6.2%} [{lo:6.2%}, {hi:6.2%}]')

show(c_0, 'both 0')
show(c_1, 'both 1')
show(c_d, 'different')

The above prints out:

both 0,  n=63930
   -1: 60806 = 95.11% [94.94%, 95.28%]
    1:   107 =  0.17% [ 0.14%,  0.20%]
    2:   753 =  1.18% [ 1.10%,  1.26%]
    3:  1137 =  1.78% [ 1.68%,  1.88%]
    4:   584 =  0.91% [ 0.84%,  0.99%]
    5:   334 =  0.52% [ 0.47%,  0.58%]
    6:   154 =  0.24% [ 0.21%,  0.28%]
    7:    50 =  0.08% [ 0.06%,  0.10%]
    8:     3 =  0.00% [ 0.00%,  0.01%]
    9:     2 =  0.00% [ 0.00%,  0.01%]

both 1,  n=3978
   -1:  3837 = 96.46% [95.83%, 96.99%]
    1:     6 =  0.15% [ 0.07%,  0.33%]
    2:    34 =  0.85% [ 0.61%,  1.19%]
    3:    34 =  0.85% [ 0.61%,  1.19%]
    4:    31 =  0.78% [ 0.55%,  1.10%]
    5:    30 =  0.75% [ 0.53%,  1.07%]
    6:     6 =  0.15% [ 0.07%,  0.33%]

To save space I've cut off the section where the labels differ. The values in the square brackets are the 95% CI of the Monte Carlo error. Using more iterations above allows you to reduce this error, while obviously taking more CPU time.
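For a rough feel of how the number of sampled pairs drives that Monte Carlo error, here is a small sketch (my own addition, using the same Beta-quantile interval as `show()` above; the proportion 1.2% is just an illustrative value, roughly the "both 0, distance 2" row):

from scipy import stats

# how the width of the 95% interval shrinks as the number of sampled pairs grows,
# using the same Beta(1 + k, 1 + n - k) quantiles as show() above
p = 0.012  # illustrative proportion, not a measured value
for n in (1_000, 10_000, 100_000, 1_000_000):
    k = round(p * n)
    lo, hi = stats.beta.ppf([0.025, 0.975], 1 + k, 1 + n - k)
    print(f'n={n:>9,}  CI width = {hi - lo:.3%}')

The width shrinks roughly with the square root of the number of pairs, so cutting the error by a factor of 10 costs about 100 times as many pairs.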

This is more or less an extension of my discussion with Sam Mason, and I only want to give you some timing numbers, because, as discussed, retrieving all distances may be feasible and may even be faster. Based on the code in Sam Mason's answer, I tested both variants: for 1000 nodes, retrieving all distances is much faster than sampling 100,000 pairs. The main advantage is that all retrieved distances are used.

import random
import networkx as nx

import time


# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)

# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

def timing(f):
    def wrap(*args, **kwargs):
        time1 = time.time()
        ret = f(*args, **kwargs)
        time2 = time.time()
        print('{:s} function took {:.3f} ms'.format(f.__name__, (time2-time1)*1000.0))

        return ret
    return wrap

@timing
def get_sample_distance():
    results = []
    # I had to use 100,000 pairs to get the CI small enough below
    for _ in range(100000):
        a, b = random.sample(list(G.nodes), 2)
        try:
            n = nx.algorithms.shortest_path_length(G, a, b)
        except nx.NetworkXNoPath:
            # no path between nodes found
            n = -1
        results.append((a, b, n))

@timing
def get_all_distances():
    # without source/target, shortest_path_length returns a generator in NetworkX 2.x,
    # so materialise it to make sure the distances are actually computed
    all_distances = dict(nx.shortest_path_length(G))

get_sample_distance()
# get_sample_distance function took 2338.038 ms

get_all_distances()
# get_all_distances function took 304.247 ms
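To illustrate the point that all retrieved distances get used, here is a rough sketch (my own addition, reusing `G`, `node_data` and the counters `cc` from the first answer) of how the full set of distances could feed the same summary. Note that unreachable pairs simply never appear here, unlike the `-1` entries in the sampled version:

# nx.shortest_path_length(G) yields (source, {target: distance}) pairs;
# accumulate every directed pair into the counters from the first answer
for source, dist_dict in nx.shortest_path_length(G):
    for target, d in dist_dict.items():
        if source == target:
            continue  # skip the zero-length self-distance
        cc[node_data[source], node_data[target]][d] += 1

The counters can then be printed with the same show() calls as before.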