
How to select a good sample size of nodes from a graph

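I have a network whose nodes carry an attribute labelled 0 or 1. I want to find out how the distance between nodes with the same attribute differs from the distance between nodes with different attributes. Since computing the distance between every combination of nodes is computationally expensive, I want to select a sample of nodes. How should I choose the sample size? I am working in Python with networkx.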

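You haven't given much detail, so I'll invent some data and make some assumptions in the hope that it's useful.

Start by importing the packages and generating a sample dataset: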

import random
import networkx as nx

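# human social networks tend to be "scale-free"
# note: scale_free_graph returns a directed MultiDiGraph, so many
# ordered pairs of nodes have no path between them
G = nx.generators.scale_free_graph(1000)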

# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

Next, compute the shortest path length between random pairs of nodes:

results = []

# I had to use 100,000 pairs to get the CI small enough below
nodes = list(G.nodes)  # build the node list once, not on every iteration
for _ in range(100000):
    a, b = random.sample(nodes, 2)
    try:
        n = nx.algorithms.shortest_path_length(G, a, b)
    except nx.NetworkXNoPath:
        # no path between nodes found
        n = -1
    results.append((a, b, n))

Finally, here is some code to summarise the results and print them out:

from collections import Counter
from scipy import stats

# somewhere to store counts for: both 0, both 1, different labels
c_0 = Counter()
c_1 = Counter()
c_d = Counter()

# accumulate distances into the above counters
node_data = {i: a['label'] for i, a in G.nodes.data()}
cc = { (0,0): c_0, (0,1): c_d, (1,0): c_d, (1,1): c_1 }
for a, b, n in results:
    cc[node_data[a], node_data[b]][n] += 1

# code to display the results nicely
def show(c, title):
    s = sum(c.values())
    print(f'{title},  n={s}')
    for k, n in sorted(c.items()):
        # 95% interval for the Monte Carlo error of this proportion:
        # quantiles of Beta(1 + n, 1 + s - n), i.e. a uniform prior
        lo, hi = stats.beta.ppf([0.025, 0.975], 1 + n, 1 + s - n)
        print(f'{k:5}: {n:5} = {n/s:6.2%} [{lo:6.2%}, {hi:6.2%}]')

show(c_0, 'both 0')
show(c_1, 'both 1')
show(c_d, 'different')

The above prints:

both 0,  n=63930
   -1: 60806 = 95.11% [94.94%, 95.28%]
    1:   107 =  0.17% [ 0.14%,  0.20%]
    2:   753 =  1.18% [ 1.10%,  1.26%]
    3:  1137 =  1.78% [ 1.68%,  1.88%]
    4:   584 =  0.91% [ 0.84%,  0.99%]
    5:   334 =  0.52% [ 0.47%,  0.58%]
    6:   154 =  0.24% [ 0.21%,  0.28%]
    7:    50 =  0.08% [ 0.06%,  0.10%]
    8:     3 =  0.00% [ 0.00%,  0.01%]
    9:     2 =  0.00% [ 0.00%,  0.01%]

both 1,  n=3978
   -1:  3837 = 96.46% [95.83%, 96.99%]
    1:     6 =  0.15% [ 0.07%,  0.33%]
    2:    34 =  0.85% [ 0.61%,  1.19%]
    3:    34 =  0.85% [ 0.61%,  1.19%]
    4:    31 =  0.78% [ 0.55%,  1.10%]
    5:    30 =  0.75% [ 0.53%,  1.07%]
    6:     6 =  0.15% [ 0.07%,  0.33%]

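I have cut off the section for pairs with different labels to save space. The proportions in square brackets are the 95% CI of the Monte Carlo error. Using more iterations above reduces this error, at the obvious cost of more CPU time.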
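
On the original question of how many pairs to sample, a rough rule of thumb follows from the normal approximation to a binomial proportion: n ≈ z² · p · (1 − p) / w², where p is the proportion you expect for a given distance and w is the CI half-width you are willing to accept. The snippet below is only a sketch of that calculation. `pairs_needed` is a name made up here, and the 20% label rate is the assumption from the invented data above, not something from your network.

```python
import math
from statistics import NormalDist


def pairs_needed(p, half_width, confidence=0.95):
    """Approximate number of sampled pairs so that a proportion near `p`
    gets a CI of about +/- `half_width` (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)


# pairs needed *within one label group* for a ~1.8% bin at +/- 0.1%
within_group = pairs_needed(0.0178, 0.001)

# assumption: labels are independent with P(label == 1) = 0.2, so only
# about 0.2 * 0.2 = 4% of uniformly sampled pairs land in "both 1"
total_for_both_1 = math.ceil(within_group / (0.2 * 0.2))

print(within_group, total_for_both_1)
```

The rarest group drives the total, so if the "both 1" estimates matter most, it may be cheaper to sample pairs inside each label group separately instead of sampling uniformly over all nodes.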

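This is more or less an extension of my discussion with Sam Mason. I only want to give you some timing numbers because, as discussed, retrieving all distances may be feasible, and perhaps even faster. Based on the code in Sam Mason's answer, I tested both variants, and for 1000 nodes retrieving all distances is much faster than sampling 100,000 pairs. The main advantage is that all of the retrieved distances are used.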

import random
import networkx as nx

import time


# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)

# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

def timing(f):
    def wrap(*args, **kwargs):
        time1 = time.time()
        ret = f(*args, **kwargs)
        time2 = time.time()
        print('{:s} function took {:.3f} ms'.format(f.__name__, (time2-time1)*1000.0))

        return ret
    return wrap

@timing
def get_sample_distance():
    results = []
    # I had to use 100,000 pairs to get the CI small enough below
    for _ in range(100000):
        a, b = random.sample(list(G.nodes), 2)
        try:
            n = nx.algorithms.shortest_path_length(G, a, b)
        except nx.NetworkXNoPath:
            # no path between nodes found
            n = -1
        results.append((a, b, n))
    return results

@timing
def get_all_distances():
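    # without source/target, NetworkX 2.x and later return a lazy iterator
    # here, so wrap it in dict() to force the computation being timed
    return dict(nx.shortest_path_length(G))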

get_sample_distance()
# get_sample_distance function took 2338.038 ms

get_all_distances()
# get_all_distances function took 304.247 ms
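
One way to actually use every retrieved distance is to run one breadth-first search per source node and tally the lengths by label group directly. The snippet below is a minimal sketch under that assumption. `tally_distances` and its `sources` parameter are hypothetical names introduced here, and the graph is kept small so it runs quickly. Passing a random subset of sources gives a middle ground between sampling pairs and computing everything.

```python
import random
from collections import Counter

import networkx as nx


def tally_distances(G, labels, sources=None):
    """Count shortest-path lengths per label group, one BFS per source.

    Unreachable targets are recorded under -1, as in the sampled version.
    """
    groups = {"both 0": Counter(), "both 1": Counter(), "different": Counter()}
    for s in (G.nodes if sources is None else sources):
        # distances from s to every node reachable from s
        dist = nx.single_source_shortest_path_length(G, s)
        for t in G.nodes:
            if t == s:
                continue
            if labels[s] != labels[t]:
                key = "different"
            else:
                key = f"both {labels[s]}"
            groups[key][dist.get(t, -1)] += 1
    return groups


random.seed(0)
G = nx.scale_free_graph(200, seed=0)
labels = {v: 1 if random.random() < 0.2 else 0 for v in G.nodes}

# every ordered pair of distinct nodes
exact = tally_distances(G, labels)

# only 20 randomly chosen sources, each paired with all other nodes
approx = tally_distances(G, labels, sources=random.sample(list(G.nodes), 20))
```

The resulting counters have the same shape as `c_0`, `c_1` and `c_d` in Sam Mason's answer, so they can be passed to the same `show` function.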

