How to select a good sample size of nodes from a graph
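I have a network whose nodes carry an attribute labelled 0 or 1. I want to find out how the distance between nodes with the same attribute differs from the distance between nodes with different attributes. Since computing the distance between every pair of nodes is computationally expensive, I want to work with a sample of nodes instead. How should I choose the sample size? I am working in Python with networkx.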
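You haven't given many details, so I'll make up some data and state my assumptions; hopefully it is still useful.
First, import the packages and generate an example data set: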
import random
import networkx as nx

# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)

# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0
Next, compute the shortest path length between randomly chosen pairs of nodes:
results = []

# I had to use 100,000 pairs to get the CI small enough below
for _ in range(100000):
    a, b = random.sample(list(G.nodes), 2)
    try:
        n = nx.algorithms.shortest_path_length(G, a, b)
    except nx.NetworkXNoPath:
        # no path between nodes found
        n = -1
    results.append((a, b, n))
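A possible variation on the loop above, in case the per-pair searches turn out to be the bottleneck: sample a few source nodes and run one breadth-first search per source, which yields the distance to every other node at once. This is only a sketch; the graph size, the seeds and the choice of 30 sources are illustrative assumptions. Note also that pairs sharing a source are not independent draws, so a CI that treats every pair as independent will look narrower than it really is.

```python
import random
import networkx as nx

random.seed(1)                        # illustrative seed, only for reproducibility
G = nx.scale_free_graph(300, seed=1)  # smaller example graph (assumption)
nodes = list(G.nodes)

results = []
# 30 sampled sources is an arbitrary illustrative choice
for a in random.sample(nodes, 30):
    # one BFS from a gives the distance to every node reachable from a
    dists = nx.single_source_shortest_path_length(G, a)
    for b in nodes:
        if b != a:
            # -1 marks "no path", matching the convention used above
            results.append((a, b, dists.get(b, -1)))

print(len(results))
```

The `results` list has the same `(a, b, n)` layout as before, so the same summary code can consume it.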
Finally, here is some code that summarizes the results and prints them:
from collections import Counter
from scipy import stats

# counters for pairs that are both 0, both 1, or differently labelled
c_0 = Counter()
c_1 = Counter()
c_d = Counter()

# accumulate distances into the above counters
node_data = {i: a['label'] for i, a in G.nodes.data()}
cc = { (0,0): c_0, (0,1): c_d, (1,0): c_d, (1,1): c_1 }
for a, b, n in results:
    cc[node_data[a], node_data[b]][n] += 1

# code to display the results nicely
def show(c, title):
    s = sum(c.values())
    print(f'{title}, n={s}')
    for k, n in sorted(c.items()):
        # calculate some sort of CI over monte carlo error
        lo, hi = stats.beta.ppf([0.025, 0.975], 1 + n, 1 + s - n)
        print(f'{k:5}: {n:5} = {n/s:6.2%} [{lo:6.2%}, {hi:6.2%}]')

show(c_0, 'both 0')
show(c_1, 'both 1')
show(c_d, 'different')
The above prints:
both 0, n=63930
-1: 60806 = 95.11% [94.94%, 95.28%]
1: 107 = 0.17% [ 0.14%, 0.20%]
2: 753 = 1.18% [ 1.10%, 1.26%]
3: 1137 = 1.78% [ 1.68%, 1.88%]
4: 584 = 0.91% [ 0.84%, 0.99%]
5: 334 = 0.52% [ 0.47%, 0.58%]
6: 154 = 0.24% [ 0.21%, 0.28%]
7: 50 = 0.08% [ 0.06%, 0.10%]
8: 3 = 0.00% [ 0.00%, 0.01%]
9: 2 = 0.00% [ 0.00%, 0.01%]
both 1, n=3978
-1: 3837 = 96.46% [95.83%, 96.99%]
1: 6 = 0.15% [ 0.07%, 0.33%]
2: 34 = 0.85% [ 0.61%, 1.19%]
3: 34 = 0.85% [ 0.61%, 1.19%]
4: 31 = 0.78% [ 0.55%, 1.10%]
5: 30 = 0.75% [ 0.53%, 1.07%]
6: 6 = 0.15% [ 0.07%, 0.33%]
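To save space I have cut off the part for differently labelled pairs. A distance of -1 means no path was found between the two nodes. The proportions in square brackets are 95% CIs for the Monte Carlo error. Using more iterations in the sampling loop reduces this error, at the cost of more CPU time.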
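To turn this into a rough rule for choosing the number of pairs, one option is the usual normal approximation for a proportion: the half-width of the CI is about z * sqrt(p * (1 - p) / n). The sketch below solves that for n; the helper names `pairs_needed` and `beta_interval` are made up for this illustration, and the approximation is an assumption that gets shaky for very rare distances.

```python
import math
from scipy import stats

def pairs_needed(p, half_width, confidence=0.95):
    """Approximate number of sampled pairs so that the CI for a proportion
    near p has about the requested half-width (normal approximation)."""
    z = stats.norm.ppf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)

def beta_interval(count, total, confidence=0.95):
    """Interval from Beta(1 + count, 1 + total - count), as in show()."""
    tail = (1 - confidence) / 2
    lo, hi = stats.beta.ppf([tail, 1 - tail], 1 + count, 1 + total - count)
    return lo, hi

# a distance seen in about 1% of pairs, to be estimated within
# +/- 0.1 percentage points
print(pairs_needed(0.01, 0.001))

# the interval narrows roughly like 1 / sqrt(total)
for total in (1_000, 10_000, 100_000):
    lo, hi = beta_interval(round(0.01 * total), total)
    print(f'{total:7}: [{lo:.3%}, {hi:.3%}]')
```

Keep in mind that the relevant sample size is the count within each group: in the run above only about 4% of the sampled pairs were "both 1", which is why that group has the widest intervals.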
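This is more or less an extension of my discussion with Sam Mason. I just want to give you some timing numbers because, as discussed, retrieving all distances may be feasible and perhaps even faster. Based on the code in Sam Mason's answer, I tested both variants, and for 1000 nodes retrieving all distances is much faster than sampling 100,000 pairs. The main advantage is that every distance that gets computed is actually used.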
import random
import networkx as nx
import time

# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)

# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

def timing(f):
    # decorator that prints how long each call to f takes
    def wrap(*args, **kwargs):
        time1 = time.time()
        ret = f(*args, **kwargs)
        time2 = time.time()
        print('{:s} function took {:.3f} ms'.format(f.__name__, (time2-time1)*1000.0))
        return ret
    return wrap

@timing
def get_sample_distance():
    results = []
    # I had to use 100,000 pairs to get the CI small enough below
    for _ in range(100000):
        a, b = random.sample(list(G.nodes), 2)
        try:
            n = nx.algorithms.shortest_path_length(G, a, b)
        except nx.NetworkXNoPath:
            # no path between nodes found
            n = -1
        results.append((a, b, n))
    return results

@timing
def get_all_distances():
    # dict() forces the computation: recent networkx versions
    # return a lazy iterator from this call
    all_distances = dict(nx.shortest_path_length(G))
    return all_distances

get_sample_distance()
# get_sample_distance function took 2338.038 ms
get_all_distances()
# get_all_distances function took 304.247 ms
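To show what "using all distances" could look like in practice, here is a minimal sketch that feeds every ordered pair into the same kind of per-label counters as in Sam Mason's answer. The smaller graph and the seeds are assumptions made to keep the example quick; unreachable pairs are counted under -1 to match the sampling code.

```python
import random
from collections import Counter
import networkx as nx

random.seed(0)                        # illustrative seed, only for reproducibility
G = nx.scale_free_graph(200, seed=0)  # smaller example graph (assumption)
for _, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0
labels = {i: a['label'] for i, a in G.nodes.data()}

# one counter per label combination
c_0, c_1, c_d = Counter(), Counter(), Counter()
cc = {(0, 0): c_0, (0, 1): c_d, (1, 0): c_d, (1, 1): c_1}

n_nodes = G.number_of_nodes()
for a, dists in nx.all_pairs_shortest_path_length(G):
    for b in G.nodes:
        if b == a:
            continue
        # dists only holds nodes reachable from a; the rest count as -1
        cc[labels[a], labels[b]][dists.get(b, -1)] += 1

total = sum(sum(c.values()) for c in (c_0, c_1, c_d))
print(total == n_nodes * (n_nodes - 1))
```

Because every ordered pair is counted exactly once, the resulting proportions carry no Monte Carlo error, so no CI is needed for them.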