I have a piece of code that loops through a set of nodes and counts the length of the path connecting the given node to each other node in my network. For each node my code returns a list, b, containing integer values giving the path length for every possible connection. I want to count the number of occurrences of each path length so I can create a histogram.
local_path_length_hist = {}
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    # Delete some erroneous entries
    b = a[a != 2147483647]
    for dist in b:
        if dist in local_path_length_hist:
            local_path_length_hist[dist] += 1
        else:
            local_path_length_hist[dist] = 1
This presumably is very crude coding as far as the dictionary update is concerned. Is there a better way of doing this? What is the most efficient way of creating this histogram?
The check that the element exists in the dict is not really necessary. You can just use collections.defaultdict. Its initializer accepts a callable object (such as a function or type) that will be called to generate the default value whenever you access (or assign to) a key that does not exist yet. For your case, it can just be int. I.e.:
import collections

local_path_length_hist = collections.defaultdict(int)
# you could say collections.defaultdict(lambda: 0) instead
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    # Delete some erroneous entries
    b = a[a != 2147483647]
    for dist in b:
        local_path_length_hist[dist] += 1
You could collapse the last two lines into one, but there is really no point.
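To see the defaultdict behavior in isolation, here is a minimal standalone demo (the distance values are made up, independent of the graph code above):

```python
import collections

# Counting with defaultdict(int): a missing key starts at 0,
# so no existence check is needed before incrementing.
hist = collections.defaultdict(int)
for dist in [1, 2, 2, 3, 2147483647, 1]:
    hist[dist] += 1

print(dict(hist))  # {1: 2, 2: 2, 3: 1, 2147483647: 1}
```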
Since gt.shortest_distance returns an ndarray, numpy math is fastest:
max_dist = len(vertices) - 1
hist_length = max_dist + 2
no_path_dist = max_dist + 1

hist = np.zeros(hist_length)
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    hist += np.bincount(dist.a.clip(max=no_path_dist))
I use the ndarray method clip to bin the 2147483647 values returned by gt.shortest_distance at the last position of hist. Without the use of clip, hist's size would have to be 2147483647 + 1 on 64-bit Python, or bincount would produce a ValueError on 32-bit Python. So the last position of hist will contain a count of all non-paths; you can ignore this value in your histogram analysis.
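To illustrate the clip-then-bincount trick with a small standalone example (the distance values below are made up):

```python
import numpy as np

# Distances with two "no path" sentinels (2147483647).
dist = np.array([1, 2, 2, 3, 2147483647, 2147483647])

max_dist = 3                  # pretend the graph has 4 vertices
no_path_dist = max_dist + 1   # sentinel bin: index 4

# clip folds every sentinel into the last bin instead of
# forcing bincount to allocate 2147483648 slots.
hist = np.bincount(dist.clip(max=no_path_dist))

print(hist)  # [0 1 2 1 2] -> counts for distances 0..3, plus 2 non-paths
```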
As the below timings indicate, using numpy math to obtain a histogram is well over an order of magnitude faster than using either defaultdicts or counters (Python 3.4):
# vertices      numpy    defaultdict       counter
  9000        0.83639       38.48990      33.56569
  25000       8.57003      314.24265     262.76025
  50000      26.46427     1303.50843    1111.93898
My computer is too slow to test with 9 * (10**6) vertices, but the relative timings seem pretty consistent for a varying number of vertices (as we would expect).
Timing code:
from collections import defaultdict, Counter
import numpy as np
from random import randint, choice
from timeit import repeat

# construct distance ndarray such that:
# a) 1/3 of values represent no path
# b) 2/3 of values are a random integer value [0, (num_vertices - 1)]
num_vertices = 50000
no_path_length = 2147483647

distances = []
for _ in range(num_vertices):
    rand_dist = randint(0, num_vertices - 1)
    distances.append(choice((no_path_length, rand_dist, rand_dist)))
dist_a = np.array(distances)

def use_numpy_math():
    max_dist = num_vertices - 1
    hist_length = max_dist + 2
    no_path_dist = max_dist + 1
    hist = np.zeros(hist_length, dtype=int)
    for _ in range(num_vertices):
        hist += np.bincount(dist_a.clip(max=no_path_dist))

def use_default_dict():
    d = defaultdict(int)
    for _ in range(num_vertices):
        for dist in dist_a:
            d[dist] += 1

def use_counter():
    hist = Counter()
    for _ in range(num_vertices):
        hist.update(dist_a)

t1 = min(repeat(stmt='use_numpy_math()', setup='from __main__ import use_numpy_math',
                repeat=3, number=1))
t2 = min(repeat(stmt='use_default_dict()', setup='from __main__ import use_default_dict',
                repeat=3, number=1))
t3 = min(repeat(stmt='use_counter()', setup='from __main__ import use_counter',
                repeat=3, number=1))
print('%0.5f, %0.5f, %0.5f' % (t1, t2, t3))
There is a utility in the collections module called Counter. This is even cleaner than using a defaultdict(int):
from collections import Counter

hist = Counter()
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    # Delete some erroneous entries
    b = a[a != 2147483647]
    hist.update(b)
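As a quick standalone illustration of how Counter.update accumulates counts from array data (the values below are made up):

```python
from collections import Counter
import numpy as np

hist = Counter()
# update() accepts any iterable, including an ndarray,
# and adds to the existing counts on repeated calls.
hist.update(np.array([1, 2, 2, 3]))
hist.update(np.array([2, 3]))

print(hist[2])  # 3
```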
I think you can bypass this code entirely. Your question is tagged with graph-tool. Take a look at this section of their documentation: graph_tool.stats.vertex_hist.
Excerpt from linked documentation:
graph_tool.stats.vertex_hist(g, deg, bins=[0, 1], float_count=True)

Return the vertex histogram of the given degree type or property.

Parameters:
g : Graph
    Graph to be used.
deg : string or PropertyMap
    Degree or property to be used for the histogram. It can be either "in", "out" or "total", for in-, out-, or total degree of the vertices. It can also be a vertex property map.
bins : list of bins (optional, default: [0, 1])
    List of bins to be used for the histogram. The values given represent the edges of the bins (ie lower and upper bounds). If the list contains two values, this will be used to automatically create an appropriate bin range, with a constant width given by the second value, and starting from the first value.
float_count : bool (optional, default: True)
    If True, the counts in each histogram bin will be returned as floats. If False, they will be returned as integers.

Returns:
counts : ndarray
    The bin counts.
bins : ndarray
    The bin edges.
This returns the bin counts and the bin edges, each as an ndarray, which together give you the histogram data directly.
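If you want to see what that (counts, bin edges) output pair looks like without graph-tool installed, np.histogram produces the same shape of result from raw distance values; a minimal sketch with made-up values, using the filtered array b from the question:

```python
import numpy as np

# Raw finite path lengths (sentinel "no path" values already removed).
b = np.array([1, 1, 2, 2, 2, 3])

# Like vertex_hist, np.histogram returns (counts, bin_edges).
counts, bins = np.histogram(b, bins=[0, 1, 2, 3, 4])

print(counts)  # [0 2 3 1]
print(bins)    # [0 1 2 3 4]
```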