Iteratively count elements in list and store count in dictionary
I have a piece of code that loops through a set of nodes and counts the path length connecting the given node to every other node in my network. For each node, my code returns a list, b, containing integer values that give the path length for every possible connection. I want to count the number of occurrences of each path length so I can create a histogram.
local_path_length_hist = {}
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    # Delete some erroneous entries
    b = a[a != 2147483647]
    for dist in b:
        if dist in local_path_length_hist:
            local_path_length_hist[dist] += 1
        else:
            local_path_length_hist[dist] = 1
This is presumably very crude coding as far as the dictionary update is concerned. Is there a better way of doing this? What is the most efficient way of creating this histogram?
Checking whether an element already exists in the dict is not really necessary. You can just use collections.defaultdict. Its initializer accepts a callable (such as a function) that is called to generate a default value whenever you look up a key that does not exist yet, which is what += does before it assigns. In your case, that callable can simply be int, i.e.:
import collections

local_path_length_hist = collections.defaultdict(int)
# you could say collections.defaultdict(lambda: 0) instead
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    # Delete some erroneous entries
    b = a[a != 2147483647]
    for dist in b:
        local_path_length_hist[dist] += 1
You could merge the last two lines into one, but there is really no point.
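To see the defaultdict(int) behaviour in isolation, here is a minimal sketch that runs without graph-tool. The nested list path_lengths_per_vertex is made-up data standing in for the array b produced for each source vertex; it is an assumption for illustration only.

```python
from collections import defaultdict

# Hypothetical path-length lists, one per source vertex (stand-ins for `b`).
path_lengths_per_vertex = [
    [1, 2, 2, 3],
    [1, 1, 2],
    [3, 3, 4],
]

hist = defaultdict(int)  # a missing key is created with int() == 0 on first lookup
for lengths in path_lengths_per_vertex:
    for length in lengths:
        hist[length] += 1  # no "if length in hist" check needed

# Print the histogram sorted by path length.
for length in sorted(hist):
    print(length, hist[length])
```

Because hist is still a dict subclass, dict(hist) gives a plain dictionary if you need one afterwards.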
Since gt.shortest_distance gives you an ndarray (via dist.a), numpy math is fastest:
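import numpy as np

max_dist = len(vertices) - 1
hist_length = max_dist + 2
no_path_dist = max_dist + 1
hist = np.zeros(hist_length)
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    # minlength keeps every bincount the same length as hist,
    # even when no value lands in the last (no-path) bin
    hist += np.bincount(dist.a.clip(max=no_path_dist), minlength=hist_length)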
I use the ndarray method clip to bin the 2147483647 values returned by gt.shortest_distance at the last position of hist. Without clip, hist's size would have to be 2147483647 + 1 on 64-bit Python, or bincount would produce a ValueError on 32-bit Python. So the last position of hist will contain a count of all non-paths; you can ignore this value in your histogram analysis.
As the timings below indicate, using numpy math to obtain a histogram is well over an order of magnitude faster than using either defaultdicts or counters (Python 3.4):
# vertices      numpy    defaultdict       counter
      9000    0.83639       38.48990      33.56569
     25000    8.57003      314.24265     262.76025
     50000   26.46427     1303.50843    1111.93898
My computer is too slow to test with 9 * (10**6) vertices, but the relative timings seem pretty consistent across varying numbers of vertices (as we would expect).
Timing code:
from collections import defaultdict, Counter
from random import randint, choice
from timeit import repeat

import numpy as np

# construct distance ndarray such that:
# a) 1/3 of values represent no path
# b) 2/3 of values are a random integer value [0, (num_vertices - 1)]
num_vertices = 50000
no_path_length = 2147483647
distances = []
for _ in range(num_vertices):
    rand_dist = randint(0, num_vertices - 1)
    distances.append(choice((no_path_length, rand_dist, rand_dist)))
dist_a = np.array(distances)


def use_numpy_math():
    max_dist = num_vertices - 1
    hist_length = max_dist + 2
    no_path_dist = max_dist + 1
    # np.int was removed in NumPy 1.24; np.intp is the dtype bincount returns
    hist = np.zeros(hist_length, dtype=np.intp)
    for _ in range(num_vertices):
        hist += np.bincount(dist_a.clip(max=no_path_dist))


def use_default_dict():
    d = defaultdict(int)
    for _ in range(num_vertices):
        for dist in dist_a:
            d[dist] += 1


def use_counter():
    hist = Counter()
    for _ in range(num_vertices):
        hist.update(dist_a)


t1 = min(repeat(stmt='use_numpy_math()', setup='from __main__ import use_numpy_math',
                repeat=3, number=1))
t2 = min(repeat(stmt='use_default_dict()', setup='from __main__ import use_default_dict',
                repeat=3, number=1))
t3 = min(repeat(stmt='use_counter()', setup='from __main__ import use_counter',
                repeat=3, number=1))
print('%0.5f, %0.5f, %0.5f' % (t1, t2, t3))
There is a utility in the collections module called Counter. This is even cleaner than using a defaultdict(int):
from collections import Counter

hist = Counter()
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    # Delete some erroneous entries
    b = a[a != 2147483647]
    hist.update(b)
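A small standalone sketch of the same pattern, for trying it without graph-tool: each call to update adds the counts of one iterable to the running totals. The batches list is invented data standing in for the per-vertex arrays b.

```python
from collections import Counter

# Hypothetical path-length batches, one per source vertex (stand-ins for `b`).
batches = [
    [1, 2, 2],
    [2, 3],
    [2, 2, 4],
]

hist = Counter()
for lengths in batches:
    hist.update(lengths)  # adds one count per element of the iterable

# Histogram as (path length, count) pairs sorted by path length.
sorted_hist = sorted(hist.items())
print(sorted_hist)

# The single most frequent path length and its count.
most_frequent = hist.most_common(1)
print(most_frequent)
```

Counter also returns 0 for lengths that never occurred, so looking up an absent path length does not raise a KeyError.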
I think you can bypass this code entirely. Your question is tagged with graph-tool. Take a look at this section of their documentation: graph_tool.stats.vertex_hist.

Excerpt from the linked documentation:
graph_tool.stats.vertex_hist(g, deg, bins=[0, 1], float_count=True)

Return the vertex histogram of the given degree type or property.

Parameters:
    g : Graph
        Graph to be used.
    deg : string or PropertyMap
        Degree or property to be used for the histogram. It can be either "in", "out" or "total", for in-, out-, or total degree of the vertices. It can also be a vertex property map.
    bins : list of bins (optional, default: [0, 1])
        List of bins to be used for the histogram. The values given represent the edges of the bins (i.e. lower and upper bounds). If the list contains two values, this will be used to automatically create an appropriate bin range, with a constant width given by the second value, and starting from the first value.
    float_count : bool (optional, default: True)
        If True, the counts in each histogram bin will be returned as floats. If False, they will be returned as integers.

Returns:
    counts : ndarray
        The bin counts.
    bins : ndarray
        The bin edges.
This returns the bin counts and the bin edges as ndarrays, so you can use the counts directly to generate the histogram.
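To get a feel for the "counts plus bin edges" result described in the excerpt, here is a rough analogue using NumPy's np.histogram. This is not the graph-tool API, only a stand-in that returns the same kind of pair; the values array and the unit-width edges are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical per-vertex values, e.g. distances from one source vertex.
values = np.array([1, 1, 2, 3, 3, 3])

# Unit-width bin edges 0, 1, 2, 3, 4 (the last bin is closed on the right).
edges = np.arange(0, values.max() + 2)

counts, bin_edges = np.histogram(values, bins=edges)

# Pair each bin's lower edge with its count.
for lower, count in zip(bin_edges[:-1], counts):
    print(lower, count)
```

Note that there is always one more edge than there are counts, since each count belongs to the interval between two consecutive edges.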