
Iteratively count elements in list and store count in dictionary

I have a piece of code that loops through a set of nodes and counts the path length connecting the given node to each other node in my network. For each node my code returns a list, b, containing integer values that give the path length for every possible connection. I want to count the number of occurrences of given path lengths so I can create a histogram.

local_path_length_hist = {}
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    for dist in b:
        if dist in local_path_length_hist:
            local_path_length_hist[dist]+=1
        else:
            local_path_length_hist[dist]=1

This is presumably very crude coding as far as the dictionary update is concerned. Is there a better way of doing this? What is the most efficient way of creating this histogram?

Checking whether the element exists in the dict is not really necessary. You can just use collections.defaultdict. Its initializer accepts a callable (such as a function) that is called to generate a default value whenever you access a key that does not exist. For your case, it can be just int, i.e.:

import collections
local_path_length_hist = collections.defaultdict(int)
# you could say collections.defaultdict(lambda : 0) instead
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    for dist in b:
        local_path_length_hist[dist] += 1

You could collapse the last two lines into one, but there is really no point.
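
To see the defaultdict behaviour in isolation, here is a minimal sketch that replaces the graph-tool output with made-up distance lists. The names fake_distances and NO_PATH are illustrative assumptions, not part of the original code.

```python
from collections import defaultdict

NO_PATH = 2147483647  # sentinel value the question filters out

# Hypothetical stand-in for the per-vertex distance arrays.
fake_distances = [
    [0, 1, 2, NO_PATH],
    [1, 0, 1, NO_PATH],
    [2, 1, 0, NO_PATH],
]

hist = defaultdict(int)  # a missing key starts at int() == 0
for row in fake_distances:
    for d in row:
        if d != NO_PATH:
            hist[d] += 1  # no "if d in hist" check needed

print(dict(hist))  # {0: 3, 1: 4, 2: 2}
```

Passing int as the factory works because int() returns 0, which is the natural starting point for a counter.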

Since the distances from gt.shortest_distance are available as an ndarray (dist.a), numpy math is fastest:

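import numpy as np

max_dist = len(vertices) - 1
hist_length = max_dist + 2
no_path_dist = max_dist + 1
hist = np.zeros(hist_length, dtype=np.int64)
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    # minlength keeps every bincount result the same length as hist
    hist += np.bincount(dist.a.clip(max=no_path_dist), minlength=hist_length)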

I use the ndarray method clip to bin the 2147483647 values returned by gt.shortest_distance at the last position of hist. Without clip, hist's size would have to be 2147483647 + 1 on 64-bit Python, or bincount would produce a ValueError on 32-bit Python. So the last position of hist contains a count of all non-paths; you can ignore this value in your histogram analysis.
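
The effect of clip followed by bincount is easier to see on a tiny array. The following is a sketch with made-up distances for a hypothetical five-vertex graph; only ndarray.clip and np.bincount are real library calls, all other names are illustrative.

```python
import numpy as np

NO_PATH = 2147483647          # value reported for unreachable vertices
num_vertices = 5              # hypothetical graph size
no_path_bin = num_vertices    # index of the overflow bin (max_dist + 1)
hist_length = num_vertices + 1

# Made-up distances from one source vertex; two vertices are unreachable.
dist_a = np.array([0, 1, 1, NO_PATH, NO_PATH], dtype=np.int32)

# Every value above no_path_bin is pulled down to no_path_bin.
clipped = dist_a.clip(max=no_path_bin)

# minlength guarantees a fixed-length result even if no value hits the last bin.
counts = np.bincount(clipped, minlength=hist_length)

print(clipped.tolist())  # [0, 1, 1, 5, 5]
print(counts.tolist())   # [1, 2, 0, 0, 0, 2]
```

The final entry of counts holds the number of unreachable vertices, so slicing with counts[:-1] leaves only real path lengths.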


As the timings below indicate, using numpy math to obtain a histogram is well over an order of magnitude faster than using either defaultdicts or counters (times in seconds, Python 3.4):

# vertices      numpy    defaultdict    counter
    9000       0.83639    38.48990     33.56569
   25000       8.57003    314.24265    262.76025
   50000      26.46427   1303.50843   1111.93898

My computer is too slow to test with 9 * (10**6) vertices, but relative timings seem pretty consistent for varying numbers of vertices (as we would expect).


Timing code:

from collections import defaultdict, Counter
import numpy as np
from random import randint, choice
from timeit import repeat

# construct distance ndarray such that:
# a) 1/3 of values represent no path
# b) 2/3 of values are a random integer value [0, (num_vertices - 1)]
num_vertices = 50000
no_path_length = 2147483647
distances = []
for _ in range(num_vertices):
    rand_dist = randint(0,(num_vertices-1))
    distances.append(choice((no_path_length, rand_dist, rand_dist)))
dist_a = np.array(distances)

def use_numpy_math():
    max_dist = num_vertices - 1
    hist_length = max_dist + 2
    no_path_dist = max_dist + 1
    hist = np.zeros(hist_length, dtype=np.int64)  # np.int was removed in NumPy 1.24
    for _ in range(num_vertices):
        hist += np.bincount(dist_a.clip(max=no_path_dist), minlength=hist_length)

def use_default_dict():
    d = defaultdict(int)
    for _ in range(num_vertices):
        for dist in dist_a:
            d[dist] += 1

def use_counter():
    hist = Counter()
    for _ in range(num_vertices):
        hist.update(dist_a)

t1 = min(repeat(stmt='use_numpy_math()', setup='from __main__ import use_numpy_math',
                repeat=3, number=1))
t2 = min(repeat(stmt='use_default_dict()', setup='from __main__ import use_default_dict',
                repeat= 3, number=1))
t3 = min(repeat(stmt='use_counter()', setup='from __main__ import use_counter',
                repeat= 3, number=1))

print('%0.5f, %0.5f, %0.5f' % (t1, t2, t3))

There is a utility in the collections module called Counter. This is even cleaner than using a defaultdict(int):

from collections import Counter
hist = Counter()
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    hist.update(b)
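
A self-contained sketch of the same idea, using made-up arrays instead of graph-tool output (rows and NO_PATH are illustrative names). One detail worth noting: updating a Counter directly from an ndarray gives numpy scalar keys, so this sketch calls tolist() first to get plain Python ints.

```python
from collections import Counter

import numpy as np

NO_PATH = 2147483647  # sentinel value the question filters out

# Hypothetical per-vertex distance arrays.
rows = [
    np.array([0, 1, 2, NO_PATH]),
    np.array([1, 0, 1, 2]),
]

hist = Counter()
for a in rows:
    b = a[a != NO_PATH]      # drop the "no path" entries
    hist.update(b.tolist())  # tolist() yields plain Python ints as keys

print(sorted(hist.items()))  # [(0, 2), (1, 3), (2, 2)]
print(hist.most_common(1))   # [(1, 3)]
```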

I think you can bypass this code entirely. Judging by the tag on your question, take a look at this section of the library's documentation: graph_tool.stats.vertex_hist.

Excerpt from linked documentation:

graph_tool.stats.vertex_hist(g, deg, bins=[0, 1], float_count=True)
Return the vertex histogram of the given degree type or property.

Parameters:
g : Graph
    Graph to be used.
deg : string or PropertyMap
    Degree or property to be used for the histogram. It can be either "in", "out" or "total", for in-, out-, or total degree of the vertices. It can also be a vertex property map.
bins : list of bins (optional, default: [0, 1])
    List of bins to be used for the histogram. The values given represent the edges of the bins (i.e. lower and upper bounds). If the list contains two values, this will be used to automatically create an appropriate bin range, with a constant width given by the second value, and starting from the first value.
float_count : bool (optional, default: True)
    If True, the counts in each histogram bin will be returned as floats. If False, they will be returned as integers.

Returns:
counts : ndarray
    The bin counts.
bins : ndarray
    The bin edges.

According to the excerpt, this returns the bin counts and the bin edges as ndarrays, which you can use directly to generate the histogram.
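
The excerpt describes a return value made of bin counts plus bin edges. As an illustration of that shape only, here is a sketch that uses numpy.histogram as a stand-in, since it returns the same kind of pair; it does not call graph-tool, and path_lengths is made-up data.

```python
import numpy as np

# Made-up path lengths; not produced by graph-tool.
path_lengths = np.array([0, 1, 1, 2, 2, 2, 3])

# Unit-width bins covering 0..3, i.e. edges 0, 1, 2, 3, 4.
edges = np.arange(0, path_lengths.max() + 2)
counts, bin_edges = np.histogram(path_lengths, bins=edges)

# Pair each bin's lower edge with its count.
for lower, count in zip(bin_edges[:-1], counts):
    print(int(lower), int(count))
```

There is always one more edge than there are counts, because each bin is bounded by two consecutive edges.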
