Converting graph traversal to multiprocessing in Python
I've been working on a graph traversal algorithm over a simple network, and I'd like to run it with multiprocessing since it is going to require a lot of I/O-bound calls when I scale it to the full network. The simple version runs pretty fast:
    already_seen = {}
    already_seen_get = already_seen.get

    GH_add_node = GH.add_node
    GH_add_edge = GH.add_edge
    GH_has_node = GH.has_node
    GH_has_edge = GH.has_edge

    def graph_user(user, depth=0):
        logger.debug("Searching for %s", user)
        logger.debug("At depth %d", depth)
        users_to_read = followers = following = []
        if already_seen_get(user):
            logger.debug("Already seen %s", user)
            return None
        result = [x.value for x in list(view[user])]
        if result:
            result = result[0]
            following = result['following']
            followers = result['followers']
            users_to_read = set().union(following, followers)
        if not GH_has_node(user):
            logger.debug("Adding %s to graph", user)
            GH_add_node(user)
        for follower in users_to_read:
            if not GH_has_node(follower):
                GH_add_node(follower)
                logger.debug("Adding %s to graph", follower)
            if depth < max_depth:
                graph_user(follower, depth + 1)
            if GH_has_edge(follower, user):
                GH[follower][user]['weight'] += 1
            else:
                GH_add_edge(user, follower, {'weight': 1})
It's actually significantly faster than my multiprocessing version:
    from multiprocessing import Pool, Queue
    from queue import Empty  # Queue.get raises this on timeout

    to_write = Queue()
    to_read = Queue()
    to_edge = Queue()
    already_seen = Queue()

    def fetch_user():
        seen = {}
        read_get = to_read.get
        read_put = to_read.put
        write_put = to_write.put
        edge_put = to_edge.put
        seen_get = seen.get
        while True:
            try:
                logging.debug("Begging for a user")
                user = read_get(timeout=1)
                if seen_get(user):
                    continue
                logging.debug("Adding %s", user)
                seen[user] = True
                result = [x.value for x in list(view[user])]
                write_put(user, timeout=1)
                if result:
                    result = result.pop()
                    logging.debug("Got user %s and result %s", user, result)
                    following = result['following']
                    followers = result['followers']
                    users_to_read = list(set().union(following, followers))
                    [edge_put((user, x, {'weight': 1})) for x in users_to_read]
                    [read_put(y, timeout=1) for y in users_to_read if not seen_get(y)]
            except Empty:
                logging.debug("Fetches complete")
                return

    def write_node():
        users = []
        users_app = users.append
        write_get = to_write.get
        while True:
            try:
                user = write_get(timeout=1)
                logging.debug("Writing user %s", user)
                users_app(user)
            except Empty:
                logging.debug("Users complete")
                return users

    def write_edge():
        edges = []
        edges_app = edges.append
        edge_get = to_edge.get
        while True:
            try:
                edge = edge_get(timeout=1)
                logging.debug("Writing edge %s", edge)
                edges_app(edge)
            except Empty:
                logging.debug("Edges complete")
                return edges

    if __name__ == '__main__':
        pool = Pool(processes=1)
        to_read.put(me)
        pool.apply_async(fetch_user)
        users = pool.apply_async(write_node)
        edges = pool.apply_async(write_edge)
        GH.add_weighted_edges_from(edges.get())
        GH.add_nodes_from(users.get())
        pool.close()
        pool.join()
What I can't figure out is why the single-process version is so much faster. In theory, the multiprocessing version should be reading and writing simultaneously. I suspect there is lock contention on the queues and that this is the cause of the slowdown, but I don't really have any evidence of it. When I scale up the number of fetch_user processes it seems to run faster, but then I have issues synchronizing the data seen across them. So some thoughts I've had are:
Queues in Python are synchronized. This means that only one thread at a time can read or write, and this will definitely create a bottleneck in your app.
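To see that per-item cost, a quick micro-benchmark (a sketch; absolute numbers vary by machine) can compare a plain, unsynchronized `collections.deque` with a `multiprocessing.Queue`, where every `put`/`get` pickles the item and passes through a lock-protected pipe:

```python
import time
from collections import deque
from multiprocessing import Queue

N = 10_000

# Plain deque: no locks, no serialization, no inter-process pipe.
d = deque()
t0 = time.perf_counter()
for i in range(N):
    d.append(i)
for _ in range(N):
    d.popleft()
plain = time.perf_counter() - t0

# multiprocessing.Queue: each item is pickled by a feeder thread
# and read back through a lock-protected OS pipe.
q = Queue()
t0 = time.perf_counter()
for i in range(N):
    q.put(i)
for _ in range(N):
    q.get()
synced = time.perf_counter() - t0

print(f"deque: {plain:.4f}s, Queue: {synced:.4f}s")
```

The gap is typically one to two orders of magnitude, which is why funneling every user and edge through shared queues can erase the gains from parallel fetching.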
A better solution is to distribute the processing based on a hash function and assign work to the threads with a simple modulo operation. So, for instance, if you have 4 threads you could have 4 queues:
    thread_queues = []
    for i in range(4):
        thread_queues.append(Queue())

    for user in user_list:
        user_hash = hash(user.user_id)  # hash here is just a shortcut to some standard hash utility
        thread_id = user_hash % 4
        thread_queues[thread_id].put(user)

    # From here on, your pool of threads accesses thread_queues, but each thread
    # ONLY accesses the one queue matching the numeric id assigned to it.
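To make that concrete, here is a runnable sketch of the whole pattern using threads and the standard-library `queue.Queue`; the `.upper()` call is just a stand-in for the real per-user work. Because each worker drains only its own queue, workers never contend with each other for the same lock:

```python
from queue import Queue, Empty
from threading import Thread

NUM_WORKERS = 4
work_queues = [Queue() for _ in range(NUM_WORKERS)]

def worker(my_queue, results):
    # Each worker owns exactly one queue and one result list.
    while True:
        try:
            user = my_queue.get(timeout=1)
        except Empty:
            return
        results.append(user.upper())  # stand-in for the real processing

# Route each item to a queue by hash, as above.
users = ["alice", "bob", "carol", "dave", "erin"]
for user in users:
    work_queues[hash(user) % NUM_WORKERS].put(user)

results = [[] for _ in range(NUM_WORKERS)]
threads = [Thread(target=worker, args=(q, r))
           for q, r in zip(work_queues, results)]
for t in threads:
    t.start()
for t in threads:
    t.join()

merged = [u for r in results for u in r]
print(sorted(merged))
```

Note that Python's built-in `hash()` for strings is randomized per interpreter run, so the assignment of users to queues changes between runs; the distribution stays roughly even either way.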
Most hash functions will distribute your data evenly. I normally use UMAC, but maybe you can just try the hash function from Python's string implementation.
Another improvement would be to avoid the use of Queues entirely and use a non-synchronized object, such as a list.