Converting graph traversal to multiprocessing in Python

I've been working on a graph traversal algorithm over a simple network and I'd like to run it using multiprocessing since it is going to require a lot of I/O-bound calls when I scale it over the full network. The simple version runs pretty fast:

already_seen = {}
already_seen_get = already_seen.get

GH_add_node = GH.add_node
GH_add_edge = GH.add_edge
GH_has_node = GH.has_node
GH_has_edge = GH.has_edge


def graph_user(user, depth=0):
    logger.debug("Searching for %s", user)
    logger.debug("At depth %d", depth)
    users_to_read = followers = following = []

    if already_seen_get(user):
        logging.debug("Already seen %s", user)
        return None

    result = [x.value for x in list(view[user])]

    if result:
        result = result[0]
        following = result['following']
        followers = result['followers']
        users_to_read = set().union(following, followers)

    if not GH_has_node(user):
        logger.debug("Adding %s to graph", user)
        GH_add_node(user)

    for follower in users_to_read:
        if not GH_has_node(follower):
            GH_add_node(follower)
            logger.debug("Adding %s to graph", follower)
            if depth < max_depth:
                graph_user(follower, depth + 1)

        if GH_has_edge(follower, user):
            GH[follower][user]['weight'] += 1
        else:
            GH_add_edge(user, follower, {'weight': 1})

It's actually significantly faster than my multiprocessing version:

to_write = Queue()
to_read = Queue()
to_edge = Queue()
already_seen = Queue()


def fetch_user():
    seen = {}
    read_get = to_read.get
    read_put = to_read.put
    write_put = to_write.put
    edge_put = to_edge.put
    seen_get = seen.get

    while True:
        try:
            logging.debug("Begging for a user")

            user = read_get(timeout=1)
            if seen_get(user):
                continue

            logging.debug("Adding %s", user)
            seen[user] = True
            result = [x.value for x in list(view[user])]
            write_put(user, timeout=1)

            if result:
                result = result.pop()
                logging.debug("Got user %s and result %s", user, result)
                following = result['following']
                followers = result['followers']
                users_to_read = list(set().union(following, followers))

                [edge_put((user, x, {'weight': 1})) for x in users_to_read]

                [read_put(y, timeout=1) for y in users_to_read if not seen_get(y)]

        except Empty:
            logging.debug("Fetches complete")
            return


def write_node():
    users = []
    users_app = users.append
    write_get = to_write.get

    while True:
        try:
            user = write_get(timeout=1)
            logging.debug("Writing user %s", user)
            users_app(user)
        except Empty:
            logging.debug("Users complete")
            return users


def write_edge():
    edges = []
    edges_app = edges.append
    edge_get = to_edge.get

    while True:
        try:
            edge = edge_get(timeout=1)
            logging.debug("Writing edge %s", edge)
            edges_app(edge)
        except Empty:
            logging.debug("Edges Complete")
            return edges


if __name__ == '__main__':
    pool = Pool(processes=1)
    to_read.put(me)

    pool.apply_async(fetch_user)
    users = pool.apply_async(write_node)
    edges = pool.apply_async(write_edge)

    GH.add_weighted_edges_from(edges.get())
    GH.add_nodes_from(users.get())

    pool.close()
    pool.join()

What I can't figure out is why the single-process version is so much faster. In theory, the multiprocessing version should be reading and writing simultaneously. I suspect there is lock contention on the queues and that is the cause of the slowdown, but I don't really have any evidence of that. When I scale the number of fetch_user processes it seems to run faster, but then I have issues synchronizing the data seen across them. So some thoughts I've had are:

  • Is this even a good application for multiprocessing? I was originally using it because I wanted to be able to fetch from the db in parallel.
  • How can I avoid resource contention when reading and writing from the same queue?
  • Did I miss some obvious caveat for the design?
  • What can I do to share a lookup table between the readers so I don't keep fetching the same user twice? (See the sketch after this list for one idea I've been toying with.)
  • When increasing the number of fetching processes the writers eventually lock up. It looks like the write queue is not being written to, but the read queue is full. Is there a better way to handle this situation than with timeouts and exception handling?
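
For that lookup-table point, one idea I've been toying with is a multiprocessing.Manager dict shared between the fetchers. This is only a rough sketch (fetch_user_data and user_list are stand-ins for my real DB call and seed users) and I haven't measured whether the proxy overhead eats the benefit:

# Rough sketch: a Manager-backed dict as a "seen" table visible to every worker.
from multiprocessing import Manager, Pool


def fetch_user_data(user):
    # Stand-in for my real I/O-bound DB call.
    return {'followers': [], 'following': []}


def fetch(args):
    user, seen = args
    if seen.get(user):     # proxy dict: reads and writes go through the manager
        return None
    # Note: this get-then-set is not atomic, so two workers could still race
    # and fetch the same user twice.
    seen[user] = True
    return user, fetch_user_data(user)


if __name__ == '__main__':
    user_list = ['alice', 'bob', 'carol', 'dave']   # stand-in seed users
    with Manager() as manager:
        seen = manager.dict()
        with Pool(processes=4) as pool:
            results = [r for r in pool.map(fetch, [(u, seen) for u in user_list]) if r]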

Queues in Python are synchronized. This means that only one thread at a time can read or write, and that will definitely create a bottleneck in your app.
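
You can see that overhead for yourself by timing the same number of operations against a multiprocessing.Queue and against a plain list in a single process. This is just a rough sketch and the exact numbers will depend on your machine:

# Rough single-process timing sketch: every put/get on a multiprocessing.Queue
# goes through internal locks plus pickling, while list.append does neither.
import time
from multiprocessing import Queue

N = 50000

q = Queue()
start = time.time()
for i in range(N):
    q.put(i)   # synchronized: internal locks plus pickling through a pipe
    q.get()    # synchronized: internal locks plus unpickling
print("Queue put/get:", time.time() - start)

items = []
start = time.time()
for i in range(N):
    items.append(i)   # no locking, no serialization
print("list append:  ", time.time() - start)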

One better solution is to distribute the processing based on a hash function and assign the work to the threads with a simple modulo operation. So for instance, if you have 4 threads you could have 4 queues:

thread_queues = []
for i in range(4):
    thread_queues.append(Queue())

for user in user_list:
    user_hash = hash(user.user_id)  # hash() here is just a shortcut to some standard hash utility
    thread_id = user_hash % 4
    thread_queues[thread_id].put(user)

# From here on, your pool of threads accesses thread_queues, but each thread
# ONLY accesses one queue, based on a numeric id given to each of them.

Most hash functions will distribute your data evenly. I normally use UMAC, but maybe you can just try the hash function from the Python string implementation.
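
Put together, the partitioned version could look roughly like the sketch below. It is only a sketch: fetch_user_data and user_list are stand-ins for your DB call and your seed users, and I'm using processes here to match your setup, but the same idea works with threads.

# Sketch of per-worker queues: each worker drains only its own queue, so the
# workers never contend with each other on a shared queue lock.
from multiprocessing import Process, Queue
from queue import Empty

NUM_WORKERS = 4


def fetch_user_data(user):
    # Stand-in for the real I/O-bound DB call.
    return {'user': user, 'followers': [], 'following': []}


def worker(my_queue, results):
    while True:
        try:
            user = my_queue.get(timeout=1)
        except Empty:
            return
        results.put(fetch_user_data(user))


if __name__ == '__main__':
    user_list = ['alice', 'bob', 'carol', 'dave']   # stand-in seed users
    queues = [Queue() for _ in range(NUM_WORKERS)]
    results = Queue()

    # Partition the users: each user lands in exactly one worker's queue.
    for user in user_list:
        queues[hash(user) % NUM_WORKERS].put(user)

    procs = [Process(target=worker, args=(q, results)) for q in queues]
    for p in procs:
        p.start()

    # Drain the results before joining so the workers can flush their queues.
    fetched = [results.get() for _ in user_list]

    for p in procs:
        p.join()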

Another improvement would be to avoid the use of Queues and use a non-synchronized object, such as a list.
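
For example, you could split the user list into chunks up front and let each worker build a plain local list that the parent merges afterwards. Again, this is only a sketch, with fetch_user_data and user_list standing in for your DB call and seed users:

# Sketch of the queue-free variant: each worker appends to its own local list
# (no synchronization needed) and the parent merges the returned lists.
from multiprocessing import Pool

NUM_WORKERS = 4


def fetch_user_data(user):
    # Stand-in for the real I/O-bound DB call.
    return {'followers': [], 'following': []}


def fetch_chunk(users):
    edges = []                        # plain local list, never shared
    for user in users:
        data = fetch_user_data(user)
        for other in set(data['followers']) | set(data['following']):
            edges.append((user, other, 1))
    return edges


if __name__ == '__main__':
    user_list = ['alice', 'bob', 'carol', 'dave']   # stand-in seed users
    chunks = [user_list[i::NUM_WORKERS] for i in range(NUM_WORKERS)]

    with Pool(processes=NUM_WORKERS) as pool:
        chunk_results = pool.map(fetch_chunk, chunks)

    # Flatten into one list of (user, follower, weight) tuples, ready for
    # something like GH.add_weighted_edges_from(...).
    all_edges = [edge for chunk in chunk_results for edge in chunk]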
