Algorithm/approach to find a destination node exactly k edges from a source node in an undirected graph of a billion nodes, without cycles
Consider that I have an adjacency list for a billion nodes, stored in a hash table arranged as follows:

key = source node
value = hash_table {node1, node2, node3}
The input values come from a text file of the form:

from,to
1,2
1,5
1,11
... and so on

e.g. key = '1', value = {'2','5','11'} means node 1 is connected to nodes 2, 5 and 11.
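For concreteness, here is a minimal sketch of how that structure could be built from the text file, assuming the edges are comma-separated pairs as shown (the function name `load_adjacency` is my own, not from the question):

```python
from collections import defaultdict

def load_adjacency(path):
    """Build node -> set-of-neighbours from a 'from,to' text file.
    The graph is undirected, so each edge is recorded in both directions."""
    adj = defaultdict(set)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            src, dst = (part.strip() for part in line.split(','))
            adj[src].add(dst)
            adj[dst].add(src)  # undirected: mirror the edge
    return adj
```

With the sample input above, `adj['1']` would be the set `{'2', '5', '11'}`.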
I want to know an algorithm or approach to find a destination node exactly k edges away from a source node in an undirected graph of a billion nodes, without cycles.

For example, from node 1 I want to find node 50 only up to depth 3, i.e. within 3 edges. My assumption is that the algorithm finds 1 - 2 - 60 - 50, which is the shortest path, but how can the traversal be made efficient using the above adjacency list structure? I do not want to use Hadoop/MapReduce.
I came up with a naive solution in Python, shown below, but it is not efficient. The only advantage is that the hash table looks up a key in O(1), so I can search the neighbours, and their billion neighbours, directly by key. The algorithm still takes a lot of time.

Please suggest improvements. The algorithm, implemented similarly to BFS, takes more than 3 hours to search all the possible key-value relationships. Can that be reduced with another search method?
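The original code did not survive in this copy of the question. A reconstruction of the kind of depth-limited BFS being described might look like this (my sketch, not the asker's code; it returns the path if the target is reachable within k edges):

```python
from collections import deque

def find_within_k(adj, source, target, k):
    """BFS that stops after k edge hops.
    adj: dict mapping node -> iterable of neighbour nodes.
    Returns the path [source, ..., target] if one exists within
    k edges, else None."""
    if source == target:
        return [source]
    visited = {source}
    queue = deque([(source, [source])])
    while queue:
        node, path = queue.popleft()
        if len(path) - 1 >= k:  # already used k edges; do not go deeper
            continue
        for nbr in adj.get(node, ()):
            if nbr in visited:
                continue
            new_path = path + [nbr]
            if nbr == target:
                return new_path
            visited.add(nbr)
            queue.append((nbr, new_path))
    return None
```

On the example in the question, `find_within_k(adj, '1', '50', 3)` would return `['1', '2', '60', '50']`. The cost is one hash lookup per visited node, which is exactly what becomes expensive at a billion-node scale.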
As you've hinted, this will depend a lot on the data access characteristics of your system. If you were restricted to single-element accesses, then you'd be truly stuck, as trincot observes. However, if you can manage block accesses, then you have a chance of parallel operations.
However, I think that would be outside your control: the hash function owns the adjacency characteristics -- and, in fact, will likely "pessimize" (the opposite of "optimize") that characteristic.
I do see one possible hope: use iteration instead of recursion, maintaining a list of nodes to visit. When you place a new node on the list, get its hash value. If you can organize the nodes clustered by location, you can perhaps do a block transfer, accessing several values in one read operation.
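The iterative, level-by-level idea can be sketched as follows. Here `fetch_block` is a hypothetical batched read (e.g. a multi-get against whatever store holds the hash table) that returns the adjacency lists for many keys in one storage access, so each of the k levels costs one block operation rather than one read per node:

```python
def k_hop_frontier(fetch_block, source, k):
    """Expand the frontier one level at a time, batching lookups.
    fetch_block(keys) -> {key: neighbours} is assumed to fetch many
    adjacency lists in a single block access.
    Returns the nodes first reached at exactly k hops from source."""
    visited = {source}
    frontier = [source]
    for _ in range(k):
        adjacency = fetch_block(frontier)  # one batched access per level
        next_frontier = []
        for node in frontier:
            for nbr in adjacency.get(node, ()):
                if nbr not in visited:
                    visited.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
        if not frontier:
            break
    return frontier
```

To test whether node 50 is within 3 edges of node 1, you would check membership of `'50'` in the union of the first three frontiers. The win over a node-at-a-time BFS is purely in I/O: k block reads instead of one read per visited node.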