Graph Clustering for an Almost Clustered Graph by Removing Nodes (Vertices)

Original post: 2012-05-26 07:55:03. Tags: graph, mapreduce, social-networking, cluster-analysis, graph-algorithm

I want to carry out graph clustering on a huge undirected graph with millions of edges and nodes. The graph is almost clustered already: the different clusters are joined together only by a few nodes (ambiguous nodes that can relate to multiple clusters). There will be very few, or almost no, edges between two clusters. This problem is similar to finding a vertex cut set of a graph, with one exception: the graph needs to be partitioned into many components, and their number is unknown. (See this picture: https://docs.google.com/file/d/0B7_3zLD0XdtAd3ZwMFAwWDZuU00/edit?pli=1 )

It's almost like different strongly connected components sharing a couple of nodes between them, and I am supposed to remove those nodes to separate the components. The edges are weighted, but this problem is more about finding structure in a graph, so the edge weights won't be relevant. (Another way to think about the problem is to visualize solid spheres touching each other at some points, with the spheres being the strongly connected components and the touching points being the ambiguous nodes.)
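The simplest version of the sphere picture, where two clusters touch at exactly one node, can be sketched directly: such a node is an articulation point (cut vertex), and removing all of them leaves the clusters as separate connected components. The following is only a minimal sketch for that special case. The helper names `build_adj`, `articulation_points`, and `components_without` are hypothetical, and it does not handle clusters that share several nodes at once.

```python
from collections import deque

def build_adj(edges):
    """Build {node: set(neighbors)} from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def articulation_points(adj):
    """Cut vertices via an iterative low-link DFS (no recursion limit issues)."""
    disc, low, parent = {}, {}, {}
    cut = set()
    timer = 0
    for root in adj:
        if root in disc:
            continue
        parent[root] = None
        disc[root] = low[root] = timer
        timer += 1
        root_children = 0
        stack = [(root, iter(adj[root]))]
        while stack:
            node, nbrs = stack[-1]
            advanced = False
            for nxt in nbrs:  # resumes where this node's iterator stopped
                if nxt not in disc:
                    parent[nxt] = node
                    disc[nxt] = low[nxt] = timer
                    timer += 1
                    if node == root:
                        root_children += 1
                    stack.append((nxt, iter(adj[nxt])))
                    advanced = True
                    break
                elif nxt != parent[node]:
                    # Back edge: node can reach an earlier-discovered vertex.
                    low[node] = min(low[node], disc[nxt])
            if not advanced:
                stack.pop()
                par = parent[node]
                if par is not None:
                    low[par] = min(low[par], low[node])
                    # A non-root parent is a cut vertex if this subtree
                    # cannot reach above it.
                    if par != root and low[node] >= disc[par]:
                        cut.add(par)
        # The DFS root is a cut vertex only if it has several DFS children.
        if root_children > 1:
            cut.add(root)
    return cut

def components_without(adj, removed):
    """Connected components that remain after deleting the `removed` nodes."""
    seen = set(removed)
    comps = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps
```

For two triangles that share a single node, `articulation_points` returns just that node, and `components_without` returns the two remaining pairs.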

I am prototyping something, so I am quite short of time to study graph clustering algorithms myself and select the best one. Plus, I need a solution that cuts nodes and not edges, since in my case different clusters share nodes and not edges.

Is there any research paper or blog that addresses this or a related problem? Or can anyone come up with a solution to this problem, however dirty?

Since millions of nodes and edges are involved, I would need a MapReduce implementation of the solution. Any inputs or links for that too?

Is there any current open source implementation in MapReduce that I can use directly?

I think this problem is analogous to finding communities in online social networks by removing vertices.

2 Answers

Your problem is not so simple. I am afraid that it is related to the clique problem, which is NP-complete, so unless you somehow quantify the statement "there are almost no edges between the clusters", your problem might still be very difficult. What I would do in your shoes is try one dirty, greedy approach, namely treating the nodes as the following kind of quasi-neural net:

I would consider each vertex to have inputs, outputs, and a sigmoid activation function that converts the input value (the sum of its inputs) into the output value. The output value, and I consider this important, would not be cloned and sent to all the neighbors, but rather divided evenly between the neighbors. In addition to this, I would define a logarithmic decay of activity in each neuron (self-suppression, a suppressive connection to itself), controlled by a decay parameter that is global for the net.

Now, I would start the simulation with all the neurons at activity 0.5 (the activity range is 0 to 1) and a very high decay parameter, which would lead to all the neurons quickly stabilizing in the 0 state. I would then gradually decrease the decay parameter until the steady-state result yields the first clique with non-zero stable activity.
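The simulation described above could be sketched roughly as follows. This is only one possible reading: the answer does not spell out the form of the "logarithmic decay", so the sketch substitutes a simple subtractive term inside the sigmoid. The names `step`, `steady_state`, and `first_active_nodes`, the `gain` and `threshold` values, and the linear decay schedule are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    # Clamp the argument so math.exp cannot overflow.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))

def step(adj, activity, decay, gain=8.0):
    """One synchronous update of every node's activity."""
    incoming = {v: 0.0 for v in adj}
    for u, nbrs in adj.items():
        if not nbrs:
            continue
        # The output is divided evenly between neighbors, not cloned to each.
        share = activity[u] / len(nbrs)
        for v in nbrs:
            incoming[v] += share
    # Assumed decay form: a global term subtracted before the sigmoid.
    return {v: sigmoid(gain * (incoming[v] - decay)) for v in adj}

def steady_state(adj, decay, iters=200):
    """Start every node at 0.5 and iterate a fixed number of steps."""
    activity = {v: 0.5 for v in adj}
    for _ in range(iters):
        activity = step(adj, activity, decay)
    return activity

def first_active_nodes(adj, start_decay=5.0, sweeps=51, threshold=0.1):
    """Lower the decay linearly until some node keeps non-zero activity."""
    for k in range(sweeps):
        decay = start_decay * (1.0 - k / (sweeps - 1))
        activity = steady_state(adj, decay)
        active = {v for v, a in activity.items() if a > threshold}
        if active:
            return decay, active
    return 0.0, set()
```

Whether the first set of active nodes actually corresponds to one cluster depends on the graph and on the parameters, which is why some experimentation with real data is needed.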

The question is what to do next. One possibility is to subtract the found clique from the graph and run the same process again until we have found all the cliques. This greedy approach might succeed if your graph is indeed as well behaved (really almost clustered) as you say, but might lead to unexpected results otherwise. Another possibility is to give the found clique a unique "clique smell" that is repulsive to other cliques (mutual suppression) and rerun the algorithm until the second clique is found, give that one a different clique smell repulsive to all the others, and so on, until each node has its own assigned smell.
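The first option, subtracting each found cluster and repeating, might look like the sketch below. The detector is passed in as a function, so any procedure that returns one cluster per call can be plugged in. The names `peel_clusters` and `induced_subgraph` are hypothetical, and `component_of_first_node` is only a stand-in detector used to make the example runnable, not the activity-based method from the answer.

```python
def induced_subgraph(adj, keep):
    """Restrict {node: set(neighbors)} to the nodes in `keep`."""
    return {v: adj[v] & keep for v in keep}

def peel_clusters(adj, find_cluster):
    """Repeatedly find one cluster, remove it, and search the remainder."""
    remaining = set(adj)
    clusters = []
    while remaining:
        sub = induced_subgraph(adj, remaining)
        found = set(find_cluster(sub)) & remaining
        if not found:
            break  # the detector gave up; the rest stays unassigned
        clusters.append(found)
        remaining -= found
    return clusters, remaining

def component_of_first_node(sub):
    """Stand-in detector: the connected component of an arbitrary node."""
    start = next(iter(sub))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in sub[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen
```

Returning the unassigned remainder separately makes it easy to see when the greedy pass leaves nodes behind, which is one of the "unexpected results" to watch for.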

I think these are about all the big ideas I have on this.

The key is that, since it is probably not possible to solve this problem in the general case (it is likely NP-complete), you need to make use of whatever special properties your graph has. That means you need to play with the parameters for a while, until the algorithm solves 99% of the cases that you encounter. I don't think it is possible to give a numerically precise answer to your question without long experimentation on the actual datasets that you encounter.

"Since millions of nodes and edges are involved, I would need a MapReduce implementation of the solution. Any inputs or links for that too?"

In my experience, I doubt that using Map/Reduce here would be truly advantageous. First, a graph on the order of 10^6 nodes isn't really that large (especially one that is not hyper-connected, since you are considering clustering), and the overhead of using Map/Reduce for your problem will not be worth it unless you already have your hardware and software set up for it.

Map/Reduce will work much better once you have solved the clustering issue and want to process each cluster with a similar analysis, that is, when you can break your task into relatively isolated sub-tasks that can be performed in parallel. This can of course be cascaded over several layers.
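As a rough illustration of that pattern, the sketch below runs the same small analysis on each cluster in parallel (the map step) and then merges the partial results (the reduce step). A local thread pool stands in for real Map/Reduce workers here. The functions `cluster_stats`, `merge_stats`, and `analyse_clusters`, and the choice of node and edge counts as the analysis, are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def cluster_stats(args):
    """Map step: analyse one cluster independently of all the others."""
    adj, cluster = args
    # Each internal edge is counted once from each endpoint, hence // 2.
    internal_edges = sum(len(adj[v] & cluster) for v in cluster) // 2
    return {"nodes": len(cluster), "edges": internal_edges}

def merge_stats(a, b):
    """Reduce step: combine two partial results."""
    return {"nodes": a["nodes"] + b["nodes"], "edges": a["edges"] + b["edges"]}

def analyse_clusters(adj, clusters, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as the input clusters.
        per_cluster = list(pool.map(cluster_stats, [(adj, c) for c in clusters]))
    total = reduce(merge_stats, per_cluster, {"nodes": 0, "edges": 0})
    return per_cluster, total
```

The per-cluster step only reads its own cluster's edges, which is what makes the sub-tasks "relatively isolated" and therefore easy to distribute.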

In a relatively similar situation, I first modelled my graph in a graph database (I used Neo4J, and would recommend it highly) and then ran my analytics and queries on it. You will be surprised at how whiteboard-friendly this solution is, and even massively joined and connected queries execute near-instantaneously, especially at the scale of only a few million nodes. For example, you can do a filtered analysis based on degrees of separation, followed by a listing of what the nodes have in common.
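The two kinds of query mentioned at the end can be sketched on a plain in-memory adjacency map, independent of any particular graph database. The function names `within_hops` and `common_neighbors` are hypothetical, and this is not Neo4J's API, only an illustration of what such queries compute.

```python
from collections import deque

def within_hops(adj, start, max_hops):
    """Nodes within `max_hops` degrees of separation of `start`, with distances."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue  # do not expand past the hop limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def common_neighbors(adj, a, b):
    """Nodes directly connected to both `a` and `b`."""
    return (adj[a] & adj[b]) - {a, b}
```

On a graph of a few million nodes, both operations touch only a local neighborhood, which is consistent with the near-instant response described above.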