
Read Graph from multiple files in IGraph (Python)

I have multiple node lists and edge lists which together form one large graph; let's call that the maingraph. My current strategy is to first read all the node lists and import them with add_vertices. Every node then gets an internal ID that depends on the order in which the nodes are ingested and is therefore not very reliable (as I understand it, if you delete a node, all IDs higher than the deleted one change). I assign every node a name attribute, which corresponds to the external ID I use so that I can keep track of my nodes between frameworks, and a type attribute.

Now, how do I add the edges? When I read an edge list, igraph starts building a new graph (subgraph) and hence starts the internal IDs at 0 again. Therefore, "merging" the graphs with maingraph.add_edges(subgraph.get_edgelist()) inevitably fails.

It is possible to work around this by using the name attribute of both maingraph and subgraph to find out which internal ID each edge's incident nodes have in the maingraph:

def _get_real_source_and_target_id(edge):
    """Take an edge from the to-be-added subgraph and return the IDs of the
    corresponding nodes in the maingraph, matched by their name attribute."""
    # maingraph and subgraph are graph objects defined at module level
    source_id = maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index
    target_id = maingraph.vs.select(name_eq=subgraph.vs[edge[1]]["name"])[0].index
    return (source_id, target_id)

And then I tried:

edgelist = [_get_real_source_and_target_id(x) for x in subgraph.get_edgelist()]
maingraph.add_edges(edgelist)

But that is horribly slow. The graph has millions of nodes and edges and takes 10 seconds to load with the fast but incorrect maingraph.add_edges(subgraph.get_edgelist()) approach. With the correct approach explained above, it takes minutes (I usually stop it after 5 minutes or so). I will have to do this tens of thousands of times. I switched from NetworkX to igraph because of the fast loading, but it doesn't really help if I have to do it like this.

Does anybody have a more clever way to do this? Any help is much appreciated!

Thanks!

Never mind, I figured out that the mistake was elsewhere. I used numpy.loadtxt() to read the node names as strings, which somehow did funny stuff when the names were incrementing numbers with more than five digits (see my issue report here). Therefore the above solution got stuck when it tried to look up the nodes whose names numpy had messed up. maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index simply sat there when it couldn't find the node. Now I use pandas to read the node names and it works fine.
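Since the real problem was node names that no longer matched, it may help to check the names before touching the graph at all. The following is only a minimal sketch in plain Python; the function name find_missing_names and the sample data are hypothetical and not part of igraph, numpy, or pandas. It assumes node names and edge endpoints have both been read as strings.

```python
def find_missing_names(node_names, edge_name_pairs):
    """Return the set of edge endpoint names that do not occur in node_names."""
    known = set(node_names)  # set membership tests are O(1) on average
    missing = set()
    for source, target in edge_name_pairs:
        if source not in known:
            missing.add(source)
        if target not in known:
            missing.add(target)
    return missing


# Hypothetical data: names are kept as strings throughout.
node_names = ["100000", "100001", "100002"]
edge_name_pairs = [("100000", "100001"), ("100001", "100003")]

# "100003" appears in an edge but not in the node list.
missing = find_missing_names(node_names, edge_name_pairs)
```

A non-empty result points to a mismatch between the node and edge files (or to a reader that altered the names), which is easier to diagnose up front than a lookup that never returns.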

The solution above is still ~10x faster than my previous NetworkX solution, so I will just leave it here in case it helps someone. Feel free to delete it otherwise.
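If the per-edge select() lookups ever become the bottleneck, one possible alternative is to build a name-to-index dictionary once and translate all edges through it. This is a sketch under assumptions, not what I actually ran: the helper remap_edges_by_name and the sample data are hypothetical, and it assumes that node names are unique within the maingraph.

```python
def remap_edges_by_name(main_names, sub_names, sub_edges):
    """Translate subgraph edges (pairs of subgraph-internal IDs) into pairs of
    maingraph-internal IDs by matching nodes on their names.

    main_names: node names of the main graph, ordered by internal ID
    sub_names:  node names of the subgraph, ordered by internal ID
    sub_edges:  list of (source_id, target_id) tuples in subgraph IDs
    """
    # Build the lookup once instead of searching the main graph for every edge.
    # Assumption: names are unique; with duplicates the last occurrence wins.
    index_of = {name: i for i, name in enumerate(main_names)}
    # A name missing from the main graph raises KeyError immediately.
    return [(index_of[sub_names[s]], index_of[sub_names[t]]) for s, t in sub_edges]


# Hypothetical data standing in for the two graphs.
main_names = ["a", "b", "c", "d"]
sub_names = ["c", "a", "d"]       # the subgraph numbers its nodes from 0 again
sub_edges = [(0, 1), (1, 2)]      # c-a and a-d in subgraph IDs

remapped = remap_edges_by_name(main_names, sub_names, sub_edges)
# remapped == [(2, 0), (0, 3)]
```

With igraph, the three arguments would presumably be maingraph.vs["name"], subgraph.vs["name"] and subgraph.get_edgelist(), and the result could be passed to maingraph.add_edges() in a single call.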
