Representation of a simple undirected graph

I need your expertise:

I am about to implement a graph class in C++ and am thinking about the right representation. The graphs are simple and undirected. The number of vertices is at most 1000 for now but may grow in the future. The number of edges is up to 200k, possibly higher. Each vertex has a color (int) and an id (int). Edges carry no information beyond which two vertices they connect.

I store the graph and only need to query whether x and y are connected, but I need that very often. After initialisation I never add or remove vertices or edges (N = number of vertices and M = number of edges are given from the start)!

The one representation that is already available to me:

An adjacency list rolled out into one long list. Along with this representation goes an array with the starting index for each vertex. Storage is O(2M), and checking for an edge between x and y takes O(M/N) on average.
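The flattened adjacency list described here could look roughly like the sketch below. The type name FlatAdjacency and its members are made up for illustration, and sorting each vertex's slice is an extra assumption (it is not stated in the question) that lets the lookup use a binary search instead of a linear scan.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper type: all adjacency lists concatenated into one array.
struct FlatAdjacency {
    std::vector<std::size_t> start;  // start[v] .. start[v + 1] delimit v's neighbours
    std::vector<int> neighbours;     // 2 * M entries, one per edge direction

    FlatAdjacency(int n, const std::vector<std::pair<int, int>>& edges)
        : start(n + 1, 0), neighbours(2 * edges.size()) {
        // Count degrees, shifted by one so that a prefix sum yields the offsets.
        for (const auto& e : edges) {
            assert(0 <= e.first && e.first < n && 0 <= e.second && e.second < n);
            ++start[e.first + 1];
            ++start[e.second + 1];
        }
        for (int v = 0; v < n; ++v) start[v + 1] += start[v];

        // Write every edge into the slices of both endpoints.
        std::vector<std::size_t> next(start.begin(), start.end() - 1);
        for (const auto& e : edges) {
            neighbours[next[e.first]++] = e.second;
            neighbours[next[e.second]++] = e.first;
        }
        // Assumption: sort each slice so that lookups can binary-search it.
        for (int v = 0; v < n; ++v)
            std::sort(neighbours.begin() + start[v], neighbours.begin() + start[v + 1]);
    }

    // O(log(degree of x)) thanks to the sorted slice.
    bool connected(int x, int y) const {
        return std::binary_search(neighbours.begin() + start[x],
                                  neighbours.begin() + start[x + 1], y);
    }
};
```

With something like FlatAdjacency g(5, {{0, 1}, {1, 2}}), the call g.connected(1, 0) would search only the slice belonging to vertex 1.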

A representation I thought of:

The idea is to roll out the adjacency matrix, instead of the adjacency list, into one array. So storage is O(N^2)? Yes, but I want to store an edge in one bit instead of one byte (actually 2 bits, because of symmetry). Example: say N=8, then create a vector<uint8_t> of length 8 (64 bits). Initialise each entry to 0. If there is an edge between vertex 3 and vertex 5, then add pow(2,5) to the entry of my vector belonging to vertex 3, and symmetrically pow(2,3) to the entry of vertex 5. So there is a 1 in the entry of vertex 3 at the position of vertex 5 exactly when there is an edge between 3 and 5. After inserting my graph into this data structure, I think one should be able to check adjacency in constant time with just one binary operation: are 3 and 5 connected? Yes if (v[3] & (1 << 5)) != 0. When there are more than 8 vertices, every vertex needs more than one entry in the vector, and I need one modulo and one division operation to access the correct spot.
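A minimal sketch of this bit-packed matrix for arbitrary N is given below. The class name BitMatrix is hypothetical, and it uses 64-bit words rather than uint8_t, which is an assumption made for convenience; the division and modulo pick the word and the bit inside it, and the membership test is a shift plus AND.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical class: an N x N adjacency matrix stored one bit per entry.
class BitMatrix {
public:
    explicit BitMatrix(std::size_t n)
        : n_(n), words_per_row_((n + 63) / 64), bits_(n * words_per_row_, 0) {}

    // Undirected edge: set both (x, y) and (y, x).
    void add_edge(std::size_t x, std::size_t y) {
        set(x, y);
        set(y, x);
    }

    // One division, one modulo, one shift and one AND: O(1).
    bool connected(std::size_t x, std::size_t y) const {
        assert(x < n_ && y < n_);
        return (bits_[x * words_per_row_ + y / 64] >> (y % 64)) & 1u;
    }

private:
    void set(std::size_t row, std::size_t col) {
        assert(row < n_ && col < n_);
        bits_[row * words_per_row_ + col / 64] |= std::uint64_t{1} << (col % 64);
    }

    std::size_t n_;
    std::size_t words_per_row_;        // each row is padded to whole 64-bit words
    std::vector<std::uint64_t> bits_;  // all rows in one contiguous array
};
```

Padding every row to a whole number of words wastes a few bits per row but keeps the index arithmetic simple; for N = 1000 each row takes 16 words, so the whole matrix is 128,000 bytes.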

What do you think of the second solution? Is it maybe already known and in use? Am I wrong in assuming O(1) access? Is it too much effort for no real performance improvement?

The reason for loading both representations into one big list is improved cache behaviour, I was told.

I am happy to get some feedback on this idea. I might be way off, so please be kind in that case :D

A 1000x1000 matrix with 200,000 edges will be less than half full. Since the graph is undirected, each edge is written into the matrix twice:

VertexA -> VertexB   and   VertexB -> VertexA

You will end up filling 40% of the matrix (400,000 of 1,000,000 entries); the rest will be empty.


Edges

The best approach I can think of here is to use a 2D vector of booleans:

std::vector<std::vector<bool>> matrix(1000, std::vector<bool>(1000, false));

The lookup takes O(1) time, and std::vector<bool> typically saves space by using a single bit for each boolean value. You will end up using about 1 Mbit, or roughly 125 kB, of memory for the bits themselves.

The storage is not necessarily an array of bool values, but the library implementation may optimize storage so that each value is stored in a single bit.

This will allow you to check for an edge like this:

if( matrix[3][5] )
{
    // vertices 3 and 5 are connected
}
else
{
    // vertices 3 and 5 are not connected
}
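For completeness, filling such a matrix from a list of edges might look like the following sketch. The helper name make_matrix and the representation of edges as pairs of indices are assumptions for illustration; the important detail is that every edge is written in both directions.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<bool>>;

// Hypothetical helper: build the symmetric adjacency matrix from (a, b) pairs.
Matrix make_matrix(std::size_t n,
                   const std::vector<std::pair<std::size_t, std::size_t>>& edges) {
    Matrix matrix(n, std::vector<bool>(n, false));
    for (const auto& e : edges) {
        assert(e.first < n && e.second < n);
        matrix[e.first][e.second] = true;  // a -> b
        matrix[e.second][e.first] = true;  // b -> a, since the graph is undirected
    }
    return matrix;
}
```

After Matrix matrix = make_matrix(1000, {{3, 5}});, both matrix[3][5] and matrix[5][3] are true.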

Vertices

If the id values of the vertices form a contiguous range of ints (e.g. 0,1,2,3,...,999), then you could store the color information in a std::vector<int>, which has O(1) access time:

std::vector<int> colors(1000);

This would use up memory equal to:

1000 * sizeof(int) = 4000 B ~ 4 kB (3.9 kB)

On the other hand, if the id values don't form a contiguous range of ints, it might be a better idea to use a std::unordered_map<int, int>, which will give you O(1) lookup time on average.

std::unordered_map<int, int> map;

So, e.g., to store and look up the color of vertex 4:

map[4] = 5;            // assign color 5 to vertex 4
std::cout << map[4];   // prints 5
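One caveat with the lookup above: operator[] on std::unordered_map inserts a default-constructed value when the key is missing, so reading the color of an unknown id silently adds that id to the map. A read-only lookup could be sketched with find instead; the helper name color_of is hypothetical, and std::optional assumes C++17.

```cpp
#include <cassert>
#include <optional>
#include <unordered_map>

// Hypothetical helper: the color of a vertex id, or an empty optional if the
// id is not in the map. Unlike operator[], find() never modifies the map.
std::optional<int> color_of(const std::unordered_map<int, int>& colors, int id) {
    const auto it = colors.find(id);
    if (it == colors.end()) return std::nullopt;
    return it->second;
}
```

Taking the map by const reference also makes it impossible to call operator[] by accident, since that operator is not available on a const map.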

The amount of memory used by std::unordered_map<int, int> will be at least the following for the keys and values alone (the per-node and bucket overhead of the container comes on top):

1000 * 2 * sizeof(int) = 8000 B ~ 8 kB (7.81 kB)

All together, for edges:

Type                              Memory    Access time
std::vector<std::vector<bool>>    125 kB    O(1)

and for vertices:

Type                            Memory    Access time
std::vector<int>                3.9 kB    O(1)
std::unordered_map<int, int>    7.8 kB    O(1) on average

If you go for a bit matrix, the memory usage is O(V^2), so about 1 Mbit or roughly 125 KB, of which slightly less than half is redundant because the matrix is symmetric.

If you instead build an array of the edges, O(E), plus an index array that maps each vertex to the first of its edges, you use 200K * sizeof(int), or 800 KB, which is much more. Half of that is also duplicated (A-B and B-A are the same edge), and here the duplicates really can be dropped. Likewise, if you know (or can template on the fact) that every vertex number fits in a uint16_t, half can be saved again.

To save that half, check which of the two vertices has the lower number and search only its edges.

To find out where to stop searching, use the index entry of the next vertex.
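The halved edge list described in the last few paragraphs could be sketched as follows. The type name HalfEdgeList and its members are invented for illustration, and the sketch assumes a simple graph (no self-loops, no duplicate edges) whose vertex numbers fit in a uint16_t.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical type: every edge is stored once, in the slice of its
// lower-numbered endpoint.
struct HalfEdgeList {
    std::vector<std::uint32_t> first;   // first[v] .. first[v + 1] delimit v's slice
    std::vector<std::uint16_t> higher;  // neighbours with a larger number, M entries

    HalfEdgeList(std::size_t n, std::vector<std::pair<std::uint16_t, std::uint16_t>> edges)
        : first(n + 1, 0) {
        assert(n <= 65536);
        // Normalise every edge so that the lower-numbered vertex comes first.
        for (auto& e : edges) {
            if (e.first > e.second) std::swap(e.first, e.second);
            assert(e.first != e.second && e.second < n);
        }
        std::sort(edges.begin(), edges.end());  // by lower endpoint, then higher endpoint
        higher.reserve(edges.size());
        for (const auto& e : edges) {
            ++first[e.first + 1];
            higher.push_back(e.second);
        }
        for (std::size_t v = 0; v < n; ++v) first[v + 1] += first[v];
    }

    // Search only the slice of the lower-numbered vertex; the slice ends
    // where the next vertex's slice begins.
    bool connected(std::uint16_t x, std::uint16_t y) const {
        const auto lo = std::min(x, y), hi = std::max(x, y);
        return std::binary_search(higher.begin() + first[lo],
                                  higher.begin() + first[lo + 1], hi);
    }
};
```

With 200K edges this stores 200K two-byte entries (400 KB) plus the small index array, which is the "save half, then half again" arithmetic from the text.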

So with your numbers it is fine, even good, to use a bit matrix.

The first problem comes when (V^2)/8 > (E*4), though the binary search in the edge-list approach would still be much slower than checking a bit. If we set E = V * 200 (the ratio of 1000 vertices to 200K edges), that happens when:

V*V/8 > V*200*4
V/8 > 200*4
V > 200*4*8 = 6400

That would be 5,120,000 bytes, about 5 MB, which easily fits into an L3 cache nowadays. If the connectivity (here the average number of connections per vertex) is higher than 200, so much the better.
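The break-even arithmetic can be double-checked at compile time. The two function names below are made up for this sketch, and the 4 bytes per edge is the same assumption as in the estimate above.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for a V x V bit matrix (one bit per ordered pair of vertices).
constexpr std::uint64_t bit_matrix_bytes(std::uint64_t v) { return v * v / 8; }

// Bytes for an edge list that spends 4 bytes per edge.
constexpr std::uint64_t edge_list_bytes(std::uint64_t e) { return e * 4; }

static_assert(bit_matrix_bytes(1000) == 125000, "1000 vertices: 125 kB");
static_assert(edge_list_bytes(200000) == 800000, "200K edges: 800 kB");
static_assert(bit_matrix_bytes(6400) == edge_list_bytes(6400 * 200),
              "both take 5,120,000 bytes at V = 6400 with E = 200 * V");
```

Below V = 6400 (with E = 200 * V) the bit matrix is the smaller of the two; above it the edge list is.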

Checking the edge list will also cost about lg2(connectivity) * K, with K accounting for branch mispredicts, which gets rather steep. Checking the bit matrix would be O(1).

You would need to measure, among other things, when the bit matrix significantly outgrows the L3 cache while the edge list still fits in it, and when it spills over into virtual memory.

In other words, with high connectivity the bit matrix should beat the edge list; with much lower connectivity or a much higher number of vertices, the edge list might be faster.
