Does it Make Sense to Map a Graph Data-structure into a Relational Database?

Specifically, a multigraph.

A colleague suggested this and I'm completely baffled.

Any insights on this?

It's pretty straightforward to store a graph in a database: you have a table for nodes, and a table for edges, which acts as a many-to-many relationship table between the nodes table and itself. Like this:

create table node (
  id integer primary key
);

create table edge (
  start_id integer references node,
  end_id integer references node,
  primary key (start_id, end_id)
);

However, there are a couple of sticky points about storing a graph this way.

Firstly, the edges in this scheme are naturally directed - the start and end are distinct. If your edges are undirected, then you will either have to be careful in writing queries, or store two entries in the table for each edge, one in each direction (and then be careful writing queries!). If you store a single edge, I would suggest normalising the stored form - perhaps always consider the node with the lowest ID to be the start (and add a check constraint to the table to enforce this). You could have a genuinely unordered representation by not having the edges refer to the nodes, but rather having a join table between them, but that doesn't seem like a great idea to me.
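As a rough sketch of the "store a single edge in normalised form" option, here is how it might look using Python's sqlite3 module around the same schema. The helper names (open_graph, add_undirected_edge, neighbours) are made up for illustration, and the strict `<` in the check constraint is an assumption that self-loops are not wanted; use `<=` if they are.

```python
import sqlite3

def open_graph():
    """Create an in-memory database holding an undirected graph."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript("""
        create table node (
          id integer primary key
        );
        create table edge (
          start_id integer not null references node,
          end_id integer not null references node,
          primary key (start_id, end_id),
          check (start_id < end_id)  -- normalised form; also forbids self-loops
        );
    """)
    return conn

def add_undirected_edge(conn, a, b):
    """Store the edge {a, b} once, with the lower ID as the start."""
    low, high = min(a, b), max(a, b)
    conn.execute("insert into edge (start_id, end_id) values (?, ?)", (low, high))

def neighbours(conn, node_id):
    """Check both columns, since the node may sit on either end of an edge."""
    rows = conn.execute(
        "select end_id from edge where start_id = ? "
        "union select start_id from edge where end_id = ?",
        (node_id, node_id),
    )
    return sorted(row[0] for row in rows)

conn = open_graph()
conn.executemany("insert into node (id) values (?)", [(1,), (2,), (3,)])
add_undirected_edge(conn, 3, 1)  # stored as (1, 3)
add_undirected_edge(conn, 2, 3)
print(neighbours(conn, 3))
```

The neighbours query is the "be careful writing queries" part: even with a normalised form, a node can appear in either column.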

Secondly, the schema above has no way to represent a multigraph, because the primary key on (start_id, end_id) allows only one edge per pair of nodes. You can extend it easily enough to do so; if edges between a given pair of nodes are indistinguishable, the simplest thing would be to add a count to each edge row, saying how many edges there are between the referred-to nodes. If they are distinguishable, then you will need to add something to the edge table to allow them to be distinguished - an autogenerated edge ID might be the simplest thing.
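Both multigraph variants could be sketched roughly like this, again with Python's sqlite3. The table names edge_counted and edge_identified, the edge_count column and the helper functions are all invented for the example; in practice you would pick one of the two tables, not both.

```python
import sqlite3

def create_multigraph_schema(conn):
    conn.executescript("""
        create table node (
          id integer primary key
        );
        -- Indistinguishable parallel edges: one row per node pair plus a count.
        create table edge_counted (
          start_id integer not null references node,
          end_id integer not null references node,
          edge_count integer not null check (edge_count > 0),
          primary key (start_id, end_id)
        );
        -- Distinguishable parallel edges: each edge gets its own generated ID.
        create table edge_identified (
          id integer primary key,
          start_id integer not null references node,
          end_id integer not null references node
        );
    """)

def add_counted_edge(conn, start_id, end_id):
    """Bump the count if the pair already has a row, otherwise create it."""
    cur = conn.execute(
        "update edge_counted set edge_count = edge_count + 1 "
        "where start_id = ? and end_id = ?",
        (start_id, end_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "insert into edge_counted (start_id, end_id, edge_count) values (?, ?, 1)",
            (start_id, end_id),
        )

def add_identified_edge(conn, start_id, end_id):
    """Insert a new edge row and return its generated ID."""
    cur = conn.execute(
        "insert into edge_identified (start_id, end_id) values (?, ?)",
        (start_id, end_id),
    )
    return cur.lastrowid
```

With the ID variant the node pair is no longer part of the primary key, so any number of rows may join the same two nodes.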

However, even having sorted out the storage, you have the problem of working with the graph. If you want to do all of your processing on objects in memory, and the database is purely for storage, then no problem. But if you want to do queries on the graph in the database, then you'll have to figure out how to do them in SQL, which doesn't have any inbuilt support for graphs, and whose basic operations aren't easily adapted to work with graphs. It can be done, especially if you have a database with recursive SQL support (PostgreSQL, Firebird, some of the proprietary databases), but it takes some thought. If you want to do this, my suggestion would be to post further questions about the specific queries.
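To give a flavour of what a recursive query looks like, here is a small reachability sketch. It runs against SQLite (which also supports recursive common table expressions) through Python's sqlite3, purely because that is easy to try out; the syntax in other databases may differ slightly, and the names make_graph, reachable_from and REACHABLE_SQL are made up for the example.

```python
import sqlite3

# All nodes reachable from a starting node by following directed edges.
# "union" (rather than "union all") drops duplicates, so cycles terminate.
REACHABLE_SQL = """
    with recursive reachable(id) as (
        select ?
      union
        select edge.end_id
        from edge join reachable on edge.start_id = reachable.id
    )
    select id from reachable order by id
"""

def make_graph(edges):
    """Build the node/edge tables in memory from (start, end) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table node (
          id integer primary key
        );
        create table edge (
          start_id integer references node,
          end_id integer references node,
          primary key (start_id, end_id)
        );
    """)
    node_ids = sorted({n for pair in edges for n in pair})
    conn.executemany("insert into node (id) values (?)", [(n,) for n in node_ids])
    conn.executemany("insert into edge (start_id, end_id) values (?, ?)", edges)
    return conn

def reachable_from(conn, start_id):
    return [row[0] for row in conn.execute(REACHABLE_SQL, (start_id,))]

conn = make_graph([(1, 2), (2, 3), (3, 1), (4, 1)])  # 1 -> 2 -> 3 -> 1 is a cycle
print(reachable_from(conn, 4))
```

Anything beyond plain reachability (shortest paths, weights, path listings) needs more state carried through the recursion, which is where the "takes some thought" part comes in.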

It's an acceptable approach. You need to consider how that information will be manipulated. More than likely you'll need a language separate from your database to do the kinds of graph-related computations this type of data implies. Skiena's Algorithm Design Manual has an extensive section on graph data structures and their manipulation.
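As an illustration of doing the computation outside the database, the following sketch pulls the edge rows into an in-memory adjacency list and runs a breadth-first search over it. It assumes an edge table with start_id and end_id columns and directed edges; load_adjacency and shortest_path are hypothetical helper names, not part of any library.

```python
import sqlite3
from collections import defaultdict, deque

def load_adjacency(conn):
    """Read every edge row into a dict mapping node ID -> list of successors."""
    adjacency = defaultdict(list)
    for start_id, end_id in conn.execute("select start_id, end_id from edge"):
        adjacency[start_id].append(end_id)
    return adjacency

def shortest_path(adjacency, source, target):
    """Breadth-first search; returns a list of node IDs, or None if unreachable."""
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in adjacency.get(current, ()):
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None

conn = sqlite3.connect(":memory:")
conn.execute("create table edge (start_id integer, end_id integer)")
conn.executemany("insert into edge values (?, ?)", [(1, 2), (2, 3), (1, 3), (3, 4)])
print(shortest_path(load_adjacency(conn), 1, 4))
```

This only makes sense when the graph fits comfortably in memory; the database is then purely the storage layer.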

Without considering what types of queries you might execute, start with two tables, vertices and edges. Vertices are simple: an identifier and a name. Edges are more complex, given the multigraph. Each edge should be uniquely identified by a combination of two vertices (i.e. foreign keys) and some additional information. The additional information depends on the problem you're solving. For instance, for flight information it would be the departure and arrival times and the airline. Furthermore, you'll need to decide whether the edge is directed (i.e. one way) or not, and keep track of that information as well.
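For the flight case, the two tables might be sketched like this. All the column names, the choice of key columns and the sample rows are made up for illustration; each flight is treated as a directed edge, and timestamps are stored as ISO-8601 text because SQLite has no dedicated date type.

```python
import sqlite3

def create_flight_schema(conn):
    conn.executescript("""
        create table vertices (
          id integer primary key,
          name text not null
        );
        create table edges (
          origin_id integer not null references vertices,
          destination_id integer not null references vertices,
          airline text not null,
          departure_time text not null,
          arrival_time text not null,
          -- two vertices plus extra information identify one edge
          primary key (origin_id, destination_id, airline, departure_time)
        );
    """)

def flights_between(conn, origin_id, destination_id):
    """All parallel edges from one vertex to another, earliest departure first."""
    return conn.execute(
        "select airline, departure_time, arrival_time from edges "
        "where origin_id = ? and destination_id = ? order by departure_time",
        (origin_id, destination_id),
    ).fetchall()

conn = sqlite3.connect(":memory:")
create_flight_schema(conn)
conn.executemany("insert into vertices values (?, ?)", [(1, "A"), (2, "B")])
conn.executemany(
    "insert into edges values (?, ?, ?, ?, ?)",
    [
        (1, 2, "X Air", "2024-01-01T08:00", "2024-01-01T10:00"),  # invented data
        (1, 2, "Y Air", "2024-01-01T09:30", "2024-01-01T11:45"),
    ],
)
print(flights_between(conn, 1, 2))
```

Which extra columns belong in the key is a modelling decision; here two flights between the same pair of vertices are distinct as long as the airline or the departure time differs.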

Depending on the computation, you may end up with a problem that's better solved with some sort of artificial intelligence / machine learning algorithm, for instance finding optimal flights. The book Programming Collective Intelligence has some useful algorithms for this purpose. But where the data is kept doesn't change the algorithm itself.

Well, the information has to be stored somewhere, and a relational database isn't a bad idea.

It would just be a many-to-many relationship: a table holding the list of nodes, and a table holding the list of edges/connections.

Consider how Facebook might implement the social graph in their database. They might have a table for people and another table for friendships. The friendships table has at least two columns, each being a foreign key to the table of people.

Since friendship is symmetric (on Facebook), they might ensure that the ID for the first foreign key is always less than the ID for the second foreign key. Twitter has a directed graph for its social network, so it wouldn't use a canonical representation like that.
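A small sketch of the contrast, using Python's sqlite3: a symmetric friendships table with the canonical ordering enforced by a check constraint, next to a directed follows table where the order of the two columns carries meaning. This is not how either company actually stores its data; the follows table, the column names and the helper functions are all hypothetical.

```python
import sqlite3

def create_social_schema(conn):
    conn.executescript("""
        create table people (
          id integer primary key,
          name text not null
        );
        -- Symmetric relationship: stored once, lower ID first.
        create table friendships (
          person_a integer not null references people,
          person_b integer not null references people,
          primary key (person_a, person_b),
          check (person_a < person_b)
        );
        -- Directed relationship: (1, 2) and (2, 1) mean different things.
        create table follows (
          follower_id integer not null references people,
          followee_id integer not null references people,
          primary key (follower_id, followee_id)
        );
    """)

def are_friends(conn, x, y):
    """Normalise the pair the same way the table does before looking it up."""
    a, b = min(x, y), max(x, y)
    row = conn.execute(
        "select 1 from friendships where person_a = ? and person_b = ?", (a, b)
    ).fetchone()
    return row is not None

def is_following(conn, follower_id, followee_id):
    """No normalisation here: the argument order is the edge direction."""
    row = conn.execute(
        "select 1 from follows where follower_id = ? and followee_id = ?",
        (follower_id, followee_id),
    ).fetchone()
    return row is not None
```

The lookup for the symmetric table has to apply the same ordering rule as the constraint, otherwise half of the queries would silently miss.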

