How to persist a graph data structure in a relational database?

I've considered creating a Vertices table and an Edges table, but would building graphs in memory and traversing sub-graphs require a large number of lookups? I'd like to avoid excessive database reads. Is there any other way of persisting a graph?

Side note: I've heard of Neo4j, but my question is really how to conceptually represent a graph in a standard database. I am open to some NoSQL solutions like MongoDB, though.

Unfortunately, the answer is that your reasoning is right on every point. To convert a graph data structure to a relational one, you have to store nodes (vertices) in one table and edges in another, with each edge referencing a FromNode and a ToNode. You are also right that this leads to a large number of lookups, because you cannot partition the graph into subgraphs that could be queried in one go. You have to traverse from node to edge to node to edge to node, and so on (recursively, whereas SQL works with sets).
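To make the lookup cost concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, sample data, and the `reachable` helper are assumptions made for illustration, not part of the original answer. It walks the graph node by node and counts how many edge queries it issues.

```python
import sqlite3
from collections import deque

# In-memory database with one table for nodes and one for edges (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE edges (
        from_node INTEGER NOT NULL REFERENCES nodes(id),
        to_node   INTEGER NOT NULL REFERENCES nodes(id)
    );
    CREATE INDEX idx_edges_from ON edges(from_node);
    INSERT INTO nodes VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4), (4, 1);
""")

def reachable(conn, start):
    """Breadth-first traversal that issues one edge lookup per visited node."""
    seen, queue, lookups = {start}, deque([start]), 0
    while queue:
        node = queue.popleft()
        rows = conn.execute(
            "SELECT to_node FROM edges WHERE from_node = ?", (node,)
        ).fetchall()
        lookups += 1
        for (nxt,) in rows:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, lookups

visited, lookups = reachable(conn, 1)
print(sorted(visited), lookups)
```

In this sketch the number of queries grows with the number of nodes visited, which is the behaviour the question is worried about.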

The point is...

Relational, graph-oriented, object-oriented, and document-based are different types of data structures that meet different requirements. That's what it's all about, and it is why so many different NoSQL databases (most of them simple document stores) came up: it simply makes no sense to organize big data in a relational way.

Alternative 1 - Graph oriented database

But there are also graph-oriented NoSQL databases that make the graph data model a first-class citizen, like OrientDB, which I am playing around with a little at the moment. The nice thing about it is that, although it persists data as a graph, it can still be used in a relational, object-oriented, or document-oriented way (i.e. by querying with plain old SQL). Nevertheless, traversing the graph is certainly the optimal way to get data out of it.

Alternative 2 - working with graphs in memory

When it comes to fast routing, routing frameworks like Graphhopper build up the complete graph (billions of nodes) in memory. Because Graphhopper uses a memory-mapped implementation of its GraphStore, that even works on Android devices with only a few MB of memory. The complete graph is read from the database into memory at startup, and routing is then done there, so you have no need to look up the database.

I faced this same issue and finally decided to go with the following structure, which requires two database queries; the rest of the work is done in memory:

Store nodes in a table and reference the graph with each node record:

Table Nodes

id  | title | graph_id
---------------------
105 | node1 | 2
106 | node2 | 2

Also store edges in another table, and again reference the graph these edges belong to with each edge:

Table Edges

id | from_node_id | to_node_id | graph_id
-----------------------------------------
1  | 105          | 106        | 2
2  | 106          | 105        | 2

Get all the nodes with one query, then get all the edges with another.

Now build your preferred in-memory representation of the graph (e.g., an adjacency list) and proceed with your application flow.
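The two-query approach could look roughly like this with Python's built-in sqlite3 module. The table layout and sample rows mirror the tables above; the `load_graph` helper and the choice of a dictionary-based adjacency list are assumptions for illustration.

```python
import sqlite3

# Recreate the two tables from the answer in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, title TEXT, graph_id INTEGER);
    CREATE TABLE edges (id INTEGER PRIMARY KEY, from_node_id INTEGER,
                        to_node_id INTEGER, graph_id INTEGER);
    INSERT INTO nodes VALUES (105, 'node1', 2), (106, 'node2', 2);
    INSERT INTO edges VALUES (1, 105, 106, 2), (2, 106, 105, 2);
""")

def load_graph(conn, graph_id):
    """Load one graph with exactly two queries and build an adjacency list."""
    # Query 1: every node of the graph, as {node_id: title}.
    titles = dict(conn.execute(
        "SELECT id, title FROM nodes WHERE graph_id = ?", (graph_id,)
    ))
    # Query 2: every edge of the graph, grouped by source node.
    adjacency = {node_id: [] for node_id in titles}
    for src, dst in conn.execute(
        "SELECT from_node_id, to_node_id FROM edges "
        "WHERE graph_id = ? ORDER BY id", (graph_id,)
    ):
        adjacency[src].append(dst)
    return titles, adjacency

titles, adjacency = load_graph(conn, 2)
print(titles)     # {105: 'node1', 106: 'node2'}
print(adjacency)  # {105: [106], 106: [105]}
```

After these two reads, any traversal runs against `adjacency` without touching the database again.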

I am going to disagree with the other posts here. If you have a special class of graphs with restrictions, you can often get away with a more specialized design (for example, a limited number of edges per vertex, only needing to traverse one way, etc.).

However, for storing an arbitrary graph, relational databases are an excellent choice. They're designed with an incredibly good set of tradeoffs that perform well in almost all situations. In addition, data needs tend to change over time, and a relational database lets you painlessly change the storage and lookup without changing the data representation.

Let's review your design:

  • one table for vertices (id, data)
  • one table for edges (startId, endId, data)

First, observe that the storage is efficient, as it is proportional to the data to store. If we have 10 vertices and 10 edges, we store 20 pieces of information.

Now, let's look at lookup. Assuming we have an index on vertex id, we can look up any data we want in at most O(log(n)) time (maybe better, depending on the index).

  • Given a node, tell me the edges leaving it
  • Given a node, tell me the edges entering it
  • Given an edge, tell me the node it came from or enters

Those are all the basic queries you need.
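As a sketch of those three lookups, again using Python's built-in sqlite3 module: the column names follow the design above, while the index names, helper functions, and sample rows are assumptions for illustration. The two indexes on the edges table are what keep the per-node lookups from scanning every edge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE edges (startId INTEGER, endId INTEGER, data TEXT);
    -- One index per traversal direction (index names are illustrative).
    CREATE INDEX idx_edges_start ON edges(startId);
    CREATE INDEX idx_edges_end ON edges(endId);
    INSERT INTO vertices VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO edges VALUES (1, 2, 'a->b'), (1, 3, 'a->c'), (3, 1, 'c->a');
""")

def edges_leaving(conn, vertex_id):
    """Given a node, return the edges leaving it."""
    return conn.execute(
        "SELECT startId, endId, data FROM edges "
        "WHERE startId = ? ORDER BY endId", (vertex_id,)
    ).fetchall()

def edges_entering(conn, vertex_id):
    """Given a node, return the edges entering it."""
    return conn.execute(
        "SELECT startId, endId, data FROM edges "
        "WHERE endId = ? ORDER BY startId", (vertex_id,)
    ).fetchall()

def edge_endpoints(conn, start_id, end_id):
    """Given an edge, return the data of the node it leaves and the node it enters."""
    return conn.execute(
        "SELECT s.data, e.data FROM edges "
        "JOIN vertices AS s ON s.id = edges.startId "
        "JOIN vertices AS e ON e.id = edges.endId "
        "WHERE edges.startId = ? AND edges.endId = ?", (start_id, end_id)
    ).fetchone()

print(edges_leaving(conn, 1))      # [(1, 2, 'a->b'), (1, 3, 'a->c')]
print(edges_entering(conn, 1))     # [(3, 1, 'c->a')]
print(edge_endpoints(conn, 3, 1))  # ('c', 'a')
```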

Now suppose you had a "graph database" that stores a list of edges leaving each vertex. This makes each vertex variable-sized. It is a little easier to traverse. But what if you want to traverse in the other direction? Now you have to store a list of edges entering each vertex as well. Now you have two copies of that information, and the database (or you, the developer) must do a lot of work to make sure they never get out of sync.
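The bookkeeping involved can be sketched in a few lines of Python. The `BidirectionalGraph` class below is a hypothetical illustration, not how any particular graph database is implemented: every edge lives in two places, so every write has to touch both.

```python
from collections import defaultdict

class BidirectionalGraph:
    """Keeps outgoing and incoming edge lists per vertex (two copies of each edge)."""

    def __init__(self):
        self.outgoing = defaultdict(set)  # vertex -> vertices it points to
        self.incoming = defaultdict(set)  # vertex -> vertices pointing to it

    def add_edge(self, src, dst):
        # Both copies must be updated together.
        self.outgoing[src].add(dst)
        self.incoming[dst].add(src)

    def remove_edge(self, src, dst):
        self.outgoing[src].discard(dst)
        self.incoming[dst].discard(src)

    def is_consistent(self):
        """True when both copies describe exactly the same set of edges."""
        forward = {(s, d) for s, targets in self.outgoing.items() for d in targets}
        backward = {(s, d) for d, sources in self.incoming.items() for s in sources}
        return forward == backward

g = BidirectionalGraph()
g.add_edge("a", "b")
g.add_edge("a", "c")
g.remove_edge("a", "b")
print(g.is_consistent())  # True

g.outgoing["x"].add("y")  # a write that updates only one copy
print(g.is_consistent())  # False
```

With a single edges table there is only one copy of each edge, and the two directions are served by two indexes that the database maintains itself.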

O(log(n)) vs O(1)

Relational database indices typically store data in a sorted form or, as others have pointed out, can also use a hash table. Even if you are stuck with a sorted index, it's going to perform very well.

First, note that big O measures scalability, not performance. Hashes can be slower than many loops for small data sets. Even though hashing at O(1) scales better, binary search at O(log(n)) is pretty darn good: you can search a billion records in 30 steps. In addition, it is cache and branch predictor friendly.
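The "30 steps" figure is easy to check. The sketch below is a plain binary search written for illustration (the `binary_search_steps` helper is not from any library); it uses `range(10**9)` as a stand-in for a billion sorted keys, so nothing is actually allocated.

```python
import math

def binary_search_steps(sorted_seq, target):
    """Return the insertion index of target and the number of halving steps taken."""
    lo, hi, steps = 0, len(sorted_seq), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_seq[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo, steps

records = range(10**9)  # a billion sorted keys, computed lazily
index, steps = binary_search_steps(records, 987_654_321)

print(index)                         # 987654321
print(steps <= 30)                   # True
print(math.ceil(math.log2(10**9)))   # 30
```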

Adding to the previous answers: MS SQL Server added support for graph architecture starting with SQL Server 2017.

It follows the described pattern of having Nodes and Edges tables (which should be created with the special "AS NODE" and "AS EDGE" keywords).

It also introduces a new MATCH keyword "to support pattern matching and traversal through the graph", like this (friend is the name of an edge table in the example below):

SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';

There is also a really good set of articles on SQL Server Graph Databases on the redgate Hub.
