Database to store sparse matrix

I have a very large and very sparse matrix, composed of only 0s and 1s. I then basically handle (row-column) pairs. I have at most 10k pairs per row/column.

My needs are the following:

  • Parallel insertion of (row-column) pairs

  • Quick retrieval of an entire row or column

  • Quick querying of the existence of a (row-column) pair

  • A Ruby client, if possible


Are there existing databases adapted to these kinds of constraints?

If not, what would get me the best performance:

  • A SQL database, with a table like this (a sketch follows this list):

row(indexed) | column(indexed)

(but the indexes would have to be constantly refreshed)

  • A NoSQL key-value store, with two tables like this:

row => columns ordered list

column => rows ordered list

(but with parallel insertion of elements into the lists)

  • Something else
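For concreteness, here is a minimal sketch of the first option; the names matrix_cells, idx_by_row and idx_by_col are purely illustrative:

    -- One record per 1 in the matrix, indexed both ways
    -- (any RDBMS dialect would look much the same).
    CREATE TABLE matrix_cells (
        row_id INT NOT NULL,
        col_id INT NOT NULL
    );
    CREATE INDEX idx_by_row ON matrix_cells (row_id);
    CREATE INDEX idx_by_col ON matrix_cells (col_id);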

Thanks for your help!

A sparse 0/1 matrix sounds to me like an adjacency matrix, which is used to represent a graph. Based on that, it is possible that you are trying to solve some graph problem, and a graph database would suit your needs.

Graph databases, like Neo4j, are very good for fast traversal of the graph, because retrieving the neighbors of a vertex takes O(number of neighbors of that vertex), so the cost is unrelated to the number of vertices in the whole graph. Neo4j is also transactional, so parallel insertion is not a problem. You can use the REST API wrapper in MRI Ruby, or a JRuby library for more seamless integration.

On the other hand, if you are trying to analyze the connections in the graph, and it would be enough to do that analysis once in a while and just make the results available, you could try your luck with a graph-processing framework based on Google Pregel. It's a little bit like MapReduce, but aimed at graph processing. There are already several open-source implementations of that paper.

However, if a graph database or graph-processing framework does not suit your needs, I recommend taking a look at HBase, which is an open-source, column-oriented data store based on Google BigTable. Its data model is in fact very similar to what you described (a sparse matrix), it has row-level transactions, and it does not require you to retrieve the whole row just to check whether a certain pair exists. There are some Ruby libraries for that database, but I imagine it would be safer to use JRuby rather than MRI for interacting with it.

If your matrix is really sparse (i.e. the nodes only have a few interconnections) then you would get reasonably efficient storage from an RDBMS such as Oracle, PostgreSQL or SQL Server. Essentially you would have a table with two fields (row, col) and an index or key each way.

Set up the primary key one way round (depending on whether you mostly query by row or by column) and create another index on the fields the other way round. This will only store data where a connection exists, and the size will be proportional to the number of edges in the graph.
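A minimal sketch of that design, assuming PostgreSQL-style syntax (the names adjacency and idx_col_row are illustrative):

    -- One record per 1 in the matrix; the composite primary key
    -- makes row-first lookups cheap.
    CREATE TABLE adjacency (
        row_id INT NOT NULL,
        col_id INT NOT NULL,
        PRIMARY KEY (row_id, col_id)
    );

    -- The reverse index makes column-first lookups just as cheap.
    CREATE INDEX idx_col_row ON adjacency (col_id, row_id);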

The indexes will allow you to efficiently retrieve either a row or a column, and they will always be in sync.
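Each access pattern from the question then maps to a single indexed query; a sketch against the hypothetical adjacency table above (42 and 17 are example values):

    -- Retrieve an entire row: served by the primary key.
    SELECT col_id FROM adjacency WHERE row_id = 42;

    -- Retrieve an entire column: served by the reverse index.
    SELECT row_id FROM adjacency WHERE col_id = 17;

    -- Check whether a single (row, column) pair exists.
    SELECT EXISTS (
        SELECT 1 FROM adjacency WHERE row_id = 42 AND col_id = 17
    );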

If you have 10,000 nodes and 10 connections per node, the database will only have 100,000 entries. 100 edges per node would give 1,000,000 entries, and so on. For sparse connectivity this should be fairly efficient.

A back-of-fag-packet estimate

This table will essentially have a row and a column field. If the clustered index goes (row, column, value) then the other, covering index would go (column, row, value). If the additions and deletions were random (i.e. not batched by row or column), the I/O would be approximately double that for just the table, because every insert touches both the table and the covering index.
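In SQL Server terms, for example, the two index shapes described above would look like this (a sketch; matrix_values and idx_col_row_value are illustrative names):

    -- Clustered index on (row, column); the value lives in the leaf.
    CREATE TABLE matrix_values (
        row_id INT NOT NULL,
        col_id INT NOT NULL,
        value  FLOAT NOT NULL,
        PRIMARY KEY CLUSTERED (row_id, col_id)
    );

    -- Covering index the other way round, so column-first queries
    -- never need to touch the clustered table.
    CREATE INDEX idx_col_row_value
        ON matrix_values (col_id, row_id)
        INCLUDE (value);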

If you batched the inserts by row or column then you would get less I/O on one of the indexes, as the records are physically located together in that index. If the matrix really is sparse then this adjacency-list representation is by far the most compact way to store it, and it will be much faster than storing the matrix as a 2D array.

A 10,000 x 10,000 matrix with a 64-bit value per cell would take 800MB (10,000 × 10,000 × 8 bytes) plus the row index. Updating one value would require a write of at least 80k for each write, since the whole row of 10,000 × 8 bytes gets written out. You could optimise writes by row if your data can be grouped by row on insert. If the inserts are realtime and random, then you will write out an 80k row for each insert.

In practice, these writes would have some efficiency because they would all be written out in a mostly contiguous area, depending on how your NoSQL platform physically stores its data.

I don't know how sparse your connectivity is, but if each node had an average of 100 connections, then you would have 1,000,000 records. That is approximately 16 bytes per record (Int4 row, Int4 column, Double value) plus a few bytes of overhead for both the clustered table and the covering index. This structure would take around 32MB plus a little overhead to store (roughly 16MB each for the table and the index).

Updating a single record on a row or column would cause two single disk-block writes (8k, in practice a segment; one for the clustered table and one for the covering index) for random access, assuming the inserts aren't ordered by row or column.

Adding 1 million randomly ordered entries to the array representation would result in approximately 80GB of writes (1,000,000 × 80k whole-row writes) plus a little overhead. Adding 1m entries to the adjacency-list representation would result in approximately 32MB of logical writes (16GB in practice, because a whole 8k block will be written for each index leaf node touched), plus a little overhead.

For that level of connectivity (10,000 nodes, 100 edges per node) the adjacency list will be more efficient in storage space, and probably in I/O as well. You will get some optimisation from the platform, so some sort of benchmark might be appropriate to see which is faster in practice.
