
Neo4j indexing (with Lucene) - good way to organize node “types”?

Posted 2012-03-26 20:39:13. Tags: java, lucene, indexing, neo4j

This is actually more of a Lucene question, but it's in the context of a Neo4j database.

I have a database that's divided into 50 or so node types (the equivalent of "collections" or "tables" in other kinds of databases). Each type has a subset of properties that need to be indexed; some property names are shared between types, some aren't.

When searching, I always want to find nodes of a specific type, never across all nodes.

I can see three ways of organizing this:

1. One index per type; properties map naturally to index fields: index 'foo', 'id'='1234'.
2. A single global index where each field maps to a property name. To distinguish the type, either include it as part of the value ('id'='foo:1234') or check the nodes once they're returned (I expect duplicates to be very rare).
3. A single index where the type is part of the field name: 'foo.id'='1234'.
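To make the three key layouts concrete, here is a minimal sketch in plain Java. ToyIndex and IndexLayouts are hypothetical names invented for this illustration; they are a HashMap-based stand-in for an exact-match index and are not the Neo4j or Lucene API. The sketch only shows which field/value pair each option would store and query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an exact-match index: (field, value) -> node ids.
// NOT the Neo4j or Lucene API; it only illustrates the key layouts.
final class ToyIndex {
    private final Map<String, List<Long>> entries = new HashMap<>();

    private static String key(String field, String value) {
        return field + "=" + value;
    }

    void add(long nodeId, String field, String value) {
        entries.computeIfAbsent(key(field, value), k -> new ArrayList<>()).add(nodeId);
    }

    List<Long> get(String field, String value) {
        return entries.getOrDefault(key(field, value), Collections.emptyList());
    }
}

final class IndexLayouts {
    // Option 1: one index per type, selected by type name.
    final Map<String, ToyIndex> perType = new HashMap<>();
    // Option 2: one global index, type folded into the value.
    final ToyIndex typeInValue = new ToyIndex();
    // Option 3: one global index, type folded into the field name.
    final ToyIndex typeInField = new ToyIndex();

    void indexNode(long nodeId, String type, String property, String value) {
        perType.computeIfAbsent(type, t -> new ToyIndex())
               .add(nodeId, property, value);                   // index 'foo', 'id'='1234'
        typeInValue.add(nodeId, property, type + ":" + value);  // 'id'='foo:1234'
        typeInField.add(nodeId, type + "." + property, value);  // 'foo.id'='1234'
    }

    public static void main(String[] args) {
        IndexLayouts layouts = new IndexLayouts();
        layouts.indexNode(1L, "foo", "id", "1234");
        layouts.indexNode(2L, "bar", "id", "1234"); // same id value, different type

        // Each lookup is scoped to type 'foo' in its own way.
        System.out.println(layouts.perType.get("foo").get("id", "1234"));
        System.out.println(layouts.typeInValue.get("id", "foo:1234"));
        System.out.println(layouts.typeInField.get("foo.id", "1234"));
    }
}
```

In all three layouts a lookup for type foo never sees the bar node, so the choice comes down to how many physical indexes and distinct field names you end up with.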

Once created, the database is read-only.

Are there any benefits to one of these in terms of convenience, size/cache efficiency, or performance?

As I understand it, for the first option Neo4j will create a separate physical index for each type, which seems suboptimal. For the third, I end up with most Lucene docs having only a small subset of the fields; I'm not sure whether that affects anything.

3 Answers

I came across this problem recently when I was building an ActiveRecord connection adapter for Neo4j over REST, to be used in a Rails project. Since ActiveRecord and ActiveRelation are both tightly coupled to SQL syntax, it became difficult to fit everything into NoSQL. It might not be the best solution, but here's how I solved it:

Created an index named model_index, which indexes nodes under two keys, type and model.
Index lookup with the type key currently happens with just one value, model. This was introduced primarily to get SHOW TABLES-style functionality, which gives me a list of all models present in the graph.
Index lookup with the model key takes place with values corresponding to the different model names in my system. This is primarily for achieving DESC <TABLENAME> functionality.
With each table creation, as in CREATE TABLE, a node is created with the table definition attributes stored in node properties.
The created node is indexed under model_index with type:model and model:<model-name>. This makes the newly created model show up in the list of 'tables' and also lets you reach the model node directly by an index lookup with the model key.
For each record created per model (type, in your case), an outgoing edge labeled instances is created, directed from the model node to the new record: v[123] :=> [instances] :=> v[245], where v[123] represents the model node and v[245] represents a record of v[123]'s type.
Now if you want to get all instances of a specified type, you can look up model_index with model:<model-name> to reach the model node and then fetch all adjacent nodes over the outgoing edges labeled instances. Filtered lookups can be achieved by applying filters and other more complex traversals.
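The steps above can be sketched with a toy in-memory graph. This is plain Java, not the adapter's actual code (which goes through REST and Gremlin); ModelGraph and its method names are hypothetical and exist only to show the shape of the idea: model nodes are indexed under type and model, records are not indexed at all and hang off their model node via instances edges.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy graph; a stand-in for the real store, not the adapter's code.
final class ModelGraph {
    private long nextId = 0;
    // model_index: "key:value" -> node ids (keys used here: type, model)
    private final Map<String, List<Long>> modelIndex = new HashMap<>();
    // outgoing "instances" edges: model node id -> record node ids
    private final Map<Long, List<Long>> instances = new HashMap<>();

    // CREATE TABLE analogue: create a model node and index it under both keys.
    long createModel(String modelName) {
        long id = nextId++;
        modelIndex.computeIfAbsent("type:model", k -> new ArrayList<>()).add(id);
        modelIndex.computeIfAbsent("model:" + modelName, k -> new ArrayList<>()).add(id);
        return id;
    }

    // DESC <TABLENAME> analogue: reach the model node with one index lookup.
    long lookupModel(String modelName) {
        List<Long> hits = modelIndex.get("model:" + modelName);
        if (hits == null) {
            throw new IllegalArgumentException("unknown model: " + modelName);
        }
        return hits.get(0);
    }

    // New record: no index entry, only an "instances" edge from the model node.
    long createRecord(String modelName) {
        long modelNode = lookupModel(modelName);
        long id = nextId++;
        instances.computeIfAbsent(modelNode, k -> new ArrayList<>()).add(id);
        return id;
    }

    // SHOW TABLES analogue: every node indexed with type:model.
    List<Long> allModels() {
        return modelIndex.getOrDefault("type:model", Collections.emptyList());
    }

    // All records of a type: one index lookup plus a single-level traversal.
    List<Long> recordsOf(String modelName) {
        return instances.getOrDefault(lookupModel(modelName), Collections.emptyList());
    }

    public static void main(String[] args) {
        ModelGraph graph = new ModelGraph();
        graph.createModel("user");
        graph.createModel("post");
        graph.createRecord("user");
        graph.createRecord("post");
        graph.createRecord("user");
        System.out.println("models: " + graph.allModels());
        System.out.println("user records: " + graph.recordsOf("user"));
    }
}
```

Note that in this sketch the index grows with the number of models, not with the number of records, which is the point of routing record lookups through the model node.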

The above solution prevents model_index from clogging since it contains 2x and achieves an effective record lookup via one index lookup and a single-level traversal.

Although in your case nodes of different types are not adjacent to each other, even if you wanted them to be, you could determine the type of any arbitrary node by simply looking up its adjacent node over an incoming edge labeled instances. Further, I'm considering incorporating SpringDataGraph's pattern of storing a __type__ property on each instance node to avoid this adjacent-node lookup.

I'm currently translating AREL to Gremlin scripts for almost everything. You can find the source code for my AR adapter at https://github.com/yournextleap/activerecord-neo4j-adapter

Hope this helps. Cheers! :)

A single index will be smaller than several little indexes, because some data, such as the term dictionary, will be shared. However, since a term dictionary lookup is an O(lg(n)) operation, a lookup in a bigger term dictionary might be a little slower. (If you have 50 indexes, merging them would only require about 6 more comparisons, since 2^6 >= 50, so you likely won't notice any difference.)
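The "6 more comparisons" estimate can be spelled out as follows. This assumes, purely for illustration, that each of the k = 50 per-type indexes holds about n distinct terms, that no terms are shared between them (the worst case for the merged dictionary), and that a lookup is a binary search:

```latex
\begin{aligned}
\text{cost}_{\text{separate}} &\approx \log_2 n \\
\text{cost}_{\text{merged}}   &\approx \log_2 (k\,n) = \log_2 n + \log_2 k \\
\log_2 50 &\approx 5.64 \le 6 \quad (2^5 = 32 < 50 \le 64 = 2^6)
\end{aligned}
```

So the merged dictionary costs at most about 6 extra comparisons per lookup regardless of n, and fewer if the types share terms.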

Another advantage of a smaller index is that more of it fits in the OS cache, which is likely to make queries run faster.

Instead of your options 2 and 3, I would index two different fields, id and type, and search for (id:ID AND type:TYPE), but I don't know whether that is possible with Neo4j.
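As a rough sketch of what the (id:ID AND type:TYPE) search could look like, the snippet below only builds a query string in Lucene's classic query syntax. TypedQuery and its methods are hypothetical helpers, and whether your Neo4j index accepts such a string (and how the fields are analyzed) is an assumption you would need to check against your version.

```java
// Hypothetical helper: builds a Lucene classic-syntax query string.
// It does not call Lucene or Neo4j; it only shows the shape of the query.
final class TypedQuery {
    // Quote a value as a phrase, escaping backslashes and double quotes.
    static String phrase(String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Both clauses must match: the id value AND the node type.
    static String idAndType(String id, String type) {
        return "id:" + phrase(id) + " AND type:" + phrase(type);
    }

    public static void main(String[] args) {
        System.out.println(idAndType("1234", "foo"));
    }
}
```

Compared with options 2 and 3, this keeps the field names and values unmangled at the cost of one extra indexed field per node.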

spring-data-neo4j uses the first approach: it creates a different index for each type. So I guess that's a good option for the general scenario. But in your particular case it might be suboptimal, as you say. I'd run some benchmarks to measure the performance.

The other two, by the way, seem a bit artificial. You would possibly be indexing completely unrelated information in the same index, which doesn't sound right.