How to store tree data in a Lucene/Solr/Elasticsearch index or a NoSQL db?

Say that instead of documents I have small trees that I need to store in a Lucene index. How do I go about doing that?

An example node in the tree:

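import java.util.List;

class Node
{
    String data;          // space-separated words; must be full-text searchable
    String type;          // a single word
    List<Node> children;  // child nodes of this node
}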

In the above node, the "data" member variable is a space-separated string of words, so it needs to be full-text searchable. The "type" member variable is just a single word.

The search query will itself be a tree, and it must match the data and type of each node as well as the structure of the tree. Before matching against a child node, the query must first match the parent node's data and type. Approximate matching on the data value is acceptable.
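To make the matching rule concrete, here is a minimal in-memory sketch in plain Java, not tied to Lucene. It assumes the query root is compared against the tree root, that "approximate" matching can be stood in for by "every query word occurs in the node's data", and that two query children may match the same candidate child. The names TreeMatcher, dataMatches, and matches are hypothetical and not taken from any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Node {
    String data;
    String type;
    List<Node> children = new ArrayList<>();

    Node(String type, String data, Node... children) {
        this.type = type;
        this.data = data;
        this.children.addAll(Arrays.asList(children));
    }
}

class TreeMatcher {
    // Assumed stand-in for approximate matching: every word of the
    // query's data must occur somewhere in the node's data.
    static boolean dataMatches(String queryData, String nodeData) {
        Set<String> words =
            new HashSet<>(Arrays.asList(nodeData.toLowerCase().split("\\s+")));
        for (String w : queryData.toLowerCase().split("\\s+")) {
            if (!w.isEmpty() && !words.contains(w)) {
                return false;
            }
        }
        return true;
    }

    // The parent's type and data are checked first; children are only
    // examined once the parent has matched.
    static boolean matches(Node query, Node candidate) {
        if (!query.type.equals(candidate.type)) {
            return false;
        }
        if (!dataMatches(query.data, candidate.data)) {
            return false;
        }
        // Each query child must match at least one candidate child.
        for (Node queryChild : query.children) {
            boolean found = false;
            for (Node candidateChild : candidate.children) {
                if (matches(queryChild, candidateChild)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }
}
```

A sketch like this only describes the semantics. At the scale mentioned in the question, the same rule would have to be expressed as index queries rather than a scan over every tree.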

What's the best way to index this kind of data? If Lucene does not directly support indexing this kind of data, can Solr or Elasticsearch do it?

I took a quick look at Neo4j, but it seems to store an entire graph in the db, not a large collection (say billions or trillions) of small tree structures. Or is my understanding wrong?

Also, is a non-Lucene-based NoSQL solution better suited for this?

Another approach is to store a representation of the current node's location in the tree. For example, the 17th leaf of the 3rd 2nd-level node of the 1st 1st-level node of the 14th tree would be represented as 014.001.003.017.

Assuming 'treepath' is the field name of the tree location, you would query on 'treepath:014*' to find all nodes and leaves in the 14th tree. Similarly, to find everything below the root of the 14th tree (its children and their descendants, but not the root itself) you would query on 'treepath:014.*'.
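The numbering scheme can be sketched in plain Java as follows. This is only an illustration under a few assumptions: the tree root gets the bare tree number (014), siblings are numbered from 001 in list order, every component is zero-padded to three digits, and the prefix query is emulated with String.startsWith instead of a real index. The names PathNode, TreePaths, enumerate, and withPrefix are hypothetical. In an actual Lucene index the treepath value would presumably need to be indexed as a single untokenized term for prefix queries to behave this way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PathNode {
    String data;
    List<PathNode> children = new ArrayList<>();

    PathNode(String data) {
        this.data = data;
    }

    PathNode add(PathNode child) {
        children.add(child);
        return this;
    }
}

class TreePaths {
    // Zero-pad each component to three digits, as in 014.001.003.017.
    static String component(int n) {
        return String.format(Locale.ROOT, "%03d", n);
    }

    // Assigns a treepath to every node; returns treepath -> node data.
    static Map<String, String> enumerate(int treeNumber, PathNode root) {
        Map<String, String> index = new LinkedHashMap<>();
        walk(component(treeNumber), root, index);
        return index;
    }

    private static void walk(String path, PathNode node, Map<String, String> index) {
        index.put(path, node.data);
        for (int i = 0; i < node.children.size(); i++) {
            // Children are numbered from 001 in sibling order.
            walk(path + "." + component(i + 1), node.children.get(i), index);
        }
    }

    // Emulates a prefix query such as treepath:014.* with string matching.
    static List<String> withPrefix(Map<String, String> index, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String path : index.keySet()) {
            if (path.startsWith(prefix)) {
                hits.add(path);
            }
        }
        return hits;
    }
}
```

With fixed-width components, the prefix "014" selects the whole tree and "014." selects everything except the root. If components were not padded to a fixed width, the prefix "014" could also match an unrelated tree such as 0141.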

The major problem with this approach is that moving branches around requires renumbering every branch after the branch that was moved. If your trees are relatively static, that may only be a minor problem in practice.

(I've seen this approach called either a 'path enumeration' or a 'Dewey Decimal' representation.)

This requirement and its solution are captured here: Proposal for nested docs

This design was subsequently implemented by both core Lucene and Elasticsearch. BlockJoinQuery is the core Lucene implementation, and Elasticsearch appears to have an implementation as outlined here: Elasticsearch nested docs
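Block-join style indexing relies on each tree being written as one contiguous block of flat documents, with a parent placed after its children. The sketch below shows only that flattening step in plain Java, without any Lucene dependency. It assumes a post-order layout for multi-level trees and uses hypothetical names (TreeNode, BlockFlattener, toBlock) and a made-up "depth" field to mark the root; it does not show the actual Lucene or Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TreeNode {
    String data;
    String type;
    List<TreeNode> children = new ArrayList<>();

    TreeNode(String type, String data, TreeNode... children) {
        this.type = type;
        this.data = data;
        this.children.addAll(Arrays.asList(children));
    }
}

class BlockFlattener {
    // Flattens one tree into an ordered block of flat field maps.
    // Every node appears after all of its descendants, so the root
    // is always the last entry of the block.
    static List<Map<String, String>> toBlock(TreeNode root) {
        List<Map<String, String>> block = new ArrayList<>();
        append(root, 0, block);
        return block;
    }

    private static void append(TreeNode node, int depth,
                               List<Map<String, String>> block) {
        // Emit the children (and their subtrees) first.
        for (TreeNode child : node.children) {
            append(child, depth + 1, block);
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("type", node.type);
        doc.put("data", node.data);
        // Hypothetical marker field: depth 0 identifies the root document.
        doc.put("depth", Integer.toString(depth));
        block.add(doc);
    }
}
```

In a real index, each map would become one document and the whole list would be added in a single call so that the documents stay adjacent. The marker field is what a parent filter would select on.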

I suggest Neo4j. A tree is, after all, just a special, constrained graph.

Check out this great discussion on whether you should store a tree in Neo4j:

http://www.mail-archive.com/user@lists.neo4j.org/msg03256.html

There is a project, SIREn ( http://rdelbru.github.io/SIREn ), which deals with 'in-depth' trees and addressing. Internally it uses Dewey numbering ( http://www.ipl.org/div/farq/deweyFARQ.html ) ...
