
Neo4j graph database design and efficient query

Let me first explain what I want to model using Neo4j (v2).

Assume an n-dimensional dataset of the form:

val1Dim1, ... , val1Dimn, classValue1
val2Dim1, ... , val2Dimn, classValue2
....

Each dimension comes with a hierarchy (say, a tree). The total number of "dimension nodes" is around 1K, or slightly higher depending on the dataset.

A data mining approach (link to the scientific paper) is run over the dataset, and a huge number of patterns is extracted from it.

Basically, each pattern is of the form:

{a set of values of Dim1} {a set of values of Dim2} ... {a set of class values}

There are at least 11M mined patterns.

My design choice

2 types of nodes (labels):

  • DATA (for instance, val1Dim1 is a DATA node) => around 1K nodes. These nodes have three properties: LABEL (the value itself), DIMENSION (the dimension id), and a derived property, KEY, of the form "DIMENSION_LABEL". An index has been defined on KEY.

  • PATTERN (one per pattern) => at least 11M nodes

2 types of relationships:

  • IS_A to represent the generalization/specialization relationship, used to navigate through the hierarchies

  • COMPOSED_BY to link a pattern to each of its members (for instance, if P={val1Dim1,val2Dim1} {val1Dim2} is a pattern, then 3 relationships, i.e., P->val1Dim1, P->val2Dim1 and P->val1Dim2, are created).

Here is a toy graph database to make my design choices clear: [image]

Data insertion and specifications

I used the batch inserter and it works pretty fast (around 40 minutes). The DB is around 50 GB and consists of around 11M nodes and 1B (!!) relationships. For now, I am running the code on my machine (8 GB of RAM, Intel i7, and a 500 GB SSD). I am using Java.

What I'd like to do

Given one value per dimension, I would like to find the patterns that involve all of those dimension values.

Currently, assuming 2 dimensions, the query I am using to achieve my goal is:

match (n:DATA {KEY:'X'})-[r:COMPOSED_BY]-(p:PATTERN)-[r2:COMPOSED_BY]-(m:DATA {KEY:'Y'}) 
return p;

For now, it is very slow. The memory usage of the Java process is 2 GB (maximum).
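For intuition about what this query has to compute: it amounts to intersecting, for each requested DATA key, the set of patterns linked to it through COMPOSED_BY. Going by the figures above (about 1B relationships over about 1K DATA nodes), a single DATA node carries on the order of 1M such relationships on average, so each side of the intersection is large. The following is a minimal in-memory sketch of that intersection in plain Java, not Neo4j code; the class `PatternLookup` and its methods are hypothetical names introduced only for illustration.

```java
import java.util.*;

// Hypothetical in-memory stand-in for the COMPOSED_BY relationships:
// each DATA key maps to the ids of the patterns that contain it.
final class PatternLookup {
    private final Map<String, Set<Long>> patternsByKey = new HashMap<>();

    // Record that pattern `patternId` is COMPOSED_BY the DATA node `key`.
    void link(long patternId, String key) {
        patternsByKey.computeIfAbsent(key, k -> new HashSet<>()).add(patternId);
    }

    // Ids of the patterns that contain every one of the given keys.
    Set<Long> patternsContainingAll(List<String> keys) {
        if (keys.isEmpty()) {
            return Collections.<Long>emptySet();
        }
        List<Set<Long>> sets = new ArrayList<>();
        for (String key : keys) {
            sets.add(patternsByKey.getOrDefault(key, Collections.<Long>emptySet()));
        }
        // Start from the smallest set so the intermediate result stays small.
        sets.sort(Comparator.comparingInt(Set::size));
        Set<Long> result = new TreeSet<>(sets.get(0));
        for (int i = 1; i < sets.size(); i++) {
            result.retainAll(sets.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        PatternLookup index = new PatternLookup();
        index.link(1L, "Dim1_a"); index.link(1L, "Dim2_x");
        index.link(2L, "Dim1_a"); index.link(2L, "Dim2_y");
        index.link(3L, "Dim1_b"); index.link(3L, "Dim2_x");
        // Only pattern 1 contains both requested values.
        System.out.println(index.patternsContainingAll(Arrays.asList("Dim1_a", "Dim2_x"))); // prints [1]
    }
}
```

Starting from the smallest set is a design choice of this sketch: the cost of the intersection is then bounded by the rarest value rather than the most common one.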

My questions

  1. Do you think a graph DB is appropriate for such a scenario?
  2. Are my design choices OK?
  3. What about indexes? Do I need to define more?
  4. Is my way of querying the DB OK?
  5. Are there any configuration tricks to speed up the query phase?
  6. What server specifications would suit my application's needs?

Thanks in advance

Yoann

I have a few suggestions. You can use node labels (rather than a node property). To learn more about node labels, see here.

If you use labels, all the values of a particular dimension are automatically grouped into one set (i.e., the label). Hence you reduce the number of IS_A relationships you have to maintain. And since relationships are more expensive space-wise, you can reduce the size of your database. Moreover, indexed searches on labels are available and are faster than searching for keys in the entire index.

In the model below, under each dimension node (DATA) I have added two attributes, key and value. You could instead keep only one of them, the key, and simply index on it. Then, when you need the value, just parse the key. (This is only a suggestion; I don't know what kind of use cases you are going to have.)
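As a minimal sketch of the "just parse the key" idea, assuming the question's "DIMENSION_LABEL" format and assuming the dimension id itself never contains an underscore: the helper below builds the key and recovers both parts by splitting at the first underscore. `DataKey` and its methods are hypothetical names, not part of any Neo4j API.

```java
// Hypothetical helper for the composite KEY property ("DIMENSION_LABEL").
final class DataKey {
    // Build the KEY from a dimension id and a value label.
    static String of(String dimension, String label) {
        return dimension + "_" + label;
    }

    // Position of the separator; assumes the dimension id has no underscore.
    private static int separator(String key) {
        int sep = key.indexOf('_');
        if (sep < 0) {
            throw new IllegalArgumentException("not a DIMENSION_LABEL key: " + key);
        }
        return sep;
    }

    // Recover the dimension id (everything before the first underscore).
    static String dimensionOf(String key) {
        return key.substring(0, separator(key));
    }

    // Recover the value label (everything after the first underscore).
    static String labelOf(String key) {
        return key.substring(separator(key) + 1);
    }

    public static void main(String[] args) {
        String key = DataKey.of("Dim1", "val_1");
        System.out.println(key);              // prints Dim1_val_1
        System.out.println(dimensionOf(key)); // prints Dim1
        System.out.println(labelOf(key));     // prints val_1
    }
}
```

If dimension ids can contain underscores, this split is ambiguous, and keeping the two separate properties is the safer option.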

Suggestions and comments are welcome.

Comment back if you need more info.


Edit after comment

As per your comment, in order to reduce the number of PATTERN nodes, you can link the DATA nodes themselves by creating unique relationship types named after the patterns. See the updated diagram for more clarification.

My suggested model: [image]
