
ArangoDB - Is indexing better than having more collections?

I have 3 types of entity:

  • Subjects
  • Topics
  • Tasks

Each subject contains topics and tasks. Topics can depend on each other. (Of course, a topic that belongs to subject sj1 can only depend on another topic that also belongs to sj1.)

There are also connections between tasks and topics (both of which must belong to the same subject). A connection means that solving a given task requires knowing certain topics.

So a task can require several topics, and a topic can be required by several tasks (an N<--->M relationship).

What would be the best way to store this?

  1. solution

    • Have 3 collections, one for each type of entity
    • In tasks and topics, have an index on a subject identifier attribute
    • Have an edge collection for storing the topics [N]<-->[M] tasks connections
  2. solution

    • Have 1 collection for the subjects
    • For each subject, have 1 topics collection and 1 tasks collection. The link between a subject and its tasks/topics can be based on a collection name prefix (e.g. for the chemistry subject we have the chemistry_tasks and chemistry_topics collections)
    • For each subject, have an edge collection for connections between tasks and topics, and another edge collection for connections among topics (e.g. chemistry_topics_tasks_connections and chemistry_topics_connections)

    This way, if I want to search among the topics or tasks of a subject, I don't need to pre-filter them using the subject identifier index. I immediately get the collection that contains all of the data I need. Moreover, I avoid the overhead of an index entry for each document in tasks and topics. On the other hand, this will result in a mess of collections.


Side note: There will be at most 50 subjects, but the number of tasks and topics is unlimited.

In your terms, "awareness" is expressed through the graph, which requires no extra indexing to work at its best. ArangoDB automatically creates a primary index on "_key" and, for edge collections, an edge index on "_from"/"_to", which it uses for graph traversal.
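As a rough illustration of why the "_from/_to" index is enough for traversal, here is a minimal in-memory sketch in Python. It is not ArangoDB code: the document ids, the `out_index` dictionary, and the `required_topics` helper are all made up for this example, and the dictionary only stands in for the idea of looking up edges by their "_from" value instead of scanning every edge.

```python
from collections import deque

# Hypothetical edges, shaped like ArangoDB edge documents:
# _from and _to hold "collection/key" document ids.
edges = [
    {"_from": "tasks/balance_equation", "_to": "topics/stoichiometry"},
    {"_from": "topics/stoichiometry", "_to": "topics/moles"},
    {"_from": "topics/moles", "_to": "topics/atoms"},
    {"_from": "tasks/name_elements", "_to": "topics/atoms"},
]

# Stand-in for an index on _from: outgoing targets grouped by source id.
out_index = {}
for edge in edges:
    out_index.setdefault(edge["_from"], []).append(edge["_to"])


def required_topics(start_id):
    """Breadth-first walk over outgoing edges from start_id.

    Returns every reachable document id once, in visiting order.
    """
    seen = {start_id}
    queue = deque([start_id])
    order = []
    while queue:
        current = queue.popleft()
        for target in out_index.get(current, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


print(required_topics("tasks/balance_equation"))
```

Starting from a task, the walk picks up the topics it requires directly and then the topics those depend on, without any index on a subject attribute.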

As for indexing, that is all about search performance: indexes are added based on the data you want to find. It really comes down to how you want to search:

  • one collection with multiple entity types or
  • multiple collections segregated by entity type.

There is no penalty for having large collections, and a graph can link documents within a single collection; it doesn't need them to be segregated. You can also have multiple edge collections and/or multiple document collections. These are some of the concepts that challenge those of us who, like me, come from a traditional RDBMS background: "schemaless" or "multi-model" databases kind of turn normalization on its ear.

Personally, I choose to build fairly large collections based on the data source (I import data from external sources). Each collection contains documents of multiple object/data schemas, identified by an objType attribute. The benefit here is that you can search all documents in the collection on a single field (or even an index with multiple fields, like title + objType), very quickly reducing the set of documents to iterate/traverse. This is usually where the real performance gains are made.
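The effect of a multi-field index on a mixed collection can be sketched in plain Python. This is only an in-memory analogy, not ArangoDB's index implementation: the sample documents and the `build_index` helper are invented for illustration, and the `subject` attribute is an assumption borrowed from the question rather than part of my own setup.

```python
from collections import defaultdict

# Hypothetical mixed-type collection; objType distinguishes the entity kinds.
documents = [
    {"_key": "1", "objType": "topic", "subject": "chemistry", "title": "Moles"},
    {"_key": "2", "objType": "task", "subject": "chemistry", "title": "Moles"},
    {"_key": "3", "objType": "topic", "subject": "physics", "title": "Forces"},
]


def build_index(docs, fields):
    """Group document keys by the tuple of values found in `fields`."""
    index = defaultdict(list)
    for doc in docs:
        index[tuple(doc[field] for field in fields)].append(doc["_key"])
    return index


# Analogous to an index on objType + title, and one on objType + subject.
by_type_and_title = build_index(documents, ("objType", "title"))
by_type_and_subject = build_index(documents, ("objType", "subject"))

# One lookup narrows the collection to the matching keys, no full scan.
print(by_type_and_title[("topic", "Moles")])
print(by_type_and_subject[("task", "chemistry")])
```

With an index like the second one, "all tasks of the chemistry subject" becomes a single lookup, which is the same narrowing that per-subject collections in solution 2 are meant to provide.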

So... I guess I recommend solution #3?
