
ArangoDB performance: edge vs. DOCUMENT()

I'm new to ArangoDB and to graphs. I simply want to know whether it is faster to build edges or to use DOCUMENT() for very simple 1:1 connections where querying the graph is not needed.

LET a = DOCUMENT(@from)
FOR v IN OUTBOUND a CollectionAHasCollectionB
    RETURN MERGE(a, {b: v})

vs.

LET a = DOCUMENT(@from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})

Generally speaking, the latter variant

LET a = DOCUMENT(@from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})

should have lower overhead than the full-featured traversal variant. This is because the DOCUMENT() variant does a point lookup of a single document, whereas the traversal is very general-purpose: it can return zero to many results from a variable number of collections, needs to keep track of the paths it has seen, and so on.

When I tried both variants in a local test case, the non-traversal variant was also a lot faster, which supports this claim.

However, the traversal-based variant is more flexible: it also works when there are multiple edges (no 1:1 mapping) and for longer paths.
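To illustrate that flexibility, here is a minimal sketch of a traversal that follows up to three hops instead of one. It assumes @from holds the _id of the start document and reuses the edge collection name CollectionAHasCollectionB from the question; the depth range 1..3 and the returned attribute names are arbitrary choices for illustration. A DOCUMENT() lookup would need one nested call per hop to achieve the same.

```aql
// Assumption: @from is the _id of the start document (e.g. "CollectionA/123").
// v = visited vertex, e = edge used to reach it, p = path from the start vertex.
FOR v, e, p IN 1..3 OUTBOUND @from CollectionAHasCollectionB
    RETURN {
        vertex: v,
        depth: LENGTH(p.edges)  // number of edges between start vertex and v
    }
```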

A simple benchmark you can try:

Create the collections products and categories and an edge collection has_category. Then generate some sample data:

// Query 1: create 10,000 products
FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Product ", i)} INTO products

// Query 2: create 10,000 categories
FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Category ", i)} INTO categories

// Query 3: link each product to 1..5 random categories
FOR p IN products
    LET random_categories = (
        FOR c IN categories
            SORT RAND()
            LIMIT 5
            RETURN c._id
    )
    LET category_subset = SLICE(random_categories, 0, RAND()*5+1)

    UPDATE p WITH {
        categories: category_subset,
        categoriesEmbedded: DOCUMENT(category_subset)[*].name
    } INTO products

    FOR cat IN category_subset
        INSERT {_from: p._id, _to: cat} INTO has_category

Then compare the query times for the different approaches.

Graph traversal (depth 1..1):

FOR p IN products
    RETURN {
        product: p.name,
        categories: (FOR v IN OUTBOUND p has_category RETURN v.name)
    }

Lookup in the categories collection using DOCUMENT():

FOR p IN products
    RETURN {
        product: p.name,
        categories: DOCUMENT(p.categories)[*].name
    }

Using the directly embedded category names:

FOR p IN products
    RETURN {
        product: p.name,
        categories: p.categoriesEmbedded
    }

Graph traversal is the slowest of the three. The lookup in another collection is faster than the traversal, but by far the fastest query is the one with embedded category names.

If you query the categories for just one or a few products, however, the response times should be in the sub-millisecond range regardless of the data model and query approach, and therefore should not pose a performance problem.
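As a rough sketch of what such a single-product query could look like against the sample data above: the bind parameter @productId is a name chosen for illustration and is assumed to hold a product _id such as "products/42". The two queries are meant to be run separately.

```aql
// Variant A: lookup via DOCUMENT(), using the stored category IDs
LET p = DOCUMENT(@productId)
RETURN {
    product: p.name,
    categories: DOCUMENT(p.categories)[*].name
}

// Variant B: one-hop traversal from the same product, using the edges
// (run as a separate query)
FOR v IN OUTBOUND @productId has_category
    RETURN v.name
```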

The graph approach should be chosen if you need to query for paths with variable depth, long paths, shortest paths and so on. For your use case, it is not necessary. Whether the embedded approach is suitable is something you need to decide:

  • Is it acceptable to duplicate information, and potentially have inconsistencies in the data? (If you want to change a category name, you need to change it in all product records instead of in just one category document that products refer to via its immutable ID.)

  • Is there a lot of additional information per category? If so, all that data needs to be embedded into every product document that has that category, which basically trades memory and storage space for performance.

  • Do you need to retrieve a list of all (distinct) categories often? You can do this type of query very cheaply with the separate categories collection. With the embedded approach, it will be much less efficient, because you need to go over all products and collect the category info.
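The trade-offs in the list above can be sketched in AQL against the sample data from the benchmark. These are illustrative queries, not measured results; the bind parameters @oldName and @newName are names made up for this example, and each query is meant to be run on its own.

```aql
// Listing all categories with a separate collection: a single collection scan.
FOR c IN categories
    RETURN c.name

// Listing categories with the embedded model: every product has to be read,
// and only categories that are actually used by some product show up.
FOR p IN products
    FOR name IN p.categoriesEmbedded
        RETURN DISTINCT name

// Renaming a category with the embedded model: every product that carries
// the old name has to be rewritten (assumed bind parameters: @oldName, @newName).
FOR p IN products
    FILTER @oldName IN p.categoriesEmbedded
    UPDATE p WITH {
        categoriesEmbedded: (
            FOR name IN p.categoriesEmbedded
                RETURN name == @oldName ? @newName : name
        )
    } IN products
```

With the separate categories collection, the rename would instead be a single UPDATE of one category document.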

Bottom line: you should choose the data model and approach that fits your use case best. Thanks to ArangoDB's multi-model nature, you can easily try another approach if your use case changes or you run into performance issues.
