
Can a query be written in ArangoDB to aggregate values within joined documents?

Let's say you have a movie subscription service with normal and premium memberships.

Here is a sample of data generated by user activity and stored as documents in a collection:

[
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 1
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "starwars",
        "sessionGroupID": 1,
        "elapsed": 200
    },
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 2
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 2,
        "elapsed": 500
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 3
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 3,
        "elapsed": 10
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 4
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 4,
        "elapsed": 100
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 5
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 5,
        "elapsed": 5
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "starwars",
        "sessionGroupID": 5,
        "elapsed": 25
    }
]

You can see that there are two “eventTypes”:

  • “sessionInfo” documents that have information common to an entire user session

  • “mediaPlay” documents that store how many seconds of a movie were viewed.

(Each “mediaPlay” event contains the sessionGroupID so it can be associated with that session.)


Question #1:

Given tens of millions of documents in total, how would you write a query that totals the elapsed viewing time of each movie, grouped by userType?

Desired query results:

premium users - total of "elapsed":
    xmen: 500
    starwars: 200

normal users - total of "elapsed":
    xmen: 115
    starwars: 25

Question #2:

If the data is not structured optimally for such a query, what would be the ideal structure?

  • For example, would it be better to nest the "mediaPlay" events inside each "sessionInfo" document as a nested array?

Like this?

[
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 1,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "starwars",
                "sessionGroupID": 1,
                "elapsed": 200
            }
        ]
    },
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 2,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "xmen",
                "sessionGroupID": 2,
                "elapsed": 500
            }
        ]
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 3,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "xmen",
                "sessionGroupID": 3,
                "elapsed": 10
            }
        ]
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 4,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "xmen",
                "sessionGroupID": 4,
                "elapsed": 100
            }
        ]
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 5,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "xmen",
                "sessionGroupID": 5,
                "elapsed": 5
            },
            {
                "eventType": "mediaPlay",
                "productSKU": "starwars",
                "sessionGroupID": 5,
                "elapsed": 25
            }
        ]
    }
]

Thanks for any and all guidance and advice!

The following query iterates over the collection and collects all session IDs, grouped by userType. For each group it then runs a subquery that iterates over the collection again, keeps the documents whose eventType is "mediaPlay" and whose sessionGroupID is among the collected sessions, and returns each movie together with the sum of its elapsed time.

@@coll is a bind parameter that holds your collection name.

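FOR doc IN @@coll
  // Keep only the session documents, which carry the userType.
  FILTER doc.eventType == "sessionInfo"
  // Group by userType and gather the session IDs of each group.
  COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
  RETURN {
    "userTypes" : userTypes,
    "movies" : (
      // Subquery: sum the elapsed time per movie for this group's sessions.
      FOR event IN @@coll
        FILTER event.sessionGroupID IN sessions
        FILTER event.eventType == "mediaPlay"
        COLLECT movie = event.productSKU INTO elapsed = event.elapsed
        RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
    )
  }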

The result of this query is:

[
  {
    "userTypes": "normal",
    "movies": [
      {
        "movie": "starwars",
        "elapsed": 25
      },
      {
        "movie": "xmen",
        "elapsed": 115
      }
    ]
  },
  {
    "userTypes": "premium",
    "movies": [
      {
        "movie": "starwars",
        "elapsed": 200
      },
      {
        "movie": "xmen",
        "elapsed": 500
      }
    ]
  }
]
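
As a possible alternative, the same totals can be computed as a join followed by a single COLLECT with an AGGREGATE clause, which avoids building the intermediate list of session IDs for each userType. This is only a sketch, not a measured improvement: it assumes the same single collection bound to @@coll, and it returns one flat row per userType/movie pair instead of the nested "movies" array. With tens of millions of documents the inner lookup would probably need an index on sessionGroupID (an assumption, nothing here has been benchmarked).

```aql
// Sketch: join each session to its mediaPlay events, then aggregate once.
FOR session IN @@coll
  FILTER session.eventType == "sessionInfo"
  FOR event IN @@coll
    FILTER event.eventType == "mediaPlay"
    FILTER event.sessionGroupID == session.sessionGroupID
    // Group by (userType, movie) and sum elapsed while grouping.
    COLLECT userType = session.userType, movie = event.productSKU
      AGGREGATE total = SUM(event.elapsed)
    RETURN { "userType" : userType, "movie" : movie, "elapsed" : total }
```

On the sample data this should give the same totals as above (premium: starwars 200, xmen 500; normal: starwars 25, xmen 115), just as four separate rows.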

Regarding your second question: nested arrays/objects wouldn't optimise this query, but you should split your data into two collections, one for every eventType (e.g. name the collections after the event types, sessionInfo and mediaPlay). This reduces the number of FILTER statements needed and, more importantly, lets you query the sessionInfo and mediaPlay documents separately, which should improve performance considerably.

The query would then look like:

FOR doc IN sessionInfo
  COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
  RETURN {
    "userTypes" : userTypes,
    "movies" : (
      FOR event IN mediaPlay
        FILTER event.sessionGroupID IN sessions
        COLLECT movie = event.productSKU INTO elapsed = event.elapsed
        RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
      )
  }
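
If you already have everything in one collection, the split described above could be done with two one-off AQL queries along these lines. This is a rough sketch under several assumptions: the target collections sessionInfo and mediaPlay have already been created, the existing collection is passed as @@coll, and each query is run separately. The system attributes are removed with UNSET so that the new collections assign their own keys.

```aql
// Query 1 (run on its own): copy the session documents.
FOR doc IN @@coll
  FILTER doc.eventType == "sessionInfo"
  INSERT UNSET(doc, "_id", "_key", "_rev") INTO sessionInfo

// Query 2 (run on its own): copy the mediaPlay events.
FOR doc IN @@coll
  FILTER doc.eventType == "mediaPlay"
  INSERT UNSET(doc, "_id", "_key", "_rev") INTO mediaPlay
```

Going forward, the application would write each event straight into the matching collection rather than copying it afterwards.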
