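Can a query be written in ArangoDB to aggregate values within joined documents?
Suppose you run a movie subscription service with normal and premium users.
Here is a sample of the data generated by user activity, stored as documents in a collection: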
[
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 1
},
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 1,
"elapsed": 200
},
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 2
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 2,
"elapsed": 500
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 3
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 3,
"elapsed": 10
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 4
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 4,
"elapsed": 100
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 5
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 5,
"elapsed": 5
},
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 5,
"elapsed": 25
}
]
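As you can see, there are two "eventTypes":
"sessionInfo" documents hold information common to an entire user session.
"mediaPlay" documents store the number of seconds a movie was watched.
(Every "mediaPlay" event includes the sessionGroupID, so it can be associated with its session.)
Given tens of millions of documents in total, how would you write a query that totals the viewing time of each movie, grouped by userType?
Desired query result: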
premium users - total of "elapsed":
xmen: 500
starwars: 200
normal users - total of "elapsed":
xmen: 115
starwars: 25
If the data is not structured optimally for a query like this, what would the ideal structure be?
Something like this?
[
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 1,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 1,
"elapsed": 200
}
]
},
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 2,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 2,
"elapsed": 500
}
]
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 3,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 3,
"elapsed": 10
}
]
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 4,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 4,
"elapsed": 100
}
]
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 5,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 5,
"elapsed": 5
},
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 5,
"elapsed": 25
}
]
}
]
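Thanks for any guidance and suggestions!
The following query iterates over the collection and collects all session IDs, grouped by userType. It then runs a subquery that iterates over the collection again and, for each movie, sums the elapsed time of the events whose eventType is "mediaPlay" and whose sessionGroupID is among the collected sessions.
@@coll is a bind parameter that holds the name of your collection.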
FOR doc IN @@coll
  FILTER doc.eventType == "sessionInfo"
  COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
  RETURN {
    "userTypes" : userTypes,
    "movies" : (
      FOR event IN @@coll
        FILTER event.sessionGroupID IN sessions
        FILTER event.eventType == "mediaPlay"
        COLLECT movie = event.productSKU INTO elapsed = event.elapsed
        RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
    )
  }
The result of this query is:
[
{
"userTypes": "normal",
"movies": [
{
"movie": "starwars",
"elapsed": 25
},
{
"movie": "xmen",
"elapsed": 115
}
]
},
{
"userTypes": "premium",
"movies": [
{
"movie": "starwars",
"elapsed": 200
},
{
"movie": "xmen",
"elapsed": 500
}
]
}
]
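As a sanity check, the two steps of the query can be traced in plain Python against the sample data from the question. This is only a sketch of the logic, not of how ArangoDB executes the query; the function name elapsed_by_user_type and the in-memory list docs are made up for illustration, and the results are sorted explicitly so they line up with the output above.

```python
from collections import defaultdict

# Sample documents from the question (hypothetical in-memory stand-in
# for the collection).
docs = [
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 1},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 1, "elapsed": 200},
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 2},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 2, "elapsed": 500},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 3},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 3, "elapsed": 10},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 4},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 4, "elapsed": 100},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 5},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 5, "elapsed": 5},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 5, "elapsed": 25},
]


def elapsed_by_user_type(docs):
    # Step 1 (outer COLLECT): gather the session IDs of each userType.
    sessions_by_type = defaultdict(set)
    for doc in docs:
        if doc["eventType"] == "sessionInfo":
            sessions_by_type[doc["userType"]].add(doc["sessionGroupID"])

    # Step 2 (subquery): for each userType, sum "elapsed" per movie over
    # the mediaPlay events that belong to the collected sessions.
    result = []
    for user_type in sorted(sessions_by_type):
        sessions = sessions_by_type[user_type]
        totals = defaultdict(int)
        for event in docs:
            if event["eventType"] == "mediaPlay" and event["sessionGroupID"] in sessions:
                totals[event["productSKU"]] += event["elapsed"]
        movies = [{"movie": m, "elapsed": totals[m]} for m in sorted(totals)]
        result.append({"userTypes": user_type, "movies": movies})
    return result


result = elapsed_by_user_type(docs)
for row in result:
    print(row)
```

Note that step 2 scans every document once per userType, which mirrors the subquery filtering the whole collection for each group.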
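Regarding your second question: nested arrays/objects would not optimize this query, but you should split your data into two collections, one for each eventType (for example, name the collections sessionInfo and mediaPlay). This reduces the number of FILTER statements needed and, more importantly, lets you query the sessionInfos and mediaPlays separately, which should improve performance significantly.
The query would then look like this: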
FOR doc IN sessionInfo
  COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
  RETURN {
    "userTypes" : userTypes,
    "movies" : (
      FOR event IN mediaPlay
        FILTER event.sessionGroupID IN sessions
        COLLECT movie = event.productSKU INTO elapsed = event.elapsed
        RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
    )
  }
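For comparison, here is a rough Python sketch of the same aggregation over the two separate collections. It deliberately uses a different technique from the query above: it builds a lookup from sessionGroupID to userType and then makes a single pass over the mediaPlay events. The lists session_info and media_play and the function totals_by_user_type are hypothetical in-memory stand-ins, not ArangoDB API calls.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two collections. The eventType
# field is dropped because the collection itself now identifies the type.
session_info = [
    {"userType": "premium", "sessionGroupID": 1},
    {"userType": "premium", "sessionGroupID": 2},
    {"userType": "normal", "sessionGroupID": 3},
    {"userType": "normal", "sessionGroupID": 4},
    {"userType": "normal", "sessionGroupID": 5},
]
media_play = [
    {"productSKU": "starwars", "sessionGroupID": 1, "elapsed": 200},
    {"productSKU": "xmen", "sessionGroupID": 2, "elapsed": 500},
    {"productSKU": "xmen", "sessionGroupID": 3, "elapsed": 10},
    {"productSKU": "xmen", "sessionGroupID": 4, "elapsed": 100},
    {"productSKU": "xmen", "sessionGroupID": 5, "elapsed": 5},
    {"productSKU": "starwars", "sessionGroupID": 5, "elapsed": 25},
]


def totals_by_user_type(session_info, media_play):
    # Build the join key once: sessionGroupID -> userType.
    user_type_of = {s["sessionGroupID"]: s["userType"] for s in session_info}

    # Single pass over the events, accumulating elapsed per (userType, movie).
    totals = defaultdict(lambda: defaultdict(int))
    for event in media_play:
        user_type = user_type_of.get(event["sessionGroupID"])
        if user_type is None:
            continue  # no matching session; skipped in this sketch
        totals[user_type][event["productSKU"]] += event["elapsed"]

    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {user_type: dict(movies) for user_type, movies in totals.items()}


totals = totals_by_user_type(session_info, media_play)
print(totals)
```

With the event types separated, neither pass needs to test eventType, which is the reduction in filter statements described above.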