
Best approach for a sync service over a 150-million-document MongoDB collection with Spark SQL?

I have a 150M document MongoDB collection in a single MongoDB instance. Each document is a product. A product has a price and a category, e.g.:

{
  category: "shoes",
  price: 20,
  ...
}

I would like to expose a REST API method to make synchronous queries over this collection, e.g.: what is the average price of all the products in a given category X?

So far, I have tried to implement it in two different ways, and both seem too slow for exposing a synchronous service (the client would have to wait too long):

  1. Native MongoDB aggregation: running the aggregation pipeline directly in MongoDB seems too slow when the number of products to average is really big (a minimal sketch follows this list).

  2. MongoDB + Spark SQL: using filter pushdown to fetch only the products of the given category and computing the average price on the Spark cluster nodes. This approach takes too long just to load the products collection into cluster memory: it took 13 minutes for a collection of 80k products on an AWS EMR cluster with 1 master and 2 slaves (a sketch of this approach also follows the list).
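For reference, here is a minimal sketch of the first approach with PyMongo; the connection string, database/collection names and the category value are illustrative assumptions, not the actual setup:

from pymongo import MongoClient

# Hypothetical connection string and namespace; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
products = client["catalog"]["products"]

# An index on category lets the $match stage avoid a full collection scan.
products.create_index("category")

# Average price for one category, computed entirely inside MongoDB.
pipeline = [
    {"$match": {"category": "shoes"}},
    {"$group": {"_id": "$category", "avgPrice": {"$avg": "$price"}}},
]
print(list(products.aggregate(pipeline)))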
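And a rough sketch of the second approach, assuming the MongoDB Spark connector; the format name and config key below are the 3.x style (connector 10.x uses the "mongodb" format and spark.mongodb.read.connection.uri instead), and the URI is again an assumption:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("avg-price-by-category")
         .config("spark.mongodb.input.uri",
                 "mongodb://localhost:27017/catalog.products")
         .getOrCreate())

df = spark.read.format("mongo").load()

# The equality filter is pushed down to MongoDB as a $match, so only the
# matching documents should be loaded into the cluster.
avg_price = (df.filter(F.col("category") == "shoes")
               .agg(F.avg("price").alias("avgPrice")))
avg_price.show()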

So my questions are:

a) Should approach #2 work? Is it supposed to be fast enough, and am I simply doing something wrong?

b) What is the best way to achieve this? What is the best solution from an architectural point of view?

c) How would you do that?

Many thanks!

Querying a 150M document collection sitting on a single server and expecting high speed is, in my humble opinion, too much of an ask.

With regards to the first approach (native aggregation), an aggregation pipeline would be executed across all the shards of the collection (unless the $match is on the shard key). Each node would then take care of finding the matching documents in its own shard, hence distributing the workload. That should provide faster response times (and leave CPU time for other concurrent queries, if any).

With regards to the second approach (MongoDB + Spark SQL), if I understand correctly, you would end up streaming 150M records through Spark. I am not sure what advantage you expect to come from that.

Therefore, regarding c), the TL;DR is: run the aggregation on a sharded collection.
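If it helps, a rough sketch of that setup through a mongos router with PyMongo; the database/collection names and the shard key choice are assumptions, and the point is only that the shard key prefix should match your $match so the pipeline is routed to a subset of shards:

from pymongo import MongoClient

# Connect through a mongos router, not a shard member directly.
client = MongoClient("mongodb://mongos-host:27017")
products = client["catalog"]["products"]

# The shard key must be indexed before sharding a populated collection.
products.create_index([("category", 1), ("_id", 1)])

# Enable sharding on the database, then shard the collection on a
# compound key starting with category, so $match on category is targeted.
client.admin.command("enableSharding", "catalog")
client.admin.command("shardCollection", "catalog.products",
                     key={"category": 1, "_id": 1})

# The same aggregation as before now fans out only to the shards that
# own chunks for the requested category.
avg = list(products.aggregate([
    {"$match": {"category": "shoes"}},
    {"$group": {"_id": None, "avgPrice": {"$avg": "$price"}}},
]))
print(avg)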
