
How to deal with multiple database results from different servers for a request

Original post: 2016-06-30 06:42:55 9 5 java/ database/ architecture/ scalability/ bigdata

I have cloud statistics (structured data, CSV) that I have to expose to administrators and users.

For scalability, however, the data will be collected by multiple machines (perf monitors), each connected to its own DB.

The Manager (Mgr) is responsible for multicasting each request to all perf monitors and collecting the overall stats data to satisfy a single UI request.

So the questions are:

1) How do I sort the data from multiple monitors at the Mgr according to the client request? Each monitor may return its result as the client requested, but how do I then merge the data from multiple machines in Java? That is, how do I perform in-memory SQL aggregate/scalar functions (e.g. GROUP BY, ORDER BY, AVG) on all the results retrieved from multiple clusters at the Mgr? How do I implement DB SQL aggregate/scalar functionality on the Java side; are there any known APIs? I think what I need is the Reduce part of the MapReduce technique in Hadoop.

2) A request from the UI (assume select count(*) from DB where Memory > 1000MB) has to be forwarded to multiple machines. How do I send parallel requests to the individual monitors and consume the results only when all the nodes have responded? That is, how do I make the user thread wait until all the responses from the perf monitors have been consumed? How do I trigger parallel REST requests for a single UI request at the Mgr?

3) Do I have to authenticate the UI user at both the Mgr and the perf monitors?

4) Do you see any drawbacks in this approach?

Notes:

1) I didn't go for NoSQL because the data is structured and no joins are required.

2) I didn't go for node.js since I am new to it and developing with it may take more time. Also, I am not developing anything concurrency-critical where a single-threaded model is best suited. Only pushing and retrieving of data is done here; no modification happens.

3) I want an individual DB for each monitor, or at least two DB instances with multiple clusters per instance, to support faster access to real-time BIG statistical data.

5 answers

You want to scale your app, but you designed an inherent bottleneck, namely the Mgr.

What I would do is split the Mgr into at least two parts: a front end and a backend. The front end could simply be an aggregator and/or controller that collects all the requests from all the different UI servers, timestamps them, and puts them in a queue (RabbitMQ, Kafka, Redis, whatever), creating a message with the UI session ID or something similar that uniquely identifies the source of the request. Then you just have to wait until you get a response on the queue (with a different topic, of course).

Then on your backend (the other side of the queue) you can set up as many nodes as your load requires and have them all perform the same task, namely: pull requests off the queue and call the performance monitoring APIs as necessary. You can scale these backend nodes as much as you wish since they don't hold any state; all the state that needs to be stored is already part of the messages in the queue, which will be automagically persisted for you by Redis/Kafka/RabbitMQ or whatever else you choose.
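As a rough illustration of the stateless worker described above, here is a minimal sketch. It assumes in-memory `BlockingQueue`s as a stand-in for the real broker (RabbitMQ, Kafka, or Redis), and the `Request`/`Response` message shapes and the `callMonitors` function are hypothetical names invented for this example, not part of any of those products.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

final class QueueWorkerSketch {

    /** Request message: everything a worker needs travels inside the message. */
    static final class Request {
        final String sessionId;
        final long timestampMillis;
        final String query;

        Request(String sessionId, long timestampMillis, String query) {
            this.sessionId = sessionId;
            this.timestampMillis = timestampMillis;
            this.query = query;
        }
    }

    /** Reply message, published on a separate queue and matched by session ID. */
    static final class Response {
        final String sessionId;
        final String result;

        Response(String sessionId, String result) {
            this.sessionId = sessionId;
            this.result = result;
        }
    }

    /**
     * Handles at most one pending request and returns false when the request
     * queue is empty. callMonitors is a placeholder for the real calls to the
     * perf monitor APIs.
     */
    static boolean handleOne(BlockingQueue<Request> requests,
                             BlockingQueue<Response> responses,
                             Function<String, String> callMonitors) {
        Request request = requests.poll(); // non-blocking; null if nothing is queued
        if (request == null) {
            return false;
        }
        String result = callMonitors.apply(request.query);
        responses.add(new Response(request.sessionId, result));
        return true;
    }

    public static void main(String[] args) {
        BlockingQueue<Request> requests = new LinkedBlockingQueue<>();
        BlockingQueue<Response> responses = new LinkedBlockingQueue<>();
        requests.add(new Request("session-42", System.currentTimeMillis(), "count where memory > 1000MB"));

        // Any number of identical workers could run this loop against the same queues.
        while (handleOne(requests, responses, query -> "result of [" + query + "]")) {
            Response response = responses.poll();
            System.out.println(response.sessionId + " -> " + response.result);
        }
    }
}
```

Because the worker keeps nothing between calls, adding capacity only means starting more processes that run the same loop; with a real broker the queue operations would be replaced by that broker's client library.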

You can also use Apache Storm or something similar to do this for you in the backend, since it was designed for exactly this kind of application.

Apache Storm also has built-in merging capability exposed through the Trident API.

A note on authentication: you should authenticate the HTTP requests on the front-end side, and then you will be all right. Just assign unique IDs (most probably session IDs) to the users connected to your Mgr and use this internal ID when you forward requests to downstream servers.

How do I send parallel requests to the individual monitors and consume the results only when all the nodes have responded? That is, how do I make the user thread wait until all the responses from the perf monitors have been consumed? How do I trigger parallel REST requests for a single UI request at the Mgr?

Well, if you have so many questions about handling user connections and serving those clients with responses, then I would suggest picking up a book on the Java Servlet API. You might want to read this one, for example: Servlet & JSP: A Tutorial (A Tutorial series). It is a bit outdated but well written.

But with all due respect, if you have so many questions on these quite fundamental topics, then it might be better to leave the architecture design to someone more experienced.

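Don't reinvent the wheel. Use one of the good existing BAM and database monitoring tools; they come with many built-in dashboards and statistics and are easy to connect to Java and to workflows.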

For scalability, however, the data will be collected by multiple machines (perf monitors), each connected to its own DB.

Approximately what sort of scale do you anticipate: hundreds of gigabytes, or multiple terabytes? The reason I ask is that these days SQL Server and Oracle can handle really large volumes of data. Once the data is collected in a central DB, it's game over as far as searching and crunching are concerned.

The Manager (Mgr) is responsible for multicasting each request to all perf monitors and collecting the overall stats data to satisfy a single UI request.

Writing this will be a major task, and it will be really complex, IMHO. That said, I am not an expert in this area.

What I would do is put a layer of Hazelcast, Infinispan, or something similar in your performance monitors. The performance monitor logic itself can be part of the data grid. MySQL then works as the persistent storage for this data grid. In this sense you can have more than one MySQL instance, with each holding just a portion of the data; it simply extends your capacity beyond your maximum RAM. Over time, as you scale your performance monitors, you also scale your persistence capabilities.

Using Map Reduce or other distributed functions for aggregation can then give you a massive amount of parallelism and the ability to serve significantly more requests. Such an architecture also scales horizontally. In the end it should look something like this:

On another note, it is not necessary in general to have one MySQL for each Hazelcast node. That depends on what the goal is. I also kind of forgot the Manager in the diagram, but things there are simple: it can either work as a gateway to the data grid or be merged with the grid.

Not sure if my answer will be useful to you, since this question was posted some time back.

I would like to answer based on your questions, the problems in the current approach, and a proposed solution...

1) How do I sort the data from multiple monitors at the Mgr according to the client request? Each monitor may return its result as the client requested, but how do I then merge the data from multiple machines in Java? That is, how do I perform in-memory SQL aggregate/scalar functions (e.g. GROUP BY, ORDER BY, AVG) on all the results retrieved from multiple clusters at the Mgr? How do I implement DB SQL aggregate/scalar functionality on the Java side; are there any known APIs? I think what I need is the Reduce part of the MapReduce technique in Hadoop.

The JDK (through JDK 8) ships with a built-in Java DB, which is also available as the Apache Derby database. It can be used as an in-memory SQL database. Java DB and Apache Derby can also store the data on disk, in which case you won't lose it after a restart. See http://www.oracle.com/technetwork/java/javadb/overview/index.html and https://db.apache.org/derby/

For Map-Reduce, a simple approach based on Java collections would work. I don't think you need any special Map-Reduce framework in this case. You should, however, consider out-of-memory conditions, network bandwidth, etc. when you read data from multiple sources.
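To make the collection-based reduce step concrete, here is a minimal sketch. It assumes each monitor already runs the GROUP BY locally and returns, per group, a row count and a sum rather than a finished average; the `Partial` class and the host names are hypothetical. Shipping count and sum matters because averages from different monitors cannot simply be averaged again when the row counts differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartialAggregateMerge {

    /** One GROUP BY row from a single monitor: row count and sum for a group (hypothetical shape). */
    static final class Partial {
        final String group;
        final long count;
        final double sum;

        Partial(String group, long count, double sum) {
            this.group = group;
            this.count = count;
            this.sum = sum;
        }
    }

    /** Adds up the partial rows of every monitor, keyed by group. */
    static Map<String, Partial> combine(List<List<Partial>> perMonitor) {
        Map<String, Partial> merged = new TreeMap<>();
        for (List<Partial> rows : perMonitor) {
            for (Partial p : rows) {
                merged.merge(p.group, p,
                        (a, b) -> new Partial(a.group, a.count + b.count, a.sum + b.sum));
            }
        }
        return merged;
    }

    /** Global average of a group, derived only after all partials are combined. */
    static double average(Partial p) {
        return p.count == 0 ? 0.0 : p.sum / p.count;
    }

    /** Group names in ORDER BY avg DESC order. */
    static List<String> groupsByAverageDesc(Map<String, Partial> merged) {
        List<Partial> rows = new ArrayList<>(merged.values());
        Comparator<Partial> byAverage = Comparator.comparingDouble(PartialAggregateMerge::average);
        rows.sort(byAverage.reversed());
        List<String> names = new ArrayList<>();
        for (Partial p : rows) {
            names.add(p.group);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Partial> monitorA = Arrays.asList(
                new Partial("host-1", 2, 3000), new Partial("host-2", 1, 500));
        List<Partial> monitorB = Arrays.asList(
                new Partial("host-1", 2, 1000), new Partial("host-2", 3, 4500));

        Map<String, Partial> merged = combine(Arrays.asList(monitorA, monitorB));
        for (String group : groupsByAverageDesc(merged)) {
            Partial p = merged.get(group);
            System.out.println(group + " count=" + p.count + " avg=" + average(p));
        }
    }
}
```

Only one small row per group crosses the network and sits in memory at the Mgr, which also helps with the memory and bandwidth concerns mentioned above.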

2) A request from the UI (assume select count(*) from DB where Memory > 1000MB) has to be forwarded to multiple machines. How do I send parallel requests to the individual monitors and consume the results only when all the nodes have responded? That is, how do I make the user thread wait until all the responses from the perf monitors have been consumed? How do I trigger parallel REST requests for a single UI request at the Mgr?

Ideally, a NodeJS kind of application is really the best suited in this case, where the application gets a callback whenever an HTTP call responds. However, you can implement the Observer pattern as explained in: How do I perform a JAVA callback between classes?
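As an alternative to hand-written callbacks, the standard library's `CompletableFuture` can express "send to all monitors in parallel, continue only when all have answered". This is a minimal sketch, not the approach the answer itself describes: the monitor URLs are made up, and `fetchCount` is a placeholder for whatever REST client call actually queries one monitor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

final class MonitorFanOut {

    /**
     * Sends the same query to every monitor in parallel and returns the sum of
     * their counts once all of them have answered.
     * fetchCount is a placeholder for the real REST call to one monitor.
     */
    static long totalCount(List<String> monitorUrls,
                           Function<String, Long> fetchCount,
                           ExecutorService pool,
                           long timeoutSeconds) {
        List<CompletableFuture<Long>> calls = new ArrayList<>();
        for (String url : monitorUrls) {
            calls.add(CompletableFuture.supplyAsync(() -> fetchCount.apply(url), pool));
        }

        // allOf completes when every call has completed; it completes
        // exceptionally if any of the calls failed.
        CompletableFuture<Void> all = CompletableFuture.allOf(calls.toArray(new CompletableFuture[0]));
        try {
            all.get(timeoutSeconds, TimeUnit.SECONDS); // the request thread waits here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for monitors", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("not all monitors answered", e);
        }

        long total = 0;
        for (CompletableFuture<Long> call : calls) {
            total += call.join(); // already completed, so this does not block
        }
        return total;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<String> monitors = Arrays.asList(
                    "http://monitor-1/stats", "http://monitor-2/stats", "http://monitor-3/stats");
            // Stand-in for an HTTP call: pretend each monitor matched 10 rows.
            long total = totalCount(monitors, url -> 10L, pool, 5);
            System.out.println("total = " + total);
        } finally {
            pool.shutdown();
        }
    }
}
```

The timeout is a design choice worth keeping: without it, a single unresponsive monitor would block the UI request indefinitely.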

3) Do I have to authenticate the UI user at both the Mgr and the perf monitors?

It should be based on your requirements.

4) Do you see any drawbacks in this approach?

There are several drawbacks with this approach:

Data should not be pulled on demand from the UI. At the least, the data should already be available in a centralised database whenever a request for it arrives. Pulling data from various end-points is expensive.
Stats must be collected periodically to maintain history, and reports must be generated based on a moving time window.
The JVM might run out of memory if large data needs to be processed. Proper handling is required.
Large data might get transferred over the network every time there is a new request, possibly for the same data again.
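The "collect periodically, report over a moving time window" point can be sketched as follows. This is only an illustration under stated assumptions: the class name is made up, samples are assumed to arrive with non-decreasing timestamps, and a real deployment would keep the history in the central database rather than in a single JVM.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class MovingWindowStats {

    private static final class Sample {
        final long timeMillis;
        final double value;

        Sample(long timeMillis, double value) {
            this.timeMillis = timeMillis;
            this.value = value;
        }
    }

    private final long windowMillis;
    private final Deque<Sample> samples = new ArrayDeque<>();
    private double sum;

    MovingWindowStats(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Records one periodically collected sample; timestamps must be non-decreasing. */
    synchronized void add(long timeMillis, double value) {
        samples.addLast(new Sample(timeMillis, value));
        sum += value;
        evictUpTo(timeMillis - windowMillis);
    }

    /** Average of the samples inside the window ending at nowMillis, or 0 if there are none. */
    synchronized double average(long nowMillis) {
        evictUpTo(nowMillis - windowMillis);
        return samples.isEmpty() ? 0.0 : sum / samples.size();
    }

    /** Drops samples at or before the cutoff so memory stays bounded by the window size. */
    private void evictUpTo(long cutoffMillis) {
        while (!samples.isEmpty() && samples.peekFirst().timeMillis <= cutoffMillis) {
            sum -= samples.removeFirst().value;
        }
    }

    public static void main(String[] args) {
        MovingWindowStats memoryMb = new MovingWindowStats(1000);
        memoryMb.add(0, 10);
        memoryMb.add(500, 20);
        memoryMb.add(1200, 30);
        System.out.println("average = " + memoryMb.average(1200));
    }
}
```

A UI request is then answered from data that is already collected, instead of triggering a fresh pull from every end-point.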

Notes:

1) I didn't go for NoSQL because the data is structured and no joins are required.

NoSQL doesn't mean that no structure is followed. A NoSQL database is in fact the best fit for data like this, where you don't update records and don't need transactions.

2) I didn't go for node.js since I am new to it and developing with it may take more time. Also, I am not developing anything concurrency-critical where a single-threaded model is best suited. Only pushing and retrieving of data is done here; no modification happens.

NodeJS won't be a good choice since it is single-threaded. NodeJS should not be used when you have CPU-intensive jobs to perform, like yours.

3) I want an individual DB for each monitor, or at least two DB instances with multiple clusters per instance, to support faster access to real-time BIG statistical data.

**I would rather suggest that you store the data in a database that can scale horizontally, and process the data either as it arrives or in batches, so that your user experience is good.**