
Need approach on real time analytics

Original post: 2022-09-17 10:29:18
Tags: analytics, reporting, dashboard, data-warehouse, business-intelligence

My team wants to provide dashboards to users (store owners) with KPIs specific to them, e.g. orders placed in their stores, total orders cancelled, total sales, deliveries completed, etc. This dashboard should be updated in near real time (say, a delay of 15 minutes). It will be integrated with the core app where orders are placed.

One route we are assessing is to replicate all required tables to a warehouse (let's say a database for analytics), join the tables, keep the grain at transaction level, and save this massive table to the same warehouse. All these steps should be done within 15 minutes. The dashboard will then calculate KPIs over this massive table. I think this is not the most efficient way.

My inclination is to not warehouse the tables and instead let the dashboards query the live tables directly. The user-level filter applied for the current session will automatically eliminate the majority of records, so it should be fast enough.
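For illustration, the direct-query idea might look roughly like the sketch below. It is only a minimal example under stated assumptions: the `orders` table, its columns, the status values, and the definition of total sales (non-cancelled orders only) are all hypothetical, and an in-memory SQLite database stands in for the live transactional database.

```python
import sqlite3

# Hypothetical schema: one `orders` table standing in for the live tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        store_id INTEGER NOT NULL,
        status   TEXT NOT NULL,   -- assumed values: 'placed', 'cancelled', 'delivered'
        amount   REAL NOT NULL
    );
    -- An index on store_id lets the per-store filter avoid a full table scan.
    CREATE INDEX idx_orders_store ON orders (store_id);
""")
conn.executemany(
    "INSERT INTO orders (store_id, status, amount) VALUES (?, ?, ?)",
    [(1, "delivered", 20.0), (1, "cancelled", 15.0), (1, "placed", 10.0),
     (2, "delivered", 99.0)],
)

def store_kpis(conn, store_id):
    """Compute KPIs for one store; the WHERE clause is the user-level filter."""
    row = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(status = 'cancelled'), 0),
               COALESCE(SUM(status = 'delivered'), 0),
               -- Assumption: cancelled orders do not count towards sales.
               COALESCE(SUM(CASE WHEN status != 'cancelled' THEN amount END), 0.0)
        FROM orders
        WHERE store_id = ?
        """,
        (store_id,),
    ).fetchone()
    names = ("orders_placed", "orders_cancelled", "deliveries_completed", "total_sales")
    return dict(zip(names, row))

kpis = store_kpis(conn, 1)
print(kpis)
```

Whether this stays fast enough in practice depends on how well the real tables are indexed for the per-store filter and on how many dashboards hit the database concurrently.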

Streaming tools like Spark Streaming + Kafka may not help, as I would have to listen to many tables for updates and then aggregate.

Any suggestions on what could be the best approach? Please help.

1 Answer

Doing analytics on your production transactional system is generally a bad idea (which is why we have separate warehouse systems) because:

- It can affect the performance of the transactional processes
- Data structures that effectively support OLTP tend to be very different from those that effectively support OLAP

As you would only load the deltas into your warehouse and calculate your KPIs against them, not against your whole dataset, you shouldn't have performance issues. This is a pretty common pattern.
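A minimal sketch of that delta-loading pattern is below, assuming a hypothetical `orders` table with an `updated_at` column that increases on every change. Two in-memory SQLite databases stand in for the transactional system and the warehouse; the table names, columns, and KPI definitions are illustrative assumptions, not something given in the question.

```python
import sqlite3

src = sqlite3.connect(":memory:")  # stands in for the transactional (OLTP) database
wh = sqlite3.connect(":memory:")   # stands in for the analytics warehouse

src.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY, store_id INTEGER, status TEXT,
    amount REAL, updated_at INTEGER)""")
wh.executescript("""
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY, store_id INTEGER, status TEXT,
        amount REAL, updated_at INTEGER);
    CREATE TABLE store_kpis (
        store_id INTEGER PRIMARY KEY, orders_placed INTEGER,
        orders_cancelled INTEGER, total_sales REAL);
""")

def load_delta(src, wh):
    """Copy rows changed since the last load, then refresh KPIs for affected stores only."""
    # The newest timestamp already in the warehouse serves as the watermark.
    watermark = wh.execute(
        "SELECT COALESCE(MAX(updated_at), 0) FROM fact_orders").fetchone()[0]
    delta = src.execute(
        "SELECT order_id, store_id, status, amount, updated_at "
        "FROM orders WHERE updated_at > ?", (watermark,)).fetchall()
    # INSERT OR REPLACE acts as an upsert keyed on order_id.
    wh.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?, ?)", delta)
    for store_id in {row[1] for row in delta}:
        wh.execute("""
            INSERT OR REPLACE INTO store_kpis
            SELECT store_id, COUNT(*), SUM(status = 'cancelled'),
                   COALESCE(SUM(CASE WHEN status != 'cancelled' THEN amount END), 0.0)
            FROM fact_orders WHERE store_id = ? GROUP BY store_id""", (store_id,))
    wh.commit()
    return len(delta)

# First run: everything is new.
src.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
                [(1, 1, "placed", 20.0, 100), (2, 2, "placed", 50.0, 110)])
first = load_delta(src, wh)

# Later run: one order was cancelled and one new order arrived.
src.execute("UPDATE orders SET status = 'cancelled', updated_at = 200 WHERE order_id = 1")
src.execute("INSERT INTO orders VALUES (3, 1, 'placed', 30.0, 210)")
second = load_delta(src, wh)
```

Each scheduled run (every 15 minutes, say) touches only the changed rows and the stores they belong to, so the dashboard can read the small `store_kpis` table instead of aggregating the full history. The strict `updated_at > watermark` comparison is a simplification: a real pipeline would need to handle rows that share a timestamp with the watermark, as well as deletes.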

But as David mentions, these are really just opinions, as you don't provide enough information and are asking a very broad question.