
Amazon Redshift - Replication - Data load Vs Query Performance Issues

We are in the process of migrating our data warehouse from Oracle to Redshift. Currently we have two Oracle database instances: a primary DW instance that gets data loaded from different sources throughout the day, and a secondary DW instance that replicates the data from the primary. All reporting platforms point to the secondary DW instance. How can we address this in Redshift? Do we need two Redshift instances, one replicating from the other? If we have just one Redshift instance, will the data load overhead affect query performance? Will there be table lock issues?

Appreciate your suggestions. Thanks.

It really depends on how quickly your reporting platforms need access to the data that is loaded throughout the day. If it can wait, then it makes sense to batch load during quiet hours. Since you're using replication in your current setup, I suspect that you require the data to be loaded and available as soon as possible.

In that case, it would make sense to utilise Redshift's Workload Management (WLM) settings. WLM allows you to designate multiple workload groups (queues) and give each one a concurrency level and a share of cluster resources. Using this model, you can ring-fence resources so that queries from your reporting tools and end users get a consistent allocation, while still dedicating a portion of the cluster's query queue and resources to your data loads.

This would also eliminate the need for having two separate database instances to handle loading and serving data.

See here for more detail on WLM in Redshift: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html
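As a rough illustration of the WLM approach described above, here is a minimal Python sketch that builds a manual WLM queue definition as JSON, the format accepted by the `wlm_json_configuration` parameter of a cluster parameter group. The user group names (`etl`, `reporting`), the concurrency levels, and the memory percentages are assumptions chosen for illustration, not recommended values, and `build_wlm_config` is a hypothetical helper.

```python
import json


def build_wlm_config(queues):
    """Serialize manual WLM queue definitions, rejecting over-allocated memory."""
    total = sum(q["memory_percent_to_use"] for q in queues)
    if total > 100:
        raise ValueError(f"memory allocations sum to {total}%, must be <= 100%")
    return json.dumps(queues, indent=2)


queues = [
    {   # data loads: sessions of users in the (assumed) "etl" user group
        "user_group": ["etl"],
        "query_concurrency": 2,
        "memory_percent_to_use": 30,
    },
    {   # reporting tools and end users: (assumed) "reporting" user group
        "user_group": ["reporting"],
        "query_concurrency": 10,
        "memory_percent_to_use": 60,
    },
    {   # default queue (listed last): anything not matched above
        "query_concurrency": 3,
        "memory_percent_to_use": 10,
    },
]

wlm_json = build_wlm_config(queues)
print(wlm_json)
```

The printed JSON would then be set as the `wlm_json_configuration` value for the cluster's parameter group. With a layout like this, a long-running load can only occupy the slots and memory of its own queue, so reporting queries keep their share.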

Never ever read and write from the same instance. Not even in Redshift. In general, any system that forces you to read and write from the same machine reflects poor design.

Since you are asking about Amazon Redshift, I can comfortably assume that you have analytical data. (Redshift, having a columnar architecture, is optimised for reads and not writes. So if you happen to store transactional data on Redshift, I would recommend that you reconsider your decision.)

Before designing any infrastructure for analytical data, we should always consider that:

  1. It'll be voluminous.
  2. It'll be scaled further in the near future.

When you scale, reading and writing from the same machine will be catastrophic. And don't forget the locks. Delete / Truncate will hold exclusive locks on the table. If some other process or user has already acquired this lock, then even the write on that table will fail, messing up the data.

The above reasons might be convincing enough as to why you should not use a single warehouse to both read and write data.

Follow the model below, which is neat and clean, will never interfere, and will ensure you don't face issues of consistency, locks, etc.:

 +
 |
 |
 |  DS 1     +------------+            +------------+
 +---------> |            |            |            |
             |            | AGGREGATES |            |     reads
    DS 2     |   DW 1     +----------> |    DW 2    | +----------->
+----------> |            |            |            |
             |            |            |            |
+----------> +------------+            +------------+
|... DS n
|
+
where DS : Data Source , DW : Data Warehouse

The migration of data from DW 1 --> DW 2 will depend entirely on how frequently you need to refer to fresh data.
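One possible way to move the aggregates from DW 1 to DW 2 is to `UNLOAD` them to S3 from the first cluster and `COPY` them into the second. The answer above does not prescribe a mechanism, so treat this as an assumption. The Python sketch below only builds the two SQL statements; the table names, bucket, prefix, and IAM role ARN are hypothetical placeholders, and the helper functions are illustrative rather than part of any library.

```python
def sql_literal(text):
    """Quote text as a SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def unload_statement(query, s3_prefix, iam_role):
    """Build an UNLOAD statement to run on DW 1 (writes query results to S3)."""
    return (
        f"UNLOAD ({sql_literal(query)})\n"
        f"TO {sql_literal(s3_prefix)}\n"
        f"IAM_ROLE {sql_literal(iam_role)}\n"
        "FORMAT AS PARQUET;"
    )


def copy_statement(table, s3_prefix, iam_role):
    """Build a COPY statement to run on DW 2 (loads the unloaded files)."""
    return (
        f"COPY {table}\n"
        f"FROM {sql_literal(s3_prefix)}\n"
        f"IAM_ROLE {sql_literal(iam_role)}\n"
        "FORMAT AS PARQUET;"
    )


# Hypothetical names, for illustration only.
role = "arn:aws:iam::123456789012:role/example-redshift-role"
prefix = "s3://example-bucket/aggregates/2024-01-01/"
query = (
    "SELECT region, SUM(amount) AS total FROM sales "
    "WHERE status = 'closed' GROUP BY region"
)

unload_sql = unload_statement(query, prefix, role)
copy_sql = copy_statement("sales_by_region", prefix, role)
print(unload_sql)
print(copy_sql)
```

How often such a pair of statements is scheduled is the frequency question raised above: hourly if reports need near-current data, nightly if they can wait.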

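Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.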
