
Website Clickstream flow + customer 360 using AWS Kinesis Firehose

We are trying to implement a clickstream flow for our e-commerce site on AWS. The clickstream will capture all actions performed by anonymous users. Anonymous users are tracked via a UUID that is generated on their first visit and stored in a cookie. We used the AWS example here to arrive at a candidate architecture like the diagram below:

[architecture diagram]

Now, two questions:

  1. Different pages in the e-commerce site produce different clickstream data. For example, on the item view page we would also like to send item-related info such as itemId, and on the checkout page we would like a few order-related fields tied to the clickstream data. Should we have separate Firehose delivery streams for different pages to support custom clickstream data, or should we send a generic clickstream record (with possible null values for some attributes) to a single Firehose delivery stream?

  2. At some point our anonymous users become identified (e.g. they log in, so we know their User_ID), and we would like to link the {UUID, User_ID} pair to build a customer 360 view. Should we consider a separate stream + separate S3 bucket for tracking UUID-to-User_ID mappings? Should we then use Athena to show aggregated customer 360 reports? Should we aggregate the data and create a customer dimension in Redshift? What would be a good solution for this?
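To make question 1 concrete: one common pattern is a single generic event envelope with a page-specific `context` payload, so one delivery stream (and one table schema) can carry every page type. A minimal sketch; all field names here are illustrative assumptions, not from the post:

```python
import json
import time
import uuid


def build_click_event(anon_id, page_type, context=None):
    """Build a generic clickstream record; page-specific fields live in `context`."""
    return {
        "event_id": str(uuid.uuid4()),
        "anon_id": anon_id,            # UUID from the visitor's cookie
        "page_type": page_type,        # e.g. "item_view", "checkout"
        "ts": int(time.time() * 1000),
        "context": context or {},      # page-specific payload; empty elsewhere
    }


# The item page adds item info, the checkout page adds order info,
# yet both fit the same schema and the same delivery stream.
item_event = build_click_event("c0ffee", "item_view", {"item_id": "SKU-123"})
checkout_event = build_click_event("c0ffee", "checkout", {"order_total": 42.5})
record = json.dumps(item_event) + "\n"  # newline-delimited JSON for Firehose
```

With boto3, the serialized record would be sent via `firehose.put_record(DeliveryStreamName=..., Record={"Data": record.encode()})`; the trailing newline keeps the records Firehose concatenates into one S3 object splittable by Athena.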
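For question 2, the mapping can be captured as its own small event type emitted whenever a visitor logs in, with the UUID-to-User_ID join happening downstream. A sketch of the idea; the join is done in-memory purely for illustration (in practice it would run in Athena or Redshift over the S3 data), and all names are assumptions:

```python
import time


def build_identity_link(anon_id, user_id):
    """Emit a small mapping event when an anonymous visitor logs in."""
    return {"event_type": "identity_link", "anon_id": anon_id,
            "user_id": user_id, "ts": int(time.time() * 1000)}


def resolve_user_ids(click_events, link_events):
    """Stitch anonymous clicks to known users via the UUID -> User_ID map."""
    mapping = {e["anon_id"]: e["user_id"] for e in link_events}
    for ev in click_events:
        # user_id stays None for visitors who never identified themselves
        yield {**ev, "user_id": mapping.get(ev["anon_id"])}


clicks = [{"anon_id": "c0ffee", "page_type": "item_view"},
          {"anon_id": "f00d", "page_type": "home"}]
links = [build_identity_link("c0ffee", "user-42")]
resolved = list(resolve_user_ids(clicks, links))
```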

Regards, Lina

[Update]: Is the following diagram an acceptable solution for the question?

[updated architecture diagram]

You should make this decision based on how you intend to access the data. Given the rate at which clickstream data grows, if you want to generate meaningful insights with reasonable response times and cost, you will want to make use of data partitioning. Read more about this here.
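To make the partitioning point concrete: Firehose writes S3 objects under a UTC date prefix (`YYYY/MM/DD/HH/` by default), and if your Athena/Spectrum tables are partitioned on those date components, a query that filters on date only scans the matching prefixes. A sketch of that key scheme:

```python
from datetime import datetime, timezone


def s3_partition_key(prefix, epoch_seconds):
    """Build an S3 key prefix matching Firehose's default UTC date layout,
    so date-filtered queries can prune whole prefixes instead of scanning
    the entire bucket."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{prefix}/{t:%Y/%m/%d/%H}/"


key = s3_partition_key("clickstream", 1584000000)  # 2020-03-12 08:00 UTC
```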

To do that reliably, you will have to use multiple Kinesis streams.

The only reason to choose not to use multiple streams is cost. But given that you will be using this for a clickstream application, on a website with active users, the volume of incoming events should be enough to keep the shards effectively utilized.
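As a rough sizing sketch for the shard-utilization point (this applies to Kinesis Data Streams, where a shard ingests up to 1 MiB/s or 1,000 records/s; the traffic numbers below are made up for illustration):

```python
import math


def shards_needed(events_per_sec, avg_event_bytes):
    """Estimate shard count from the two per-shard ingest limits:
    1 MiB/s of data or 1,000 records/s, whichever binds first."""
    by_throughput = events_per_sec * avg_event_bytes / (1024 * 1024)
    by_count = events_per_sec / 1000
    return max(1, math.ceil(max(by_throughput, by_count)))


# e.g. 5,000 events/s of ~2 KiB each is throughput-bound, not count-bound:
n = shards_needed(5000, 2048)
```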

Disclaimer, personal opinion: I would suggest that you move this to Kinesis Firehose, so that you have the flexibility to start loading the data into Redshift with minimal process changes at a later stage, while also backing the data up to S3 for cold storage. Given the volume, Athena might not be a good choice for performing analytical queries on the data. You can look at using Redshift external tables, where the data still lies on S3. As for the cost of the Redshift cluster itself, you can now pause the cluster. Read the announcement here.

To address the updated architecture diagram you added: you can skip Glue entirely. Kinesis can load the data directly to S3, and you can define external tables with Redshift Spectrum.

The general approach is to load the data into Redshift as well as back it up to S3. On Redshift you can then periodically delete old data (say, older than a year). This balances cost against performance, since queries are faster against the data that lives in Redshift.

As for transformations, you can use Lambda functions directly with Kinesis Firehose. Read more about this here.
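For reference, a Firehose transformation Lambda receives base64-encoded records and must return each one with the same `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and base64-encoded `data`. A minimal Python sketch; the enrichment itself is a placeholder:

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose data-transformation Lambda: decode, enrich, re-encode."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # placeholder enrichment step
            data = base64.b64encode((json.dumps(payload) + "\n").encode())
            output.append({"recordId": record["recordId"],
                           "result": "Ok",
                           "data": data.decode()})
        except (ValueError, KeyError):
            # Malformed records go back unchanged, flagged as failed
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
    return {"records": output}


# Simulate a Firehose invocation locally:
raw = base64.b64encode(json.dumps({"page_type": "item_view"}).encode()).decode()
result = lambda_handler({"records": [{"recordId": "1", "data": raw}]}, None)
```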

Edit 1: Added the opinion around using Redshift and why it will be useful and cost-effective.

Edit 2: Added details around simplifying the newly proposed architecture.

