
Traditional Data Lake vs AWS Lake Formation

I have been setting up data lakes for clients, where we load data from on-premises or other sources into S3 (the data lake). We then create an AWS Glue catalog over the raw data to define schemas.
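As a rough sketch of the cataloging step described above, the following builds the parameters for a Glue crawler over a raw S3 prefix. The crawler name, role ARN, database name, and bucket path are all hypothetical placeholders, and the function that actually calls AWS assumes `boto3` is installed and credentials are configured.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start it (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works offline

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders.
crawler_request = build_crawler_request(
    name="raw-sales-crawler",
    role_arn="arn:aws:iam::111122223333:role/ExampleGlueCrawlerRole",
    database="raw_zone",
    s3_path="s3://example-raw-bucket/sales/",
)
```

Calling `create_and_start_crawler(crawler_request)` would then populate the `raw_zone` database with the table schemas the crawler infers.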

The next step is to use either EMR or AWS Glue for data cleansing, then load the transformed data into RDS, Redshift, or S3 as the final target.

The jobs can be scheduled using Data Pipeline, Glue jobs, or AWS Lambda event triggers, depending on the use case and the service used.
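For the Lambda event trigger option, a minimal sketch might look like the following: an S3 upload event invokes the handler, which starts a Glue job for each new object. The job name and the `--source_path` argument are assumptions made for illustration, not something the setup above prescribes.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "cleanse-raw-data"  # hypothetical Glue job name


def job_arguments_from_event(event):
    """Turn an S3 event payload into one Glue argument dict per object."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        runs.append({"--source_path": f"s3://{bucket}/{key}"})
    return runs


def lambda_handler(event, context):
    """Lambda entry point: start one Glue job run per uploaded object."""
    import boto3  # available by default in the Lambda Python runtime

    glue = boto3.client("glue")
    run_ids = []
    for arguments in job_arguments_from_event(event):
        response = glue.start_job_run(JobName=GLUE_JOB_NAME, Arguments=arguments)
        run_ids.append(response["JobRunId"])
    return {"started": run_ids}


# A trimmed-down example of the event shape S3 sends to Lambda.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-raw-bucket"},
                "object": {"key": "incoming/2024/orders+file.csv"},
            }
        }
    ]
}
sample_arguments = job_arguments_from_event(sample_event)
```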

Analysts and other users are given access to the required data / S3 buckets through IAM, whether for QuickSight visualizations, for querying with Athena, Drill, etc., or for ML applications in SageMaker.
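To make the IAM-based approach concrete, here is a small sketch that generates a read-only policy document for one prefix of a bucket. The bucket and prefix names are hypothetical, and a real setup would need one such policy (or statement) per role and per dataset.

```python
import json


def build_read_policy(bucket, prefix):
    """Return an IAM policy document granting read access to one S3 prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing only the objects under the given prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow reading the objects themselves.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


# Hypothetical bucket and prefix.
policy = build_read_policy("example-curated-bucket", "/sales/")
print(json.dumps(policy, indent=2))
```

This sketch only covers S3. Athena users typically also need Glue catalog read permissions and write access to a query results location, which is left out here.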

My question is: how is AWS Lake Formation different from the traditional data lake described above?

Can I say that AWS Lake Formation makes all of the above services (S3, Glue Catalog, the ETL code generator in Glue, the job scheduler, etc.) available in a single window, with some more advanced security for users and data (record / column level) that can be configured from within the Lake Formation console?

Is there anything else that makes Lake Formation stand out from a traditional cloud-based data lake?

Thanks

Your understanding is correct. Lake Formation is essentially a permissions model over the Glue Catalog that allows close integration with the other AWS data lake tools (Athena, S3, Glue, EMR, etc.), plus some additional features like Blueprints (for syncing data from an RDBMS to S3), Jobs (for ETL), and Crawlers (for data discovery).

Lake Formation makes permission management for the "user" IAM roles in your environment easier by letting you manage them centrally through the Lake Formation UI and API. Instead of updating individual IAM/bucket policies each time a role needs new access, you onboard a single "service" IAM role that has bucket access, and then grant database/table/column-level access to the user IAM roles that need it.
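A minimal sketch of that flow with the Lake Formation API, assuming `boto3` and configured credentials for the part that talks to AWS: register the bucket once with the service role, then grant column-level `SELECT` to a user role. All ARNs, database, table, and column names are hypothetical.

```python
def build_register_request(bucket_arn, service_role_arn):
    """Arguments for register_resource: the one-time onboarding of a bucket."""
    return {
        "ResourceArn": bucket_arn,
        "RoleArn": service_role_arn,
        "UseServiceLinkedRole": False,
    }


def build_column_grant(user_role_arn, database, table, columns):
    """Arguments for grant_permissions: SELECT on specific columns only."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": user_role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_requests(register_request, grants):
    """Send the requests to Lake Formation (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builders above work offline

    lakeformation = boto3.client("lakeformation")
    lakeformation.register_resource(**register_request)
    for grant in grants:
        lakeformation.grant_permissions(**grant)


# Hypothetical identifiers throughout.
register_request = build_register_request(
    "arn:aws:s3:::example-curated-bucket",
    "arn:aws:iam::111122223333:role/ExampleLakeServiceRole",
)
analyst_grant = build_column_grant(
    "arn:aws:iam::111122223333:role/ExampleAnalystRole",
    database="curated_zone",
    table="orders",
    columns=["order_id", "order_date", "total"],
)
```

Giving another role access is then one more `build_column_grant` call rather than another round of IAM and bucket policy edits.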

The user roles essentially assume the service role to perform their operations (it might not be a literal assume-role, since this part is an AWS black box). So Lake Formation saves you the hassle of managing permissions for all user IAM roles through a mess of IAM/bucket policies.

It also makes it easier to share data with resources in other accounts, if your setup requires it.
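For the cross-account case, a grant can name an AWS account ID as the principal instead of a role. The sketch below only builds the request; the account IDs, database, and table are hypothetical, and including `PermissionsWithGrantOption` is an illustrative choice so that the consuming account's administrator can pass the access on to its own roles.

```python
def build_cross_account_grant(consumer_account_id, owner_account_id, database, table):
    """Arguments for grant_permissions that share a table with another account."""
    if not (consumer_account_id.isdigit() and len(consumer_account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit strings")
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {
            "Table": {
                "CatalogId": owner_account_id,  # the account that owns the catalog
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": ["SELECT"],
        "PermissionsWithGrantOption": ["SELECT"],
    }


# Hypothetical account IDs and table.
share_request = build_cross_account_grant(
    consumer_account_id="444455556666",
    owner_account_id="111122223333",
    database="curated_zone",
    table="orders",
)
# With boto3 and credentials available, this could be sent with:
#   boto3.client("lakeformation").grant_permissions(**share_request)
```

Depending on how the accounts are organized, the consuming account may also need to accept a resource share before the table becomes visible there.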

AWS Lake Formation is primarily a permission control layer coupled with AWS Glue: together they provide a catalog with permission control on top. Lake Formation gives you a reprieve from managing IAM permissions and instead provides its own fine-grained permission control using simple, database-like grants.

Lake Formation still has some challenges integrating with some data services such as EMR (it requires additional IAM policies). But overall, Lake Formation together with S3 and Glue ETL provides everything needed to build a data lake.

Lake Formation could still benefit from an improved UI and better data discovery.

You can use Lake Formation to implement a traditional-style data lake, or to make your data lakes more modular and support them across multiple AWS accounts.
