Data Governance solution for Databricks, Synapse and ADLS gen2
I'm new to data governance, so forgive me if the question lacks some information.
We're building a data lake and enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for our ETL processing, data science, ML and QA activities.
We already have about a hundred input tables and 25 TB per year. We expect more in the future.
The business has a strong preference for cloud-agnostic solutions. Still, they are okay with Databricks since it's available on both AWS and Azure.
What is the best Data Governance solution for our stack and requirements?
I haven't used any data governance solutions yet. I like the AWS Data Lake solution, since it provides basic functionality out of the box. AFAIK, Azure Data Catalog is outdated, because it doesn't support ADLS gen2.
After some very quick googling I found three options:
1. Privacera
2. Immuta
3. Apache Ranger with Apache Atlas
Currently I'm not even sure if the 3rd option has full support for our Azure stack. Moreover, it will require much more development (infrastructure definition) effort. So are there any reasons I should look in the Ranger/Atlas direction?
What are the reasons to prefer Privacera over Immuta and vice versa?
Are there any other options I should evaluate?
From a Data Governance perspective we have done only the following things:
I am currently exploring Immuta and Privacera, so I can't yet comment in detail on the differences between the two. So far, Immuta has given me the better impression with its elegant policy-based setup.
Still, there are ways to solve some of the issues you mentioned above without buying an external component:
1. Security
For row-level security (RLS), consider using Table ACLs and giving access only to certain Hive views.
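The view-based approach can be sketched roughly as follows. This is a minimal illustration, not a definitive setup: it only generates the SQL text for a filtering view that relies on the Databricks SQL function `is_member()`. The helper name `build_rls_view_sql` and the table, column and group names are hypothetical, and the inputs are assumed to come from trusted configuration rather than from end users.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def _check_identifier(name):
    """Reject anything that is not a plain (optionally dotted) SQL identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def _quote(value):
    """Quote a string literal for SQL by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_rls_view_sql(view_name, source_table, filter_column,
                       group_to_value, admin_group):
    """Build a CREATE VIEW statement that filters rows by group membership.

    Members of admin_group see every row; members of each group in
    group_to_value see only rows where filter_column equals the mapped value.
    """
    predicates = [f"is_member({_quote(admin_group)})"]
    for group, value in sorted(group_to_value.items()):
        predicates.append(
            f"(is_member({_quote(group)}) AND "
            f"{_check_identifier(filter_column)} = {_quote(value)})"
        )
    where_clause = "\n   OR ".join(predicates)
    return (
        f"CREATE OR REPLACE VIEW {_check_identifier(view_name)} AS\n"
        f"SELECT *\n"
        f"FROM {_check_identifier(source_table)}\n"
        f"WHERE {where_clause}"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_rls_view_sql(
        "sales_rls", "sales", "region",
        {"emea_analysts": "EMEA", "apac_analysts": "APAC"},
        "data_admins",
    ))
```

With Table ACLs enabled, users would then be granted access to the view only, not to the underlying table.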
To get access to data inside ADLS, look at enabling credential passthrough on clusters. Unfortunately, that disables Scala.
You still need to set up permissions on Azure Data Lake Gen 2, which is an awful experience when granting permissions on existing child items.
Please avoid creating dataset copies with column/row subsets, as data duplication is never a good idea.
2. Lineage
3. Data quality
4. Data life cycle management
One option is to use native data lake storage lifecycle management. That's not a viable option for data stored in Delta/Parquet formats, since it operates on whole files rather than on rows.
If you use the Delta format, you can more easily apply retention or pseudonymize data.
As a second option, imagine that you have a table with information about all datasets (dataset_friendly_name, path, retention time, zone, sensitive_columns, owner, etc.). Your Databricks users use a small wrapper to read/write:
DataWrapper.Read("dataset_friendly_name")
DataWrapper.Write("destination_dataset_friendly_name")
It's then up to you to implement the logging and data loading behind the scenes. In addition, you can skip sensitive_columns and act on the retention time (both are available in the dataset info table). This requires quite some effort.
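To make the wrapper idea more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the class and field names, the `event_date` column used for retention, and the use of lists of dicts in place of Spark DataFrames. In a real Databricks setup the `loader` and `writer` callables would wrap the actual read and write calls, and the registry would come from the dataset info table.

```python
import logging
from dataclasses import dataclass, field
from datetime import date, timedelta

logger = logging.getLogger("data_wrapper")


@dataclass
class DatasetInfo:
    """One row of the (hypothetical) dataset info table."""
    path: str
    retention_days: int
    sensitive_columns: set = field(default_factory=set)
    owner: str = "unknown"


class DataWrapper:
    """Resolves friendly names to paths, logs access, applies simple policies."""

    def __init__(self, registry, loader, writer):
        self.registry = registry  # dataset_friendly_name -> DatasetInfo
        self.loader = loader      # callable(path) -> iterable of dict rows
        self.writer = writer      # callable(path, rows) -> None

    def read(self, name, user, today=None, date_column="event_date"):
        info = self.registry[name]
        today = today or date.today()
        cutoff = today - timedelta(days=info.retention_days)
        logger.info("read dataset=%s path=%s user=%s", name, info.path, user)
        rows = []
        for row in self.loader(info.path):
            # Skip rows that are older than the retention window.
            if row[date_column] < cutoff:
                continue
            # Drop sensitive columns before handing data to the caller.
            rows.append({k: v for k, v in row.items()
                         if k not in info.sensitive_columns})
        return rows

    def write(self, name, rows, user):
        info = self.registry[name]
        logger.info("write dataset=%s path=%s user=%s", name, info.path, user)
        self.writer(info.path, list(rows))


if __name__ == "__main__":
    # In-memory stand-in for the data lake, for illustration only.
    store = {"/lake/raw/calls": [
        {"event_date": date(2020, 6, 1), "msisdn": "456", "duration": 20},
    ]}
    registry = {"calls": DatasetInfo("/lake/raw/calls", 90, {"msisdn"})}
    wrapper = DataWrapper(registry, store.__getitem__, store.__setitem__)
    print(wrapper.read("calls", user="alice", today=date(2020, 6, 30)))
```

Because every read and write goes through one place, the same log lines could later feed lineage or audit reporting, but that part is left out of this sketch.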
Hopefully you find something useful in my answer. It would be interesting to know which path you took.
To better understand option #2 that you cited for data governance on Azure, here is a how-to tutorial demonstrating the experience of applying RLS on Databricks; a related Databricks video demo; and other data governance tutorials.
Full disclosure: My team produces content for data engineers at Immuta and I hope this helps save you some time in your research.
Azure Purview is a new service and it would fit your data governance needs well. It is currently (2020-12-04) in public preview. It contains features you are looking for in your question, e.g. data lineage, and works well with the Azure services you are using (Synapse, Databricks, ADLSg2).
Purview is not a cloud-agnostic solution. It exposes the Apache Atlas API, so some core capabilities and integrations could be run in any cloud. I would still categorize Purview as an Azure-specific solution.
Purview can manage hybrid data, e.g. data on-premises or in other clouds. This way it is agnostic about where your data resides. If you need to have some data or use cases outside Azure, Purview will be able to manage these data assets too.
I saw that data quality features are on the Purview roadmap and will be available later. Other governance topics, e.g. policies, will also be covered later.
More info on Purview here: https://azure.microsoft.com/en-us/services/purview/