
Data Governance solution for Databricks, Synapse and ADLS gen2

Original post: 2020-05-11 22:20:36. Tags: azure, architecture, databricks, data-lake, azure-data-catalog

I'm new to data governance, so forgive me if the question lacks some information.

Objective

We're building a data lake and enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for our ETL processing, data science, ML and QA activities.

We already have about a hundred input tables and 25 TB of data per year. We expect more in the future.

The business has a strong preference for cloud-agnostic solutions. Still, they are okay with Databricks since it's available on both AWS and Azure.

Question

What is the best data governance solution for our stack and requirements?

My workarounds

I haven't used any data governance solutions yet. I like the AWS Data Lake solution, since it provides basic functionality out of the box. AFAIK, Azure Data Catalog is outdated, because it doesn't support ADLS gen2.

After some very quick googling I found three options:

1. Databricks Privacera
2. Databricks Immuta
3. Apache Ranger & Apache Atlas

Currently I'm not even sure whether the 3rd option fully supports our Azure stack. Moreover, it would require a much bigger development (infrastructure definition) effort. So are there any reasons I should look in the Ranger/Atlas direction?

What are the reasons to prefer Privacera over Immuta, and vice versa?

Are there any other options I should evaluate?

What is already done

From a data governance perspective we have done only the following things:

Defined data zones inside ADLS
Applied encryption/obfuscation to sensitive data (due to GDPR requirements)
Implemented Row-Level Security (RLS) at the Synapse and Power BI layers
Built a custom audit framework for logging what was persisted and when

Things to be done

Data lineage and single source of truth. Only 4 months after the start, understanding the dependencies between data sets has already become a pain point. The lineage information is stored in Confluence, where it's hard to maintain and keep updated in multiple places. Even now it's outdated in some places.
Security. Business users may do some data exploration in Databricks notebooks in the future. We need RLS for Databricks.
Data life cycle management.
Maybe other data governance related topics, such as data quality.

3 answers

I am currently exploring Immuta and Privacera, so I can't yet comment in detail on the differences between the two. So far, Immuta has given me a better impression with its elegant policy-based setup.

Still, there are ways to solve some of the issues you mentioned above without buying an external component:

1. Security

For RLS, consider using Table ACLs and giving access only to certain Hive views.
For getting access to data inside ADLS, look at enabling credential passthrough on clusters. Unfortunately, you then lose Scala.
You still need to set up permissions on Azure Data Lake Gen 2, which is an awful experience when granting permissions on existing child items.
Please avoid creating dataset copies with column/row subsets, as data duplication is never a good idea.
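
To make the view-based RLS idea above a bit more concrete, here is a minimal sketch that only generates the SQL for a row-filtering view. It assumes the Databricks dynamic view function is_member() is available on your cluster; the view, table, column and group names are all made up for illustration, and the function itself (build_row_filtered_view) is a hypothetical helper, not part of any library.

```python
def build_row_filtered_view(view_name, source_table, region_column, group_to_region):
    """Return a CREATE VIEW statement that shows each group only its own rows.

    Assumption: identifiers and values come from trusted configuration,
    because they are interpolated into the SQL text without escaping.
    """
    predicates = [
        f"(is_member('{group}') AND {region_column} = '{region}')"
        for group, region in sorted(group_to_region.items())
    ]
    where_clause = "\n   OR ".join(predicates)
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT * FROM {source_table}\n"
        f"WHERE {where_clause}"
    )


# Hypothetical names, for illustration only.
sql = build_row_filtered_view(
    "silver.customers_by_region",
    "silver.customers",
    "region",
    {"analysts_north": "NORTH", "analysts_south": "SOUTH"},
)
print(sql)
```

In a notebook you would pass the generated statement to spark.sql(...) and then use Table ACLs to grant business users access to the view only, not to the underlying table.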

2. Lineage

One option would be to look into Apache Atlas & Spline. Here is one example of how to set this up: https://medium.com/@reenugrewal/data-lineage-tracking-using-spline-on-atlas-via-event-hub-6816be0fd5c7
Unfortunately, Spline is still under development, and even reproducing the setup mentioned in the article is not straightforward. The good news is that Apache Atlas 3.0 has many available definitions for Azure Data Lake Gen 2 and other sources.
In a few projects, I ended up creating custom logging of reads/writes (it seems you went down this path too). Based on these logs, I created a Power BI report to visualize the lineage.
Consider using Azure Data Factory for orchestration. With a proper ADF pipeline structure, you get high-level lineage, which helps you see dependencies and rerun failed activities. You can read a bit more here: https://mrpaulandrew.com/2020/07/01/adf-procfwk-v1-8-complete-pipeline-dependency-chains-for-failure-handling/
Take a look at Marquez: https://marquezproject.github.io/marquez/ . It is a small open-source library that has some nice features, including data lineage.
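
The custom read/write logging approach mentioned above could look roughly like the sketch below. It assumes each log record carries a job name, an action ("read" or "write") and a dataset name; that record layout, the function names and the sample datasets are my own assumptions, not an existing framework.

```python
from collections import defaultdict


def lineage_edges(log_records):
    """Derive (source, target) dataset pairs from per-job read/write records."""
    reads, writes = defaultdict(set), defaultdict(set)
    for rec in log_records:
        if rec["action"] == "read":
            reads[rec["job"]].add(rec["dataset"])
        elif rec["action"] == "write":
            writes[rec["job"]].add(rec["dataset"])
    edges = set()
    for job, targets in writes.items():
        for target in targets:
            # Everything a job read is treated as an input of everything it wrote.
            for source in reads.get(job, ()):
                edges.add((source, target))
    return sorted(edges)


def upstream(dataset, edges):
    """Return all datasets the given one depends on, directly or indirectly."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    seen, stack = set(), [dataset]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


# Hypothetical audit log entries.
log = [
    {"job": "load_cdr", "action": "read", "dataset": "raw.cdr"},
    {"job": "load_cdr", "action": "write", "dataset": "bronze.cdr"},
    {"job": "std_cdr", "action": "read", "dataset": "bronze.cdr"},
    {"job": "std_cdr", "action": "write", "dataset": "silver.cdr"},
    {"job": "usage", "action": "read", "dataset": "silver.cdr"},
    {"job": "usage", "action": "read", "dataset": "silver.customers"},
    {"job": "usage", "action": "write", "dataset": "gold.usage"},
]
edges = lineage_edges(log)
print(upstream("gold.usage", edges))
```

The resulting edge list is the kind of table a Power BI report can visualize directly, which beats maintaining the same dependencies by hand in Confluence.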

3. Data quality

Investigate Amazon Deequ. It is Scala only so far, but has some nice predefined data quality functions.
In many projects, we ended up writing integration tests that check data quality when moving from bronze (raw) to silver (standardized). Nothing fancy, pure PySpark.
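
For a flavor of what such bronze-to-silver checks might contain, here is a small sketch. It is written in plain Python over lists of dicts so that it runs without a cluster; in PySpark the same checks would be DataFrame counts and aggregations. The three rules (silver never has more rows than bronze, the key is unique, required columns are not null) and all names are assumptions chosen for illustration.

```python
def quality_report(bronze_rows, silver_rows, key, required_columns):
    """Return a list of human-readable failures; an empty list means all checks passed."""
    failures = []
    # Assumption: standardization only filters or deduplicates, it never adds rows.
    if len(silver_rows) > len(bronze_rows):
        failures.append(
            f"silver has {len(silver_rows)} rows but bronze has only {len(bronze_rows)}"
        )
    keys = [row.get(key) for row in silver_rows]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values in key column '{key}'")
    for column in required_columns:
        nulls = sum(1 for row in silver_rows if row.get(column) is None)
        if nulls:
            failures.append(f"{nulls} null value(s) in required column '{column}'")
    return failures


# Hypothetical sample data.
bronze = [
    {"id": 1, "msisdn": "111"},
    {"id": 2, "msisdn": None},
    {"id": 3, "msisdn": "333"},
]
silver_ok = [{"id": 1, "msisdn": "111"}, {"id": 3, "msisdn": "333"}]
silver_bad = [{"id": 1, "msisdn": "111"}, {"id": 1, "msisdn": None}]

print(quality_report(bronze, silver_ok, "id", ["msisdn"]))
print(quality_report(bronze, silver_bad, "id", ["msisdn"]))
```

Running a report like this as an integration test after each bronze-to-silver job lets the pipeline fail early instead of propagating bad rows downstream.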

4. Data life cycle management

One option is to use native data lake storage lifecycle management. That's not a viable option when the data is stored in Delta/Parquet formats.
If you use the Delta format, you can more easily apply retention or pseudonymize data.
A second option: imagine that you have a table with information about all datasets (dataset_friendly_name, path, retention time, zone, sensitive_columns, owner, etc.). Your Databricks users use a small wrapper to read/write:
DataWrapper.Read("dataset_friendly_name")
DataWrapper.Write("destination_dataset_friendly_name")

It's then up to you to implement the logging and data loading behind the scenes. In addition, you can skip sensitive_columns or act based on retention time (both are available in the dataset info table). This requires quite some effort.

You can always expand this table to a more advanced schema and add extra information about pipelines, dependencies, etc. (see 2.4, the Azure Data Factory point)
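
A very reduced sketch of such a wrapper is shown below. It keeps everything in memory so it can run anywhere; a real version would load data with Spark and persist the audit log. The class layout, the lowercase read/write method names, the load_date field on each row and the sample catalog entry are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class DatasetInfo:
    """One row of the hypothetical dataset info table."""
    path: str
    retention_days: int
    sensitive_columns: frozenset = frozenset()


@dataclass
class DataWrapper:
    catalog: dict                                  # dataset_friendly_name -> DatasetInfo
    storage: dict = field(default_factory=dict)    # path -> rows (stand-in for the lake)
    audit_log: list = field(default_factory=list)  # (action, dataset_friendly_name)

    def read(self, name, today=None):
        """Return rows inside the retention window, without sensitive columns."""
        info = self.catalog[name]
        cutoff = (today or date.today()) - timedelta(days=info.retention_days)
        rows = [
            {k: v for k, v in row.items() if k not in info.sensitive_columns}
            for row in self.storage.get(info.path, [])
            if row["load_date"] >= cutoff  # assumes every row carries a load_date
        ]
        self.audit_log.append(("read", name))
        return rows

    def write(self, name, rows):
        """Append rows to the dataset's path and record the write."""
        info = self.catalog[name]
        self.storage.setdefault(info.path, []).extend(rows)
        self.audit_log.append(("write", name))


# Hypothetical catalog entry and data.
wrapper = DataWrapper(
    catalog={"customers": DatasetInfo("silver/customers", 30, frozenset({"msisdn"}))}
)
wrapper.write("customers", [
    {"id": 1, "msisdn": "111", "load_date": date(2020, 5, 1)},
    {"id": 2, "msisdn": "222", "load_date": date(2020, 1, 1)},
])
rows = wrapper.read("customers", today=date(2020, 5, 11))
print(rows)
print(wrapper.audit_log)
```

Because every read and write goes through one place, the audit log here doubles as the input for lineage reporting, and the retention and sensitive-column rules live in the catalog rather than in each notebook.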

Hopefully you find something useful in my answer. It would be interesting to know which path you took.

To better understand option #2 that you cited for data governance on Azure, here is a how-to tutorial demonstrating the experience of applying RLS on Databricks; a related Databricks video demo; and other data governance tutorials.

Full disclosure: My team produces content for data engineers at Immuta, and I hope this helps save you some time in your research.

Azure Purview is a new service and it would fit your data governance needs well. It is currently (2020-12-04) in public preview. It contains the features you are looking for in your question, e.g. data lineage, and works well with the Azure services you are using (Synapse, Databricks, ADLSg2).

Purview is not a cloud-agnostic solution. It exposes the Apache Atlas API, so some core capabilities and integrations could be run in any cloud. I would still categorize Purview as an Azure-specific solution.

Purview can manage hybrid data, e.g. data on-premises or in other clouds. In this way it is agnostic about where your data is. If you need to have some data or use cases outside Azure, Purview will be able to manage these data assets too.

I saw that data quality features are on the Purview roadmap and will be available later. Other governance topics, e.g. policies, will also be covered later.

More info on Purview here: https://azure.microsoft.com/en-us/services/purview/