Synapse Pipeline - How do I specify identity for Pipeline Run? (SP/UAMI etc)

I am working with Synapse Spark Pools in a controlled corporate environment. I have limited permission to query AAD, but I can create UAMIs and assign them to resources.

When I access my Synapse workspace I can create a Spark Job Definition to read some data from ADLS. Looking at the Apache Spark Applications list under the Monitor tab, I can see that these jobs use my identity (tim.davies@work.com) as the 'Submitter', and since I have given myself rx (read/execute) access to the data store, these succeed.
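For reference, the Spark Job Definition in question boils down to something like the following PySpark sketch (the storage account, container and path here are placeholders, not the real ones):

```python
# Minimal sketch of a Spark job reading from ADLS Gen2 over abfss://.
# In Synapse, authentication to the abfss endpoint is handled by the platform:
# run interactively it uses AAD passthrough (the logged-in user's identity),
# so it only succeeds if the submitting identity has rx access on the data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-from-adls").getOrCreate()

df = spark.read.parquet(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/some/path/"
)
df.show(10)
```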

Now if I create a Pipeline and configure it to run my Spark Job Definition, it fails with an authorisation error. Going back to the Apache Spark Applications list under Monitor, I see that my Pipeline has a different identity used as Submitter, which would explain why it is not authorised to access the data.

Firstly, I'm not sure which identity is now being used as Submitter; I don't recognise the UUID as either my Synapse Workspace SAMI or UAMI (but I can't query AAD for more info).

More generally, it occurs to me that I would probably like to be able to assign explicit UAMIs for my Pipelines to run under. Is this possible? Or is there a different model for managing this?

As I understand it, the ask here is how to read data from ADLS from a Spark job. Since you have access to the ADLS yourself, that part works fine. I think you will have to set up permissions for the Synapse Workspace on the ADLS, and then it should work fine.
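As an illustration, a sketch of that grant using the azure-mgmt-authorization SDK is below, assuming the built-in "Storage Blob Data Reader" role is enough for read access; all IDs are placeholders, and the same assignment can be made in the Portal (the storage account's IAM blade) or with `az role assignment create`:

```python
# Sketch: grant the Synapse workspace's managed identity the built-in
# "Storage Blob Data Reader" role on the storage account. All IDs below are
# placeholders for illustration.
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
# Object ID of the workspace's managed identity, from its Identity blade
workspace_principal_id = "<workspace-sami-object-id>"
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
)
# Resource ID of the built-in "Storage Blob Data Reader" role definition
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    "/roleDefinitions/2a2b9908-6ea1-4ae2-8e65-a410df84e7d1"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),  # assignment names are new GUIDs
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id=workspace_principal_id,
        principal_type="ServicePrincipal",  # managed identities are service principals
    ),
)
```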

A slightly slow update to this, but I've arrived at something of an answer in terms of understanding, if not quite a solution. It will be useful to share here for anyone following or looking into the same questions.

First of all, when accessing the Synapse Workspace through the Portal/UI, the actionable identity used by Notebooks or a standalone 'Apache Spark Job Definition' is the identity of the user that is logged in (via 'AAD Passthrough'). This is great for user experience, especially in Notebooks, and you just need to make sure that you as an individual have personal access to any data sources you use. In some cases, where your user identity doesn't have this access, you could make use of a Workspace Linked Service identity instead, but not always! (keep reading)

Once you switch to using Pipelines, however, the identity used is the System Assigned Managed Identity (SAMI) of the workspace, which is created and assigned at resource creation. This is OK, but it is important to understand the granularity, i.e. it is the Workspace that has access to resources, not individual Pipelines. Therefore if you want to run Pipelines with different levels of access, you will need to deploy them to segregated Synapse Workspaces (with distinct SAMIs).

One aside on this is the identity of the 'Submitter' that I mentioned in my original question, which is visible under the Monitor tab of the Synapse workspace for all Apache Spark applications. When running as the user (e.g. Notebooks), this submitter ID is my AAD username, which is straightforward. However, when running as a pipeline, the Submitter ID is 'ee20d9e7-6295-4240-ba3f-c3784616c565', and I mean literally this same UUID for EVERYONE. It turns out this is the ID of ADF as an enterprise application. Not very useful, compared to putting the Workspace SAMI in here for example, but that's what it is in case anyone else is drifting down that rabbit hole!

You can create and assign an additional User Assigned Managed Identity (UAMI) to the Workspace, but this will not be used by an executing pipeline. The UAMI can be used by a Workspace Linked Service, but that has some of its own limitations (mentioned below). Also, my experience is that a UAMI assigned at workspace creation will not be correctly 'associated' with the Workspace until I manually create a second UAMI in the portal. I haven't gone deep into this, as it turns out UAMIs are no good to me, but it seems like a straightforward bug.

Now my specific use case is running Apache Spark Applications in Synapse Pipelines, and the straightforward way to make this work is to make sure the Workspace SAMI has access to the required resources (e.g. via a role assignment like the sketch above), and you're good to go. If you just want to make it work, then do this and stop here; but if you want to look a little deeper, carry on...

The suggestion in some of the Microsoft documentation is that you should be able to use a Workspace Linked Service within a Spark Application in order to get access to resources. However, this doesn't work: I've been discussing it with Microsoft, and they have confirmed the same and are investigating. So at this point it's worth noting the date (02/02/2023, handily unambiguous for American readers ;-)), because the issue may later be resolved. But right now your only option in your Spark code is to fall back on the user/workspace identities.
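For the record, the documented pattern looks roughly like the sketch below (the linked service name and path are placeholders, and `spark` is the session Synapse provides); per the above, treat it as "documented but not currently working":

```python
# What the docs describe for routing Spark storage access through a workspace
# linked service: point the session at the linked service and swap in the
# linked-service-based OAuth token provider. Placeholder names throughout.
spark.conf.set("spark.storage.synapse.linkedServiceName", "<linked-service-name>")
spark.conf.set(
    "fs.azure.account.oauth.provider.type",
    "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider",
)

df = spark.read.parquet(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/some/path/"
)
```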

Just a thought on why this matters: it is not really about segregation, since any resource running in the Workspace can access any Linked Service. It is really more a question of identity and resource management, i.e. it would be better to separate the identities being used and assigned to resources for access from the resources themselves. In most cases we'd rather do this with groups than individual identities, and if the management processes are long-winded (mine are), then I'd rather not have to repeat them every time I create a resource.

Anyway, that's enough for now; I will update if this changes while I'm still paying attention...
