
In PySpark, is there a way to pass credentials as variables into spark.read?

Spark allows us to read directly from Google BigQuery, as shown below:

df = spark.read.format("bigquery") \
  .option("credentialsFile", "googleKey.json") \
  .option("parentProject", "projectId") \
  .option("table", "project.table") \
  .load()

However, having the key saved on the virtual machine isn't a great idea. I have the Google key saved as JSON securely in a credential management tool. The key is read on demand and saved into a variable called googleKey.

Is it possible to pass the JSON into spark.read, or to pass in the credentials as a dictionary?

The other option is credentials. From the spark-bigquery-connector docs:

How do I authenticate outside GCE / Dataproc?

Credentials can also be provided explicitly, either as a parameter or from Spark runtime configuration. They should be passed in as a base64-encoded string directly.

// Globally
spark.conf.set("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")

// Per read/write
spark.read.format("bigquery").option("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")
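
Applied to the question's setup, a minimal sketch (assuming the service-account JSON is already held in a Python string variable called googleKey, read from the credential management tool, and that a SparkSession named spark exists):

import base64

# googleKey holds the service-account JSON as a string; encode it to base64
# as the connector expects, then pass it via the "credentials" option.
credentials_b64 = base64.b64encode(googleKey.encode("utf-8")).decode("utf-8")

df = spark.read.format("bigquery") \
  .option("credentials", credentials_b64) \
  .option("parentProject", "projectId") \
  .option("table", "project.table") \
  .load()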

This is more of a chicken-and-egg situation. If you are storing the credential file in a secret manager (hopefully that's not your credential manager tool), how would you access the secret manager? For that you might need a key, and where would you store that key?

For this, Azure has created managed identities, through which two different services can talk to each other without explicitly providing any keys (credentials).

If you are running from Dataproc, then the node has a built-in service account which you can control at cluster creation. In this case you do not need to pass any credentials/credentialsFile option.
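
For example, on a Dataproc cluster the read works with no credentials option at all, since the node's service account is used implicitly (a sketch reusing the table from the question):

df = spark.read.format("bigquery") \
  .option("parentProject", "projectId") \
  .option("table", "project.table") \
  .load()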

If you are running on another cloud or on-prem, you can use the local secret manager, or implement the connector's AccessTokenProvider, which lets you fully customize how the credentials are created.
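
If the key is fetched on demand from a secret manager, one possible pattern (a sketch assuming Google Secret Manager via the google-cloud-secret-manager client; the secret path is hypothetical) is to read the JSON at runtime, base64-encode it, and pass it through the credentials option shown earlier:

import base64
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
# Hypothetical secret that holds the service-account JSON.
secret_name = "projects/projectId/secrets/bigquery-sa-key/versions/latest"
payload = client.access_secret_version(name=secret_name).payload.data

# base64-encoded string, ready for .option("credentials", ...) as above.
credentials_b64 = base64.b64encode(payload).decode("utf-8")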
