
In PySpark, is there a way to pass credentials as variables into spark.read?

Spark allows us to read directly from Google BigQuery, as shown below:

df = spark.read.format("bigquery") \
  .option("credentialsFile", "googleKey.json") \
  .option("parentProject", "projectId") \
  .option("table", "project.table") \
  .load()

However, having the key saved on the virtual machine isn't a great idea. I have the Google key saved as JSON securely in a credential management tool. The key is read on demand and saved into a variable called googleKey.

Is it possible to pass the JSON into spark.read, or to pass in the credentials as a dictionary?

The other option is credentials. From the spark-bigquery-connector docs:

How do I authenticate outside GCE / Dataproc?

Credentials can also be provided explicitly, either as a parameter or from Spark runtime configuration. They should be passed in as a base64-encoded string directly.

// Globally
spark.conf.set("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")

// Per read/write
spark.read.format("bigquery").option("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")
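
Applied to the question's setup, a minimal sketch (assuming the service-account JSON is already held in a Python string variable called googleKey, read from the credential management tool, and that a SparkSession named spark exists):

import base64

# googleKey holds the service-account JSON as a string; encode it to base64
# as the connector expects, then pass it via the "credentials" option.
credentials_b64 = base64.b64encode(googleKey.encode("utf-8")).decode("utf-8")

df = spark.read.format("bigquery") \
  .option("credentials", credentials_b64) \
  .option("parentProject", "projectId") \
  .option("table", "project.table") \
  .load()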

This is more of a chicken-and-egg situation. If you are storing the credential file in a secret manager (hopefully that's not your credential manager tool), how would you access the secret manager? For that you might need a key, and where would you store that key?

For this, Azure has created managed identities, through which two different services can talk to each other without explicitly providing any keys (credentials).

If you are running from Dataproc, then the node has a built-in service account which you can control at cluster creation. In this case you do not need to pass any credentials/credentialsFile option.
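
For example, on a Dataproc cluster the read works with no credentials option at all, since the node's service account is used implicitly (a sketch reusing the table from the question):

df = spark.read.format("bigquery") \
  .option("parentProject", "projectId") \
  .option("table", "project.table") \
  .load()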

If you are running on another cloud or on-prem, you can use the local secret manager, or implement the connector's AccessTokenProvider, which lets you fully customize how the credentials are created.
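
If the key is fetched on demand from a secret manager, one possible pattern (a sketch assuming Google Secret Manager via the google-cloud-secret-manager client; the secret path is hypothetical) is to read the JSON at runtime, base64-encode it, and pass it through the credentials option shown earlier:

import base64
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
# Hypothetical secret that holds the service-account JSON.
secret_name = "projects/projectId/secrets/bigquery-sa-key/versions/latest"
payload = client.access_secret_version(name=secret_name).payload.data

# base64-encoded string, ready for .option("credentials", ...) as above.
credentials_b64 = base64.b64encode(payload).decode("utf-8")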
