
How to connect DocumentDB to a Spark application in an EMR instance

I'm getting an error while trying to configure Spark with MongoDB on my EMR instance. Below is the command:

spark-shell --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!@docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" "spark.mongodb.output.collection="ecommerceCluster" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3

I'm a beginner in Spark and AWS. Can anyone please help?

DocumentDB requires a CA bundle to be installed on each node where your Spark executors will launch. So you first need to install the CA certs on each instance. AWS has a guide for this under its Java section, with two bash scripts that make things easier. [1]
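As a rough illustration of what that truststore setup involves (this is a Python sketch, not the contents of the AWS bash scripts): the CA bundle is a file holding several PEM certificates, and each one is imported into a Java truststore under its own alias. The function names, the `rds-ca-N` alias scheme, and the `changeit` password below are hypothetical; the `keytool` flags are the standard `-importcert` ones.

```python
import re
from typing import List

# Matches one PEM-encoded certificate, including its BEGIN/END markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem_bundle(bundle_text: str) -> List[str]:
    """Return each certificate found in a PEM bundle as its own string."""
    return [m.group(0) + "\n" for m in PEM_CERT.finditer(bundle_text)]


def keytool_import_args(cert_path: str, alias: str,
                        truststore: str, storepass: str) -> List[str]:
    """Build the keytool argument list that imports one certificate file."""
    return [
        "keytool", "-importcert", "-noprompt",
        "-file", cert_path,
        "-alias", alias,
        "-keystore", truststore,
        "-storepass", storepass,
    ]


if __name__ == "__main__":
    # Placeholder bundle; a real one comes from the download step in the AWS guide.
    demo_bundle = (
        "-----BEGIN CERTIFICATE-----\nAAA\n-----END CERTIFICATE-----\n"
        "-----BEGIN CERTIFICATE-----\nBBB\n-----END CERTIFICATE-----\n"
    )
    for i, _cert in enumerate(split_pem_bundle(demo_bundle)):
        # Hypothetical file layout: each certificate written to its own file.
        path = f"/tmp/certs/rds-ca-{i}.pem"
        print(" ".join(keytool_import_args(
            path, f"rds-ca-{i}", "/tmp/certs/rds-truststore.jks", "changeit")))
```

Whatever tooling you use, the setup has to run on every node, since any of them may host an executor.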

Once these certs are installed, your Spark command needs to reference the truststore and its password using configuration parameters that you pass to Spark. Here is an example that I ran, and it worked fine.

spark-submit \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>" \
  pytest.py

You can provide the same configuration options to spark-shell as well.
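To make the shape of those options concrete, here is a small Python sketch that assembles the argument list for either launcher. The point it illustrates is that both `-D` flags travel as one space-separated value inside a single `--conf` argument. The `build_command` helper and its parameters are hypothetical names of mine; the connector coordinates and truststore path are the ones from the example above.

```python
import shlex
from typing import List, Optional

CONNECTOR = "org.mongodb.spark:mongo-spark-connector_2.11:2.4.3"


def build_command(launcher: str, truststore: str, password: str,
                  app: Optional[str] = None) -> List[str]:
    """Build argv for spark-submit or spark-shell with truststore options."""
    # Both JVM flags form ONE value, separated by a space.
    jvm_opts = " ".join([
        f"-Djavax.net.ssl.trustStore={truststore}",
        f"-Djavax.net.ssl.trustStorePassword={password}",
    ])
    cmd = [
        launcher,
        "--packages", CONNECTOR,
        "--conf", f"spark.executor.extraJavaOptions={jvm_opts}",
    ]
    if app is not None:
        # spark-submit takes an application file; spark-shell does not.
        cmd.append(app)
    return cmd


if __name__ == "__main__":
    store = "/tmp/certs/rds-truststore.jks"
    # shlex.join quotes the --conf value so it survives a copy into a shell.
    print(shlex.join(build_command("spark-submit", store, "<yourpassword>", "pytest.py")))
    print(shlex.join(build_command("spark-shell", store, "<yourpassword>")))
```

If the driver also opens connections to DocumentDB, it may need the same flags through `spark.driver.extraJavaOptions`; that is an assumption on my part, not something I tested here.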

One thing I did find tricky was that the mongo spark connector doesn't appear to recognize the ssl_ca_certs parameter in the connection string. I removed it to avoid warnings from Spark, since the Spark executors reference the truststore given in the configuration anyway.
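If you copy the connection string from somewhere that includes `ssl_ca_certs`, stripping it by hand is easy enough, but here is a minimal Python sketch of the same cleanup. The `drop_uri_options` helper and the `docdb.example.com` host are hypothetical; the sketch only edits the query part of the string and leaves credentials and host untouched.

```python
from typing import Iterable


def drop_uri_options(uri: str, unwanted: Iterable[str]) -> str:
    """Remove the named options from the query part of a connection string."""
    # Option names are compared case-insensitively.
    blocked = {name.lower() for name in unwanted}
    base, sep, query = uri.partition("?")
    if not sep:
        return uri  # no query string, nothing to strip
    kept = [
        pair for pair in query.split("&")
        if pair and pair.split("=", 1)[0].lower() not in blocked
    ]
    return base + ("?" + "&".join(kept) if kept else "")


if __name__ == "__main__":
    uri = ("mongodb://user:pw@docdb.example.com:27017/"
           "?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0")
    print(drop_uri_options(uri, ["ssl_ca_certs"]))
```

The result can then be passed as the value of `spark.mongodb.output.uri`.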

