

Spark on Kubernetes (non EMR) with AWS Glue Data Catalog

I am running Spark jobs on EKS, and these jobs are submitted from Jupyter notebooks.

We have all our tables in an S3 bucket, and their metadata sits in the Glue Data Catalog.

I want to use the Glue Data Catalog as the Hive metastore for these Spark jobs. I can see that this is possible when Spark runs on EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html

but is it possible with Spark running on EKS?

I have seen this code released by AWS: https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore but I can't tell whether patching the Hive jar is necessary for what I'm trying to do. I also need a hive-site.xml file to connect Spark to the metastore; how can I get this file from the Glue Data Catalog?
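From the EMR documentation, my understanding is that hive-site.xml is not something you export from Glue but something you write yourself: the integration comes down to pointing the Hive metastore client factory at the Glue client. A minimal sketch of what I believe the file would look like (the factory class is the one the EMR integration uses; it is provided by the Glue client jars from the repo above, which must be on the classpath):

    <configuration>
      <property>
        <!-- Assumption: the same factory class the EMR/Glue integration uses;
             it comes from the Glue catalog client jars, not from Glue itself -->
        <name>hive.metastore.client.factory.class</name>
        <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
      </property>
    </configuration>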

I found a solution for this.

I created a new Spark image following these instructions: https://github.com/viaduct-ai/docker-spark-k8s-aws
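Roughly, the build looks like this (a sketch only; the registry name is a placeholder, and the repo's README documents the exact build arguments and supported Spark/Hadoop versions):

    git clone https://github.com/viaduct-ai/docker-spark-k8s-aws
    cd docker-spark-k8s-aws
    # Build the image and push it to your own registry
    # (<your-registry> is a placeholder; see the repo's README for build args)
    docker build -t <your-registry>/spark-k8s-aws:latest .
    docker push <your-registry>/spark-k8s-aws:latest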

and finally, in my job YAML file, I added some configuration:

sparkConf:
   ...
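    # Route both s3:// and s3a:// URIs through the S3A filesystem implementation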
    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
    spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
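Note that the S3A settings above only cover reading the table data from S3. If table metadata does not resolve through Glue with the image alone, the catalog can presumably be wired up the same way EMR does it, passed through with the spark.hadoop prefix (an assumption based on the EMR integration; the prebuilt image may already configure this):

    sparkConf:
        ...
        # Assumption: enable Hive catalog support and point the Hive metastore
        # client factory at the Glue client (the class the EMR integration uses;
        # it must be on the classpath, which the image build above provides)
        spark.sql.catalogImplementation: "hive"
        spark.hadoop.hive.metastore.client.factory.class: "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"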
