

Spark on Kubernetes (non EMR) with AWS Glue Data Catalog

I am running Spark jobs on EKS, and these jobs are submitted from Jupyter notebooks.

We have all our tables in an S3 bucket, and their metadata sits in the Glue Data Catalog.

I want to use the Glue Data Catalog as the Hive metastore for these Spark jobs. I can see that this is possible when Spark runs on EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html

but is it possible with Spark running on EKS?

I have seen this code released by AWS: https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore but I can't tell whether patching the Hive jar is necessary for what I'm trying to do. I also need a hive-site.xml file to connect Spark to the metastore; how can I get this file from the Glue Data Catalog?
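From the EMR documentation, my understanding is that hive-site.xml is not something you export from Glue but something you write yourself: the integration comes down to pointing the Hive metastore client factory at the Glue client. A minimal sketch of what I believe the file would look like (the factory class is the one the EMR integration uses; it is provided by the Glue client jars from the repo above, which must be on the classpath):

    <configuration>
      <property>
        <!-- Assumption: the same factory class the EMR/Glue integration uses;
             it comes from the Glue catalog client jars, not from Glue itself -->
        <name>hive.metastore.client.factory.class</name>
        <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
      </property>
    </configuration>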

I found a solution for this.

I created a new Spark image following these instructions: https://github.com/viaduct-ai/docker-spark-k8s-aws
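Roughly, the build looks like this (a sketch only; the registry name is a placeholder, and the repo's README documents the exact build arguments and supported Spark/Hadoop versions):

    git clone https://github.com/viaduct-ai/docker-spark-k8s-aws
    cd docker-spark-k8s-aws
    # Build the image and push it to your own registry
    # (<your-registry> is a placeholder; see the repo's README for build args)
    docker build -t <your-registry>/spark-k8s-aws:latest .
    docker push <your-registry>/spark-k8s-aws:latest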

and finally, in my job YAML file, I added some configuration:

sparkConf:
   ...
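    # Route both s3:// and s3a:// URIs through the S3A filesystem implementation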
    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
    spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
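Note that the S3A settings above only cover reading the table data from S3. If table metadata does not resolve through Glue with the image alone, the catalog can presumably be wired up the same way EMR does it, passed through with the spark.hadoop prefix (an assumption based on the EMR integration; the prebuilt image may already configure this):

    sparkConf:
        ...
        # Assumption: enable Hive catalog support and point the Hive metastore
        # client factory at the Glue client (the class the EMR integration uses;
        # it must be on the classpath, which the image build above provides)
        spark.sql.catalogImplementation: "hive"
        spark.hadoop.hive.metastore.client.factory.class: "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"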
