
How to pass external Spark packages to AWS Glue?

I'd like to read, for example, GCP BigQuery tables in AWS Glue. I know that in Spark it is possible to declare dependencies for connecting to specific data sources. How can I do that within the AWS Glue environment and pass such dependencies?

In Glue it is possible to start a Spark Session like this

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("my-app") \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.1') \
    .getOrCreate()

So, for example, via the config() method it is possible to pass the spark.jars.packages parameter to the Spark session and specify the Maven package to use (in this example, the one used to connect to Google BigQuery).
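Once the session is created with the connector on the classpath, reading a table typically looks like the sketch below. The project, dataset and table names are placeholders, and it assumes GCP authentication (e.g. a service-account key) has been configured separately:

# Read a BigQuery table through the spark-bigquery connector;
# "my-project.my_dataset.my_table" is a placeholder table reference
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()

df.show(5)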

But this is not enough: it is also necessary to upload the jar package to S3 and then provide this S3 path to the Glue job as the Jar lib path / Dependent jars path.

It is also worth setting the --user-jars-first: "true" parameter for the Glue job, so that the jars you supply take precedence over the ones bundled with Glue.
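Both settings can also be supplied programmatically when the job is created. Below is a minimal sketch using boto3; the job name, IAM role, bucket and S3 paths are placeholders:

import boto3

glue = boto3.client("glue")

# Placeholder names and S3 paths - adjust to your account
glue.create_job(
    Name="bigquery-read-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Dependent jars path: the connector jar previously uploaded to S3
        "--extra-jars": "s3://my-bucket/jars/spark-bigquery-with-dependencies_2.11-0.18.1.jar",
        # Prioritize the user-supplied jars on the classpath
        "--user-jars-first": "true",
    },
    GlueVersion="2.0",
)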
