
How to pass external Spark packages to AWS Glue?

I'd like to read, for example, GCP BigQuery tables in AWS Glue. I know that in Spark it is possible to declare dependencies for connecting to specific data sources. How can I do that within the AWS Glue environment and pass such dependencies?

In Glue it is possible to start a Spark Session like this

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("my-app") \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.1') \
    .getOrCreate()

So, for example, via the config() method it is possible to pass the spark.jars.packages parameter to the Spark session and specify the Maven package to use (in this example, the one used to connect to Google BigQuery).
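Once the session is created with the connector on the classpath, reading a table typically looks like the sketch below. The project, dataset and table names are placeholders, and it assumes GCP authentication (e.g. a service-account key) has been configured separately:

# Read a BigQuery table through the spark-bigquery connector;
# "my-project.my_dataset.my_table" is a placeholder table reference
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()

df.show(5)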

But this is not enough: it is also necessary to upload the jar package to S3 and then provide this S3 path to the Glue job as the Jar lib path / Dependent jars path.

It is also worth setting the --user-jars-first: "true" parameter for the Glue job, so that the jars you supply take precedence over the ones bundled with Glue.
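Both settings can also be supplied programmatically when the job is created. Below is a minimal sketch using boto3; the job name, IAM role, bucket and S3 paths are placeholders:

import boto3

glue = boto3.client("glue")

# Placeholder names and S3 paths - adjust to your account
glue.create_job(
    Name="bigquery-read-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Dependent jars path: the connector jar previously uploaded to S3
        "--extra-jars": "s3://my-bucket/jars/spark-bigquery-with-dependencies_2.11-0.18.1.jar",
        # Prioritize the user-supplied jars on the classpath
        "--user-jars-first": "true",
    },
    GlueVersion="2.0",
)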
