I'd like to read, for example, GCP BigQuery tables from AWS Glue. I know that in Spark it is possible to declare dependencies for connecting to specific data sources. How can I do that within the AWS Glue environment and pass such dependencies?
In Glue it is possible to start a Spark session like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("my-app") \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.1') \
    .getOrCreate()
So, for example, via the config() method you can provide the Spark session with the spark.jars.packages parameter and specify the Maven package to use (in this example, the connector for Google BigQuery).
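Once the session is created with the connector on the classpath, reading a table follows the standard Spark data source API. A minimal sketch (the project, dataset, and table names below are hypothetical placeholders, and running this requires valid GCP credentials):

```python
# Assumes `spark` was created as above, with the spark-bigquery
# connector jar available on the classpath.
df = (
    spark.read.format("bigquery")
    # Hypothetical table identifier: <project>.<dataset>.<table>
    .option("table", "my-project.my_dataset.my_table")
    .load()
)

df.printSchema()
df.show(5)
```

From here the DataFrame behaves like any other Spark source, so filters and column pruning are pushed down by the connector where possible.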
But this is not enough: it is also necessary to upload the jar package to S3 and then provide that S3 path to the Glue job as the Jar lib path / Dependent jars path parameter.
It is also worth setting the --user-jars-first: "true" parameter for the Glue job, so that the user-supplied jars take precedence over the ones bundled with Glue.
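Putting the steps above together, the jar upload and the job parameters can be wired up with the AWS CLI. A sketch, assuming your own bucket, IAM role, and script location (all names below are placeholders); --extra-jars is the Glue job parameter corresponding to the Dependent jars path field in the console:

```shell
# Upload the connector jar (downloaded from Maven Central) to S3.
aws s3 cp spark-bigquery-with-dependencies_2.11-0.18.1.jar \
    s3://my-bucket/jars/

# Create the Glue job, pointing --extra-jars at the uploaded jar and
# letting user jars take precedence with --user-jars-first.
aws glue create-job \
    --name bigquery-read-job \
    --role MyGlueServiceRole \
    --command Name=glueetl,ScriptLocation=s3://my-bucket/scripts/job.py \
    --default-arguments '{
        "--extra-jars": "s3://my-bucket/jars/spark-bigquery-with-dependencies_2.11-0.18.1.jar",
        "--user-jars-first": "true"
    }'
```

The same two arguments can also be set in the console under the job's security configuration / job parameters section instead of via the CLI.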