How can I read multiple S3 buckets using Glue?

When using Spark, I can read data from multiple buckets by putting a * wildcard in the prefix. For example, my folder structure is as follows:

s3://bucket/folder/computation_date=2020-11-01/
s3://bucket/folder/computation_date=2020-11-02/
s3://bucket/folder/computation_date=2020-11-03/
etc.

Using PySpark, if I want to read all the data for month 11, I can do:

input_bucket = "MY-BUCKET"  # placeholder bucket name
input_prefix = "MY-FOLDER/computation_date=2020-11-*"  # placeholder prefix with wildcard
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))

How can I achieve the same functionality with Glue? The code below does not seem to work:

input_bucket = "MY-BUCKET"  # placeholder bucket name
input_prefix = "MY-FOLDER/computation_date=2020-11-*"  # placeholder prefix with wildcard
df_glue = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://{}/{}/".format(input_bucket, input_prefix)]
    },
    format="parquet",
    transformation_ctx="df_spark")
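
If the wildcard in "paths" is not being expanded, as the failing call above suggests, one possible workaround is to list every daily prefix explicitly and pass the whole list to Glue. The following is only a sketch: the helper names daily_prefixes and read_month are hypothetical, the bucket and folder are placeholders, and it assumes that a prefix exists for every calendar day of the month.

```python
import calendar


def daily_prefixes(bucket, folder, year, month):
    """Return one S3 prefix per calendar day of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return [
        "s3://{}/{}/computation_date={:04d}-{:02d}-{:02d}/".format(
            bucket, folder, year, month, day
        )
        for day in range(1, days_in_month + 1)
    ]


def read_month(glue_context, bucket, folder, year, month):
    """Read all daily prefixes of one month into a single DynamicFrame.

    glue_context is expected to be an awsglue.context.GlueContext.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            # Explicit list of prefixes instead of a wildcard.
            "paths": daily_prefixes(bucket, folder, year, month),
            "recurse": True,
        },
        format="parquet",
    )


# Placeholder names; this only builds the path list and does not touch S3.
paths = daily_prefixes("MY-BUCKET", "MY-FOLDER", 2020, 11)
print(len(paths))  # 30
print(paths[0])    # s3://MY-BUCKET/MY-FOLDER/computation_date=2020-11-01/
```

Inside a Glue job, read_month(glueContext, "MY-BUCKET", "MY-FOLDER", 2020, 11) would then take the place of the wildcard call. How Glue behaves when one of the listed prefixes is missing is not covered here and would need to be checked.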

I read the files using Spark instead of Glue:

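from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Reuse the Spark session that backs the Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
input_bucket = "MY-BUCKET"  # placeholder bucket name
input_prefix = "MY-FOLDER/computation_date=2020-11-*"  # placeholder prefix with wildcard
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))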
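
Another option that stays within Glue, sketched here under assumptions the original post does not confirm: if the data is registered as a table in the Glue Data Catalog with computation_date as a partition key, the month can be selected with a push-down predicate instead of a wildcard. The helper names month_predicate and read_month_from_catalog are hypothetical, as are the database and table names passed by the caller. The predicate compares ISO-formatted date strings, so it assumes the partition values look like 2020-11-01.

```python
import calendar


def month_predicate(column, year, month):
    """Build a partition predicate that covers every day of one month."""
    last_day = calendar.monthrange(year, month)[1]
    first = "{:04d}-{:02d}-01".format(year, month)
    last = "{:04d}-{:02d}-{:02d}".format(year, month, last_day)
    return "{col} >= '{lo}' and {col} <= '{hi}'".format(
        col=column, lo=first, hi=last
    )


def read_month_from_catalog(glue_context, database, table_name, year, month):
    """Read only the partitions of one month from a catalog table.

    glue_context is expected to be an awsglue.context.GlueContext;
    database and table_name must refer to an existing catalog table.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=month_predicate("computation_date", year, month),
    )


predicate = month_predicate("computation_date", 2020, 11)
print(predicate)
# computation_date >= '2020-11-01' and computation_date <= '2020-11-30'
```

With a predicate like this, Glue should only need to list and read the matching partitions, which is the same effect the 2020-11-* wildcard has in the Spark version.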
