
AWS Glue: ETL to read S3 CSV files

I want to use an ETL job to read data from S3, since with ETL jobs I can set the number of DPUs and hopefully speed things up.

But how do I do it? I tried:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://pinfare-glue/testing-csv"]}, format = "csv")
outputGDF = glueContext.write_dynamic_frame.from_options(frame = inputGDF, connection_type = "s3", connection_options = {"path": "s3://pinfare-glue/testing-output"}, format = "parquet")

But it appears that nothing is written. My folder looks like this:

[image: S3 folder listing]

What's incorrect? My output S3 location only has a file like: testing_output_$folder$

I believe the issue here is that you have subfolders within the testing-csv folder, and since you did not set recurse to true, Glue is not able to find the files in the 2018-09-26 subfolder (or in any other subfolder).
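The distinction described above can be illustrated with a toy model. This is only a sketch of the idea, not Glue's actual listing logic: the function `list_keys` and the file name `part-0.csv` are hypothetical, and only the `2018-09-26` subfolder name comes from this thread.

```python
def list_keys(keys, prefix, recurse=False):
    """Return the keys under `prefix`; without recurse, only direct children."""
    if not prefix.endswith("/"):
        prefix += "/"
    matches = [k for k in keys if k.startswith(prefix)]
    if recurse:
        return matches
    # A direct child has no further "/" after the prefix.
    return [k for k in matches if "/" not in k[len(prefix):]]


# Hypothetical object key: the only file sits inside a dated subfolder.
keys = ["testing-csv/2018-09-26/part-0.csv"]

non_recursive = list_keys(keys, "testing-csv")            # misses the nested file
recursive = list_keys(keys, "testing-csv", recurse=True)  # finds it
```

In this model the non-recursive listing is empty, which would match a job that reads no input and therefore writes no output.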

You need to add the recurse option as follows:

inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://pinfare-glue/testing-csv"], "recurse": True}, format = "csv")
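For reference, the same read can be wrapped in a small helper. This is a minimal sketch rather than part of the original answer: the names `build_s3_options` and `read_csv_recursive` and the `with_header` parameter are hypothetical, and it assumes the `glue_context` passed in is an `awsglue.context.GlueContext` like the one created in the question's script. It uses the `create_dynamic_frame.from_options` form of the call and the `withHeader` CSV format option.

```python
def build_s3_options(path, recurse=True):
    """Build the connection_options dict for an S3 source (hypothetical helper)."""
    # "recurse": True asks Glue to also read files in subfolders of the path.
    return {"paths": [path], "recurse": recurse}


def read_csv_recursive(glue_context, path, with_header=True):
    """Read CSV files under `path`, including subfolders, into a DynamicFrame.

    Assumes `glue_context` is an awsglue.context.GlueContext.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=build_s3_options(path),
        format="csv",
        format_options={"withHeader": with_header},
    )
```

Inside the job it might be called as `inputGDF = read_csv_recursive(glueContext, "s3://pinfare-glue/testing-csv")`. Building the options as a dict with `"recurse": True` (a colon, not `=`) avoids the syntax error that `"recurse"=True` would raise inside a dict literal.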

Also, regarding your question about crawlers in the comments: they help infer the schema of your data files. In your case a crawler does nothing, since you are creating the DynamicFrame directly from S3.

It worked for me when I removed the recurse setting.
