
How does AWS Glue ETL job retrieve data?

I'm new to using AWS Glue and I don't understand how the ETL job gathers the data. I used a crawler to generate my table schema from some files in an S3 bucket and examined the autogenerated script in the ETL job, which is here (slightly modified):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("data", "string", "data", "string")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")

When I run this job, it successfully takes my data from the bucket that my crawler used to generate the table schema and it puts the data into my destination S3 bucket as expected.

My question is this: I don't see anywhere in this script where the data is "loaded", so to speak. I know I point it at the table that was generated by the crawler, but from this doc:

Tables and databases in AWS Glue are objects in the AWS Glue Data Catalog. They contain metadata; they don't contain data from a data store.

If the table only contains metadata, how are the files from the data store (in my case, an S3 bucket) retrieved by the ETL job? I'm asking primarily because I'd like to somehow modify the ETL job to transform identically structured files in a different bucket without having to write a new crawler, but also because I'd like to strengthen my general understanding of the Glue service.

The main thing to understand is that the Glue Data Catalog (databases and tables) is always in sync with Athena, a serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. You can create tables/databases from either the Glue console or the Athena query console.

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")

This line of Glue Spark code does the magic for you: it creates the initial dynamic frame from the Glue Data Catalog source table. Apart from the metadata, schema, and table properties, the catalog table also holds the Location that points to your data store (the S3 location) where your data resides, and that is where the job actually reads the files from.
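
Since the question mentions wanting to transform identically structured files in a different bucket without writing a new crawler, one option is to skip the catalog lookup and read straight from an S3 path with create_dynamic_frame.from_options. A minimal sketch, assuming JSON files; the bucket path below is a placeholder:

datasource_other = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://my-other-bucket/prefix/"]},  # placeholder path
    format = "json",  # assumed format; match it to your files
    transformation_ctx = "datasource_other")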

After the ApplyMapping step has been done, this portion of the code (the data sink) does the actual loading of data into your target cluster/database (in this example, an S3 bucket):

datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")

If you drill down into the AWS Glue Data Catalog, you'll see tables residing under the databases. Clicking on one of these tables exposes its metadata, which shows the S3 folder the current table points to as a result of the crawler run.
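
You can also look up that location programmatically. A small sketch using the boto3 Glue API, reusing the database/table names from the question:

import boto3

# Fetch the catalog table and print the S3 location its StorageDescriptor points to.
glue_client = boto3.client("glue")
response = glue_client.get_table(DatabaseName = "mydatabase", Name = "mytablename")
print(response["Table"]["StorageDescriptor"]["Location"])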

You can still create tables over S3 structured files manually, by adding tables via the Data Catalog option and pointing them to your S3 location.
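
The same manual registration can be scripted instead of clicked through. A rough sketch with the boto3 Glue API, assuming JSON files with a single string column; the table name and S3 path are placeholders:

import boto3

glue_client = boto3.client("glue")
# Register a catalog table that merely points at an existing S3 prefix; no data is copied.
glue_client.create_table(
    DatabaseName = "mydatabase",
    TableInput = {
        "Name": "myothertable",  # hypothetical table name
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": "data", "Type": "string"}],
            "Location": "s3://my-other-bucket/prefix/",  # placeholder S3 location
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"}
        }
    })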

Another way is to use the AWS Athena console to create tables pointing at S3 locations. You would use a regular CREATE TABLE script with the LOCATION field holding your S3 location.
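
For example, a CREATE EXTERNAL TABLE statement only records the schema and the LOCATION; the files stay where they are in S3. A hedged sketch run through boto3 (the table name, column, and buckets are placeholders):

import boto3

athena = boto3.client("athena")
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydatabase.myothertable (
  data string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-other-bucket/prefix/'
"""
# Athena needs an S3 location for query results, even for DDL statements.
athena.start_query_execution(
    QueryString = ddl,
    ResultConfiguration = {"OutputLocation": "s3://my-athena-query-results/"})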
