
AWS Glue Job Flow

I have an ETL job in Glue that processes a very large (300M row) JDBC database table, but I really only need a subset (certain ids) of this table. When I call glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons"), does this command load the entire table? Is there a way to write a custom query to load only the data I need? Or, if I follow this with another command, say Filter or a Spark SQL command on the DataFrame, will that filter the data as it is pulled?

Well, when you run:

glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons")

It only creates a reference to the data (a Glue DynamicFrame, which you can convert to a Spark DataFrame with toDF()).

Spark works with transformations (e.g. filter, map, select) and actions (e.g. collect, count, show). You can read more about it in "How Apache Spark's Transformations And Action works", but basically, your database table is only loaded into memory when an action is called. This is one of many reasons Spark is so powerful and is recommended for datasets of any size.

This PDF shows all available transformations and actions, with some samples using them.
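To make the transformation/action split concrete, here is a toy model in plain Python. It is not Spark: the LazyTable class, its method names, and the sample rows are all made up for illustration. Transformations only record a step, and the source is read only when an action (collect here) runs.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable returning an iterable of dict rows
        self._steps = tuple(steps)   # recorded transformations, not yet executed

    def filter(self, predicate):
        # Transformation: record the step and return a new LazyTable.
        return LazyTable(self._source, self._steps + (("filter", predicate),))

    def select(self, *cols):
        # Transformation: record a projection onto the given columns.
        return LazyTable(self._source, self._steps + (("select", cols),))

    def collect(self):
        # Action: only now is the source read and the recorded steps applied.
        rows = self._source()
        for kind, arg in self._steps:
            if kind == "filter":
                rows = filter(arg, rows)
            else:
                rows = map(lambda r, cols=arg: {c: r[c] for c in cols}, rows)
        return list(rows)


reads = []  # records every time the source is actually touched


def read_rows():
    reads.append("read")
    return iter([
        {"id": 1, "name": "a", "extra": "x"},
        {"id": 2, "name": "b", "extra": "y"},
        {"id": 3, "name": "c", "extra": "z"},
    ])


plan = LazyTable(read_rows).filter(lambda r: r["id"] in {1, 3}).select("id", "name")
reads_before_action = len(reads)  # still 0: building the plan reads nothing
result = plan.collect()           # the action triggers the single read
```

Real Spark goes further than this sketch, since it can also optimize the recorded plan before running it, but the timing of the read is the point being illustrated.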

So yes, you need to apply some steps first, such as:

# from_catalog returns a Glue DynamicFrame; toDF() converts it to a Spark DataFrame
df = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons").toDF()
df = df.filter(YOUR_FILTER).select(SPECIFIC_COLS)

# Calling an action to show the filtered DF
df.show()

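This way, only the specific columns and rows you selected end up in the resulting DataFrame. Whether the filter also runs inside the database, rather than after the rows are fetched, depends on the JDBC source and connector.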
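The question also asks about a custom query. One option, sketched here under assumptions, is to bypass the catalog reader and use Spark's JDBC data source, whose dbtable option accepts a parenthesized subquery, so the database itself returns only the wanted ids. The helper names (jdbc_subset_options, load_subset), the connection URL, and the column names are hypothetical. Credentials (the "user" and "password" options) are omitted, and the alias syntax may need adjusting for your database.

```python
def jdbc_subset_options(url, table, id_column, ids, columns):
    """Build Spark JDBC reader options that push an id filter into the database.

    The parameter names are illustrative; "url" and "dbtable" are option keys
    of Spark's JDBC data source.
    """
    # Accept only integer ids so the generated SQL cannot carry injected text.
    id_list = ", ".join(str(int(i)) for i in ids)
    if not id_list:
        raise ValueError("ids must not be empty")
    col_list = ", ".join(columns)
    subquery = (
        f"(SELECT {col_list} FROM {table} "
        f"WHERE {id_column} IN ({id_list})) AS subset"
    )
    return {"url": url, "dbtable": subquery}


def load_subset(spark, options):
    # spark is an existing SparkSession (in a Glue job: glueContext.spark_session).
    return spark.read.format("jdbc").options(**options).load()


# Placeholder URL and column names, for illustration only.
opts = jdbc_subset_options(
    "jdbc:postgresql://example-host:5432/exampledb",
    "persons",
    "id",
    [1, 2, 3],
    ["id", "name"],
)
```

Calling load_subset(glueContext.spark_session, opts) would then return a DataFrame containing only those rows. Table and column names are interpolated as given, so they should come from trusted code rather than user input.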
