Spark dataframe: When does it materialize?
I have a Spark question:
I have a job that errors out with: 403 Access Denied on S3
The Spark job basically:
I get sporadic errors in step 3, where we do a bunch of transformations. I say sporadic because sometimes I get no errors at all, and other times the error pops up on any one of the functions in step 3.
Wouldn't running a Spark SQL select statement (and storing it as a temp view) on a Glue dynamic frame materialize the data in-memory within the Spark session? For example:
df = glueContext.create_dynamic_frame_from_catalog(args)
df = df.toDF()
df.createOrReplaceTempView("tbl1")
dfnew = spark.sql("select * from tbl1")
dfnew.createOrReplaceTempView("tbl2")
# ...step 3: transformations on tbl2 (this is where the error happens)
Is my understanding correct that tbl1 has materialized in-memory in the Spark session, but tbl2 is still lazily stored? If so, then if I run a Spark SQL statement on tbl2, it will materialize by querying tbl1, not the Glue catalog source tables, correct?
How can I ensure that in the above script the LF tables are not accessed again after loading them into a dynamic frame, given that the upstream data is continuously updated?
Your understanding of Spark SQL views is not correct.
Spark SQL views are lazily evaluated and don't actually materialize until you call an action. In fact, NONE of the lazily evaluated parts (called transformations in Spark terminology) are materialized until and unless you call an action. Spark just builds a DAG in the backend out of all the transformations you have defined so far, and materializes all of it when you call an action.
df.createOrReplaceTempView("tbl1")       # lazily evaluated
dfnew = spark.sql("select * from tbl1")  # lazily evaluated
dfnew.createOrReplaceTempView("tbl2")    # lazily evaluated
dfnew.show()  # action call --> materializes all the transformations done so far
The error you are getting is most likely due to permissions while reading from or writing to a particular S3 location.
I hope this answers the first half of your question. It can be explained better if you share what is happening in the transformations, whether you are using any actions during those transformations, or, best of all, the stacktrace of the error, to get a more definitive answer.
Also, if you are using Spark 3.0 or higher, you can materialize your transformations by using the noop write format.
df.write.mode("overwrite").format("noop").save()
You simply specify it as the write format; it will materialize the query and execute all the transformations, but it will not write the result anywhere.