
Spark dataframe: When does it materialize?

I have a Spark question:

I have a job that errors out with: 403 Access Denied on S3.

The Spark job basically:

  1. Gets data from LF resource-linked tables in the Glue Catalog
  2. Creates temp views
  3. Runs a bunch of transformations
  4. Stores the data in an external location

I get sporadic errors in step 3, where we are doing a bunch of transformations. I say sporadic because sometimes I get no errors at all, and other times the error pops up on any one of the functions in step 3.

Wouldn't running a Spark SQL select statement (and storing it as a temp view) on a Glue dynamic frame materialize the data in memory within the Spark session? e.g.:

    df = glueContext.create_dynamic_frame_from_catalog(args)
    df = df.toDF()
    df.createOrReplaceTempView("tbl1")
    dfnew = spark.sql("select * from tbl1")
    dfnew.createOrReplaceTempView("tbl2")


...step 3 transformations on tbl2 (this is where the error happens)

Is my understanding correct that tbl1 has been materialized into the Spark session in memory, but tbl2 is still stored lazily? If so, then if I run a Spark SQL statement on tbl2, it will materialize by querying from tbl1, not the Glue Catalog source tables, correct?

How can I ensure, in the above script, that the LF tables are not accessed again after reading them into a dynamic frame, given that the upstream data is continuously updated?

The understanding that you have of Spark SQL views is not correct.

Spark SQL views are lazily evaluated and don't really materialize until you call an action. In fact, NONE of the lazily evaluated parts (also called transformations in Spark terms) are materialized until and unless you call an action.

All Spark does is build a DAG in the backend containing all the transformations you have defined so far, and it materializes all of that only when you call an action.

df.createOrReplaceTempView("tbl1")   # lazily evaluated
dfnew = spark.sql("select * from tbl1")   # lazily evaluated
dfnew.createOrReplaceTempView("tbl2")   # lazily evaluated
dfnew.show()   # action call --> materializes all the transformations done so far
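
If you want to confirm that nothing has executed yet, you can print the query plan; explain() only shows what Spark would do and does not read any data (shown here on the dfnew from the snippet above, purely as an illustration):

    dfnew.explain()       # prints the physical plan; no data is read
    dfnew.explain(True)   # extended output: parsed, analyzed, optimized and physical plans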

The error you are getting is most likely because of permissions while reading from or writing to a particular S3 location.

I hope this answers the first half of your question. It can be explained better if you share what is happening in the transformations, or whether you are using any actions during them; the best way to get a more definitive answer is to share the stack trace of the error.

Also, if you are using Spark 3.0 or higher, you can materialize your transformations by using the noop write format.

df.write.mode("overwrite").format("noop").save()

You can simply specify it as the write format: it will materialize the query and execute all the transformations, but it will not write the result anywhere.
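
An alternative that also works on older Spark versions is to cache the DataFrame and force it with an action; later queries against the temp view are then served from the cached data rather than by going back to the Glue/LF source tables (unless cached partitions are evicted, in which case Spark recomputes them from the source). A minimal sketch, reusing the df from the question:

    df = df.cache()                       # mark the DataFrame for caching (lazy on its own)
    df.count()                            # action: reads the source once and populates the cache
    df.createOrReplaceTempView("tbl1")    # SQL against tbl1 is now served from the cache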

