
Spark Timeout Error while converting Glue DynamicFrame to Spark DataFrame

We have a Glue job that reads data from a table in an Aurora DB. We are using the call below to read from Aurora:

df = self.glue_context.create_dynamic_frame.from_options(
            connection_type="custom.jdbc",
            connection_options={
                "className": self.jdbc_driver_name,
                "url": self.aurora_url,
                "user": self.db_username,
                "password": self.db_password,
                "query": query,
                "hashexpression": hash_expression,
                "hashpartitions": hash_partition,
            },
        )
      

We are trying to convert it to a Spark DataFrame so we can persist the data fetched from the table:

    from pyspark.sql.functions import col  # needed for col()

    targetdf = df.toDF()
    targetdf = targetdf.select(col("col1").alias("col1"),
                               col("col2").alias("col2")
                               ).repartition(int(partitions))
    targetdf.persist()

The data returned from the Aurora DB is very large (a few million rows) and is held in the Glue DynamicFrame. When we try to convert the DynamicFrame to a Spark DataFrame, it throws a timeout error. It works fine for small amounts of data (~50k records). Can someone suggest what might be going wrong, or whether there is a better way to handle this scenario?

Error:

    Traceback (most recent call last):
  File \"/tmp/code.py\", line 857, in <module>
    init()
  File \"/tmp/code.py\", line 853, in init
    main(args, glue_context, spark, current_path, bucket_name, file_path, fixed_path, i_output, final_output_path)
  File \"/tmp/code.py\", line 362, in main
    brk_df = fetch_from_aurora(args, glue_context, source_table, hash_expression_aurora, target_query)
  File \"/tmp/code.py\", line 274, in fetch_from_aurora
    df = intermtntdf.toDF()
  File \"/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py\", line 148, in toDF
    return DataFrame(self._jdf.toDF(self.glue_ctx._jvm.PythonUtils.toSeq(scala_options)), self.glue_ctx)
  File \"/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py\", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 63, in deco
    return f(*a, **kw)
  File \"/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py\", line 328, in get_return_value
    format(target_id, \".\", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o135.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, 
most recent failure: Lost task 0.3 in stage 8.0 (TID 11, 10.156.19.74, executor 12):
 ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 648979 ms
Driver stacktrace:
\tat org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
\tat scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
\tat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
\tat 

Most probably you are hitting an OOM (out of memory) condition on the executors, which is the most common cause of the heartbeat timeout error, since toDF() is an expensive operation.

From my point of view, you could try the following:

  • Break the Aurora fetch into small batches in some kind of loop (and don't forget to unpersist the data after each iteration); a sketch of this idea follows the list below;
  • Increase the size of the Glue workers and their maximum DPUs (I believe 10 is the limit);
  • Tweak the executor max memory; you can find some guidance in this question: AWS Glue executor memory limit
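
A minimal sketch of the batching idea from the first bullet, assuming the source table has a numeric key column (hypothetically named id) with known bounds; glue_context, the JDBC connection values, and the process_batch step stand in for the corresponding pieces of the original job, so treat this as illustrative rather than the job's actual code:

    from pyspark.sql.functions import col

    batch_size = 500_000                   # rows per slice; tune to worker memory
    min_id, max_id = 1, 10_000_000         # assumed known bounds of the key column

    for lower in range(min_id, max_id + 1, batch_size):
        upper = lower + batch_size - 1
        # Bound the query so each DynamicFrame holds only one slice of the table
        batch_query = f"SELECT col1, col2 FROM my_table WHERE id BETWEEN {lower} AND {upper}"

        dyf = glue_context.create_dynamic_frame.from_options(
            connection_type="custom.jdbc",
            connection_options={
                "className": jdbc_driver_name,
                "url": aurora_url,
                "user": db_username,
                "password": db_password,
                "query": batch_query,
                "hashexpression": "id",    # keep each JDBC read parallel
                "hashpartitions": "10",
            },
        )

        batch_df = dyf.toDF().select(col("col1"), col("col2"))
        batch_df.persist()
        process_batch(batch_df)            # hypothetical: write out or aggregate the slice
        batch_df.unpersist()               # free executor memory before the next slice

For the memory suggestion, the linked question discusses passing Spark settings such as executor memory / memory overhead through the Glue job's --conf parameter and choosing a larger worker type; whether those settings are honoured depends on the Glue version, so verify it for your environment.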
