
Slave lost and very slow join in Spark

I joined two dataframes on a common column and then ran the show method:

    df = df1.join(df2, df1.col1 == df2.col2, 'inner')
    df.show()

The join then ran very slowly and finally raised an error: slave lost.

    Py4JJavaError: An error occurred while calling o109.showString.

    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 : ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Slave lost

    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
    at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
    at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
    at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
    at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
    at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
    at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

After some searching, it seems to be a memory-related problem. I then increased the number of partitions to 3000, increased the executor memory, and increased the memory overhead, but still no luck: I got the same slave-lost error. During df.show() I noticed that the shuffle write size of one executor was very large, while the others were much lower. Any clues? Thanks.
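
For reference, a minimal sketch of how those knobs might be set when building the Spark context; the 3000 partitions come from the description above, while the memory sizes and the YARN deployment are placeholder assumptions, not values known to fix this job:

    # Sketch of the tuning described above (Spark 1.x style API, YARN assumed).
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .set("spark.sql.shuffle.partitions", "3000")        # partitions used by the join's shuffle
            .set("spark.executor.memory", "8g")                  # executor heap size (placeholder)
            .set("spark.yarn.executor.memoryOverhead", "2048"))  # extra container memory in MB (placeholder)

    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)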

If using Scala, try:

    val df = df1.join(df2, Seq("column name"))

If using PySpark:

    df = df1.join(df2, ["columnname"])

Or:

    df = df1.join(df2, df1.columnname == df2.columnname)
    display(df)

If you want to do the same in PySpark with SQL:

    # Register both dataframes as temporary views and query them with SQL.
    df1.createOrReplaceTempView("left_test_table")
    df2.createOrReplaceTempView("right_test_table")

    left = spark.sql("SELECT * FROM left_test_table")
    right = spark.sql("SELECT * FROM right_test_table")

    left.join(right).drop(left["name"]).show()
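
For completeness, a minimal sketch of expressing the join itself in SQL over those temp views; the column names col1 and col2 are carried over from the question and are assumptions about the real schema:

    # Hypothetical end-to-end SQL join over the registered views;
    # assumes a SparkSession named `spark` and join columns col1/col2 as in the question.
    joined = spark.sql("""
        SELECT *
        FROM left_test_table l
        JOIN right_test_table r
          ON l.col1 = r.col2
    """)
    joined.show()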
