
Slave lost and very slow join in Spark

I joined two dataframes on a common column and then ran the show method:

    df = df1.join(df2, df1.col1 == df2.col2, 'inner')
    df.show()

The join then ran very slowly and finally raised an error: slave lost.

    Py4JJavaError: An error occurred while calling o109.showString.

    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 : ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Slave lost

    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
    at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
    at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
    at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
    at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
    at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
    at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

After some searching, it seems to be a memory-related problem. I then increased the number of partitions to 3000, increased the executor memory, and increased the memory overhead, but still no luck: I got the same slave-lost error. During df.show() I noticed that the shuffle write size of one executor was very large, while the others were much lower. Any clues? Thanks.
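
For reference, a minimal sketch of how those knobs might be set when building the Spark context; the 3000 partitions come from the description above, while the memory sizes and the YARN deployment are placeholder assumptions, not values known to fix this job:

    # Sketch of the tuning described above (Spark 1.x style API, YARN assumed).
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .set("spark.sql.shuffle.partitions", "3000")        # partitions used by the join's shuffle
            .set("spark.executor.memory", "8g")                  # executor heap size (placeholder)
            .set("spark.yarn.executor.memoryOverhead", "2048"))  # extra container memory in MB (placeholder)

    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)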

If using Scala, try:

    val df = df1.join(df2, Seq("column name"))

If using PySpark:

    df = df1.join(df2, ["columnname"])

Or:

    df = df1.join(df2, df1.columnname == df2.columnname)
    display(df)

If you want to do the same in PySpark with SQL:

    # Register both dataframes as temporary views and query them with SQL.
    df1.createOrReplaceTempView("left_test_table")
    df2.createOrReplaceTempView("right_test_table")

    left = spark.sql("SELECT * FROM left_test_table")
    right = spark.sql("SELECT * FROM right_test_table")

    left.join(right).drop(left["name"]).show()
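
For completeness, a minimal sketch of expressing the join itself in SQL over those temp views; the column names col1 and col2 are carried over from the question and are assumptions about the real schema:

    # Hypothetical end-to-end SQL join over the registered views;
    # assumes a SparkSession named `spark` and join columns col1/col2 as in the question.
    joined = spark.sql("""
        SELECT *
        FROM left_test_table l
        JOIN right_test_table r
          ON l.col1 = r.col2
    """)
    joined.show()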
