
PySpark awaitResult error in DataFrame inner join

Running standalone spark-2.3.0-bin-hadoop2.7 inside a Docker container.

  • df1 = 5 rows
  • df2 = 10 rows
  • Both data sets are very small.

    df1 schema: DataFrame[id: bigint, name: string]
    df2 schema: DataFrame[id: decimal(12,0), age: int]
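
For reference, a minimal sketch of how two DataFrames with these schemas could be built; the sample rows, column values and the SparkSession setup are assumptions, only the schemas come from above:

    from decimal import Decimal
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, LongType,
                                   StringType, DecimalType, IntegerType)

    spark = SparkSession.builder.appName("join-repro").getOrCreate()

    # df1: id is bigint (LongType), name is string
    df1 = spark.createDataFrame(
        [(1, "alice"), (2, "bob")],
        StructType([StructField("id", LongType()),
                    StructField("name", StringType())]))

    # df2: id is decimal(12,0), age is int (decimal columns need Decimal values)
    df2 = spark.createDataFrame(
        [(Decimal(1), 30), (Decimal(2), 40)],
        StructType([StructField("id", DecimalType(12, 0)),
                    StructField("age", IntegerType())]))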

Inner Join

df3 = df1.join(df2, df1.id == df2.id, 'inner')

df3 schema: DataFrame[id: bigint, name: string, age: int]

While executing df3.show(5), the following error occurs:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/apache/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 466, in collect
    port = self._jdf.collectToPython()
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/apache/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o43.collectToPython.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
        at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)

Tried setting the broadcast timeout to -1 as per this suggestion, but got the same error:

conf = SparkConf().set("spark.sql.broadcastTimeout","-1")
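
The question does not show how conf was passed to Spark, so the following is only a sketch of one way to wire it into the session; the second setting (spark.sql.autoBroadcastJoinThreshold) is a commonly suggested workaround for BroadcastExchangeExec/awaitResult failures, included for reference, although in this case neither setting fixed the error:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Apply the broadcast timeout when building the session
    conf = SparkConf().set("spark.sql.broadcastTimeout", "-1")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Alternative workaround: disable broadcast joins entirely,
    # forcing a sort-merge join instead
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")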

It turned out I was using an incompatible version of the JRE with Spark 2.3.

The error was resolved after updating the JRE to openjdk-8-jre in the Docker image.
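
A quick way to confirm which Java runtime the container exposes to the PySpark driver (assuming java is on the PATH; Spark 2.3 needs a Java 8 runtime, i.e. a 1.8.x version string):

    import subprocess

    # `java -version` writes to stderr, so capture and print that stream
    out = subprocess.run(["java", "-version"], stderr=subprocess.PIPE)
    print(out.stderr.decode())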
