
SparkException error for a Spark-SQL query

I am encountering the following error when running the Spark-SQL query below.

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 33 tasks (1046.8 MB) is larger than spark.driver.maxResultSize (1024.0 MB)

The source table record counts are:

contrib = 734539064
contrib (with filters 2018) = 683878665
date_dim = 73416

The query is:

select policy,
       order,
       covr_fr,
       covr_to,
       sum(fc_am) - sum(mc_am) as amount
from (select coalesce(cntr.vre2, cntr.cacont_acc) as policy,
             cntr.order,
             date_dim_get_sk.date_dim_sk as pst_dt_sk,
             cntr.covr_fr,
             cntr.covr_to,
             -- fc_am and mc_am are projected here so the outer aggregation can reference them
             cntr.fc_am,
             cntr.mc_am
      from (select * from contrib
            where ind_cd = 'BP'
              and flg in ('001', '004')
              and covr_fr > '2018-01-01') cntr
      join date_dim on date_dim.dt = cntr.pstng_dt
      join date_dim_get_sk on date_dim_get_sk.dt = date_dim.dt_lst_dayofmon) t
group by policy,
         order,
         covr_fr,
         covr_to
having sum(fc_am) - sum(mc_am) > 0

Currently this query fails with the aforementioned error. I have tried caching the contrib table (see the snippet below), but to no avail. Can anyone help me fix this error and tune the query so that it runs? Please let me know if additional information is required.
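
For reference, the caching attempt looked roughly like this (a minimal sketch; the table name comes from the query above, while the SparkSession setup and app name are assumed):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("contrib-aggregation")   // assumed app name
  .enableHiveSupport()
  .getOrCreate()

// Mark the large source table for caching and materialize it
// before running the aggregation query.
spark.table("contrib").cache()
spark.table("contrib").count()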

Thanks

Have you looked at this topic?

It's hard to suggest something based only on the SQL query.
As the simplest workable solution, try increasing spark.driver.maxResultSize:

config("spark.driver.maxResultSize", "0")

0 stands for unlimited
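
For example, you can set it when building the session (a minimal sketch; the app name is assumed and 2g is just an illustrative value):

import org.apache.spark.sql.SparkSession

// Raise the cap on the total size of serialized task results that the
// driver will accept for a single action ("0" disables the limit).
val spark = SparkSession.builder()
  .appName("contrib-aggregation")              // assumed app name
  .config("spark.driver.maxResultSize", "2g")  // or "0" for unlimited
  .getOrCreate()

The same property can also be passed at submit time, e.g. spark-submit --conf spark.driver.maxResultSize=2g. Keep in mind that setting it to 0 removes the safety limit entirely, so the driver can still run out of memory if the collected results are too large.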
