
How to iterate many Hive scripts over spark

I have many Hive scripts (about 20-25), each containing multiple queries. I want to run them through Spark so the process finishes faster, since Hive's MapReduce jobs take a long time and executing the same queries from Spark should be much quicker. The code below works for 3-4 files, but it fails when given many files with multiple queries each.

Here is the code. Please help me optimize it if possible.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("yarn").appName("my app").enableHiveSupport().getOrCreate()

// Collect every .hql script in the validation directory
val scripts = new java.io.File("/mapr/tmp/validation_script/").listFiles.filter(_.getName.endsWith(".hql")).toList

for (script <- scripts) {
  scala.io.Source.fromFile(script).getLines()
    .filterNot(_.trim.isEmpty)           // skip blank lines
    .foreach(query => spark.sql(query))  // run each line as one Hive statement
}
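For reference, a hedged sketch (not from the original post) of a more defensive loop: reading each file whole and splitting on ';' tolerates statements that span several lines, and wrapping spark.sql in Try keeps one bad statement from aborting the entire run. It assumes the same spark and scripts values as above.

import scala.util.{Failure, Success, Try}

for (script <- scripts) {
  val text = scala.io.Source.fromFile(script).mkString
  text.split(";").map(_.trim).filter(_.nonEmpty).foreach { stmt =>
    Try(spark.sql(stmt)) match {
      case Success(_)  => ()  // statement ran
      case Failure(ex) => println(s"Statement failed in ${script.getName}: ${ex.getMessage}")
    }
  }
}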

Some of the errors I get look like this:

ERROR SparkSubmit: Job aborted.
org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)

ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 12 (sql at validationtest.scala:67) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)

Running the same code multiple times produces several different kinds of errors.

Here is what one of the HQL files looks like. It is named xyz.hql and contains:

drop table pontis_analyst.daydiff_log_sms_distribution
create table pontis_analyst.daydiff_log_sms_distribution as select round(datediff(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),cast(subscriberActivationDate as date))/7,4) as daydiff,subscriberkey as key from  pontis_analytics.prepaidsubscriptionauditlog
drop table pontis_analyst.weekly_sms_usage_distribution
create table pontis_analyst.weekly_sms_usage_distribution as select sum(event_count_ge) as eventsum,subscriber_key from pontis_analytics.factadhprepaidsubscriptionsmsevent where effective_date_ge_prt < date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) - 1 ) and effective_date_ge_prt >=  date_sub(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),84) group by subscriber_key;
drop table pontis_analyst.daydiff_sms_distribution
create table pontis_analyst.daydiff_sms_distribution as select day.daydiff,sms.subscriber_key,sms.eventsum from  pontis_analyst.daydiff_log_sms_distribution day inner join pontis_analyst.weekly_sms_usage_distribution sms on day.key=sms.subscriber_key
drop table pontis_analyst.weekly_sms_usage_final_distribution
create table pontis_analyst.weekly_sms_usage_final_distribution as select spp.subscriberkey as key, case when spp.tenure < 3 then round((lb.eventsum )/dayDiff,4) when spp.tenure >= 3 then round(lb.eventsum/12,4)end as result from pontis_analyst.daydiff_sms_distribution lb inner join pontis_analytics.prepaidsubscriptionsubscriberprofilepanel spp on spp.subscriberkey = lb.subscriber_key
INSERT INTO TABLE pontis_analyst.validatedfinalResult select 'prepaidsubscriptionsubscriberprofilepanel' as fileName, 'average_weekly_sms_last_12_weeks' as attributeName, tbl1_1.isEqual as isEqual, tbl1_1.isEqualCount as isEqualCount, tbl1_2.countAll as countAll, (tbl1_1.isEqualCount/tbl1_2.countAll)* 100 as percentage from (select tbl1_0.isEqual as isEqual, count(isEqual) as isEqualCount from (select case when round(aal.result)  = round(srctbl.average_weekly_sms_last_12_weeks) then 1 when aal.result is null then 1 when aal.result = 'NULL' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result = '' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks is null then 1 else 0  end as isEqual from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel srctbl left join  pontis_analyst.weekly_sms_usage_final_distribution aal on srctbl.subscriberkey = aal.key) tbl1_0 group by tbl1_0.isEqual) tbl1_1 inner join (select count(*) as countAll from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel) tbl1_2 on 1=1

Your problem is that your code is running out of memory, as shown below:

failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344)
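That allocation failure comes from the direct (off-heap) buffers Netty uses for shuffle fetches. As an illustration only (the values below are assumptions, not part of the original answer; both settings are standard Spark configuration options), you could give the executors more off-heap headroom and shrink the in-flight fetch buffers:

import org.apache.spark.sql.SparkSession

// Illustrative values; tune them to the cluster.
val spark = SparkSession.builder
  .master("yarn")
  .appName("my app")
  .config("spark.executor.memoryOverhead", "2048")  // MiB of off-heap headroom per executor
  .config("spark.reducer.maxSizeInFlight", "24m")   // shuffle-fetch buffer size (default 48m)
  .enableHiveSupport()
  .getOrCreate()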

Although what you are trying to do is not the best way of doing things in Spark, I would suggest removing the in-memory serialization (caching), because it will not help here anyway. You should cache data only when it will be used across multiple transformations; if it is used only once, there is no reason to put it in the cache.
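To illustrate that rule with one of the tables from the scripts above (a minimal sketch, not code from the answer): cache a DataFrame only when more than one action reads it, and release it afterwards.

// Worth caching: df feeds two separate actions below.
val df = spark.table("pontis_analyst.daydiff_sms_distribution")
df.cache()
val rowCount = df.count()                                      // first action populates the cache
val keyCount = df.select("subscriber_key").distinct().count()  // second action reuses the cache
df.unpersist()                                                 // release the cached blocks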
