
How to iterate many Hive scripts over Spark

I have many Hive scripts (around 20-25), and each script contains multiple queries. I want to run these scripts through Spark so that the process finishes faster; the MapReduce jobs that Hive launches take a long time, so executing the same queries through Spark should be much quicker. Below is the code I have written. It works for 3-4 files, but when it is given many files with multiple queries it fails.

Below is the code. Please help me optimize it if possible.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("yarn")
  .appName("my app")
  .enableHiveSupport()
  .getOrCreate()

// Collect all .hql files from the validation-script directory
val filenames = new java.io.File("/mapr/tmp/validation_script/")
  .listFiles
  .filter(_.getName.endsWith(".hql"))
  .toList

// Run every non-empty line of every file as a separate Hive query
for (file <- filenames) {
  scala.io.Source.fromFile(file)
    .getLines()
    .filterNot(_.isEmpty)                 // skip empty lines
    .foreach(query => spark.sql(query))   // each line is executed as one statement
}

Some of the errors I am getting look like this:

ERROR SparkSubmit: Job aborted.
org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)

ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 12 (sql at validationtest.scala:67) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)

I get many different types of errors when I run the same code multiple times.

Below is how one of the HQL files looks. Its name is xyz.hql and it contains:

drop table pontis_analyst.daydiff_log_sms_distribution
create table pontis_analyst.daydiff_log_sms_distribution as select round(datediff(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),cast(subscriberActivationDate as date))/7,4) as daydiff,subscriberkey as key from  pontis_analytics.prepaidsubscriptionauditlog
drop table pontis_analyst.weekly_sms_usage_distribution
create table pontis_analyst.weekly_sms_usage_distribution as select sum(event_count_ge) as eventsum,subscriber_key from pontis_analytics.factadhprepaidsubscriptionsmsevent where effective_date_ge_prt < date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) - 1 ) and effective_date_ge_prt >=  date_sub(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),84) group by subscriber_key;
drop table pontis_analyst.daydiff_sms_distribution
create table pontis_analyst.daydiff_sms_distribution as select day.daydiff,sms.subscriber_key,sms.eventsum from  pontis_analyst.daydiff_log_sms_distribution day inner join pontis_analyst.weekly_sms_usage_distribution sms on day.key=sms.subscriber_key
drop table pontis_analyst.weekly_sms_usage_final_distribution
create table pontis_analyst.weekly_sms_usage_final_distribution as select spp.subscriberkey as key, case when spp.tenure < 3 then round((lb.eventsum )/dayDiff,4) when spp.tenure >= 3 then round(lb.eventsum/12,4)end as result from pontis_analyst.daydiff_sms_distribution lb inner join pontis_analytics.prepaidsubscriptionsubscriberprofilepanel spp on spp.subscriberkey = lb.subscriber_key
INSERT INTO TABLE pontis_analyst.validatedfinalResult select 'prepaidsubscriptionsubscriberprofilepanel' as fileName, 'average_weekly_sms_last_12_weeks' as attributeName, tbl1_1.isEqual as isEqual, tbl1_1.isEqualCount as isEqualCount, tbl1_2.countAll as countAll, (tbl1_1.isEqualCount/tbl1_2.countAll)* 100 as percentage from (select tbl1_0.isEqual as isEqual, count(isEqual) as isEqualCount from (select case when round(aal.result)  = round(srctbl.average_weekly_sms_last_12_weeks) then 1 when aal.result is null then 1 when aal.result = 'NULL' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result = '' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks is null then 1 else 0  end as isEqual from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel srctbl left join  pontis_analyst.weekly_sms_usage_final_distribution aal on srctbl.subscriberkey = aal.key) tbl1_0 group by tbl1_0.isEqual) tbl1_1 inner join (select count(*) as countAll from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel) tbl1_2 on 1=1

Your issue is that your code is running out of memory, as shown below:

failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344)
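This "direct memory" is off-heap memory used for shuffle fetches, so independent of the caching advice below, one hedged mitigation is to give the executors more off-heap headroom. A minimal sketch follows; the values are placeholders rather than tuned recommendations, and on older Spark versions the overhead key is `spark.yarn.executor.memoryOverhead` instead.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: memory sizes below are assumptions, not tuned settings.
// spark.executor.memoryOverhead covers off-heap allocations such as the
// direct buffers mentioned in the "failed to allocate ... direct memory" error.
val spark = SparkSession.builder
  .master("yarn")
  .appName("my app")
  .config("spark.executor.memory", "4g")          // on-heap executor memory (placeholder)
  .config("spark.executor.memoryOverhead", "2g")  // off-heap / direct memory headroom (placeholder)
  .enableHiveSupport()
  .getOrCreate()
```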

What you are trying to do is not the optimal way of doing things in Spark, but I would recommend that you remove the in-memory serialization (caching), as it will not help here anyway. You should cache data only if it is going to be used in multiple transformations. If it is used only once, there is no reason to put the data in cache.
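As a minimal sketch of that rule (table and column names here are made up for illustration, not taken from your scripts): cache a DataFrame only when it feeds more than one action, and release it afterwards.

```scala
// Hypothetical example: "usage" is reused by two actions, so caching pays off.
val usage = spark.sql("SELECT subscriber_key, event_count FROM some_db.some_usage_table")
usage.cache()

val weekly  = usage.groupBy("subscriber_key").sum("event_count")
val overall = usage.count()                        // first action, materializes the cache

weekly.write.saveAsTable("some_db.weekly_summary") // second use reads from the cache
println(s"rows processed: $overall")

// Release the cached blocks once they are no longer needed.
usage.unpersist()
```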
