
How to set Spark RDD StorageLevel in Hive on Spark?

In my Hive on Spark job, I get this error:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

Thanks to this answer (Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?), I suspect my Hive on Spark job has the same problem.

Since Hive translates the SQL into a Hive on Spark job, how do I set this in Hive so that the generated job uses StorageLevel.MEMORY_AND_DISK instead of StorageLevel.MEMORY_ONLY?

Thanks for your help!

You can use CACHE [LAZY] TABLE <table_name> and UNCACHE TABLE <table_name> to manage caching. More details are in the Spark SQL documentation.
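
A minimal sketch of what that looks like when issued through a SparkSession (the table name my_table is just a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cache-table-example")
      .enableHiveSupport()
      .getOrCreate()

    // LAZY defers materialization until the table is first scanned
    spark.sql("CACHE LAZY TABLE my_table")

    // ... queries that reuse my_table ...

    // Drop the cached copy when it is no longer needed
    spark.sql("UNCACHE TABLE my_table")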

If you are using DataFrames, you can use persist(...) to specify the StorageLevel. See the Dataset.persist API documentation.
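
As a rough sketch (the source table name is a placeholder), persisting with MEMORY_AND_DISK looks like this:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("persist-example")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder source; substitute your own table or file
    val df = spark.table("my_table")

    // Spill partitions that do not fit in memory to local disk
    // instead of dropping and recomputing them
    df.persist(StorageLevel.MEMORY_AND_DISK)

    df.count()      // an action materializes the persisted data
    df.unpersist()  // release it when done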

In addition to setting the storage level, you can optimize other things as well. Spark SQL uses a different caching mechanism, columnar storage, which is a more efficient way of caching data (as Spark SQL is schema-aware). There is a set of config properties that can be tuned to manage this caching, described in detail here (this is the latest version's documentation; refer to the documentation of the version you are using); a sketch of setting them follows the list below:

  • spark.sql.inMemoryColumnarStorage.compressed
  • spark.sql.inMemoryColumnarStorage.batchSize
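
For illustration, a sketch of setting those two properties on a SparkSession (the values shown are the documented defaults, not tuning recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("columnar-cache-config")
      .getOrCreate()

    // Compress the in-memory columnar cache based on statistics of the data
    spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")

    // Rows per columnar batch: larger batches improve memory utilization
    // and compression, but risk OOM errors when caching very wide rows
    spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")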
