My cluster has 3 nodes with 8 GB RAM and 2 cores each. I am increasing the executor memory for Spark as follows:
import org.apache.spark.sql.SparkSession

// creating the Spark session
val spark = SparkSession
  .builder()
  .appName(s"${this.getClass.getSimpleName}")
  .config("spark.sql.shuffle.partitions", "9")
  .config("spark.executor.memory", "3g")
  .config("spark.executor.cores", "1")
  .master("local[*]")
  .getOrCreate()
Thus 4 executors with 3 GB of RAM each should launch, each running one task per core.
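For context on how much of that 3 GB is actually available to user code: Spark's unified memory model reserves roughly 300 MB of each executor heap, then gives spark.memory.fraction (0.6 by default) of the remainder to execution and storage; objects created inside a map, such as a buffered HTTP response, must fit in what is left. A quick sanity check in plain Scala (the 300 MB and 0.6 are Spark defaults; the 3 GB heap is the config above):

```scala
// Spark defaults: ~300 MB reserved heap, spark.memory.fraction = 0.6
val heapMb         = 3 * 1024  // spark.executor.memory = 3g
val reservedMb     = 300
val memoryFraction = 0.6

val usableMb  = heapMb - reservedMb                 // 2772 MB
val unifiedMb = (usableMb * memoryFraction).toInt   // execution + storage: 1663 MB
val userMb    = usableMb - unifiedMb                // user objects: 1109 MB

println(s"unified (execution+storage): $unifiedMb MB, user memory: ~$userMb MB")
```

So a ~1 GB response body plus parsing overhead is already close to the ~1.1 GB left for user objects, which is consistent with a heap error.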
The code I am executing is as follows:
import org.apache.spark.storage.StorageLevel
import spark.implicits._

val seq2 = List((125, 0), (125, 125), (125, 250), (125, 375))
val urls = spark.sparkContext.parallelize(seq2).toDF()
val actual_data = urls
  .map(x => HTTPRequestParallel.ds(x.getInt(0).toString, x.getInt(1).toString, t0))
  .persist(StorageLevel.MEMORY_AND_DISK)
val dataframe = spark.read.option("header", "true").json(actual_data)
When I call the 4 web APIs in parallel, each returning around 1 GB of data that gets serialized in one method, I still get a Java heap space error. As I understand it, the API call is synchronous, so the incoming data must be fetched and stored somewhere. Where is that location: the JVM heap of the node, or the executor memory that was assigned?
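To make the question concrete: the fetched bytes live on the heap of whichever JVM runs the task (and note that under local[*] everything runs in the single driver JVM, where spark.executor.memory is generally not applied). One way to avoid materializing a large body as a single in-heap String is to stream it to disk in fixed-size chunks. A minimal, Spark-free sketch with java.io, where a ByteArrayInputStream stands in for the HTTP response stream:

```scala
import java.io.{ByteArrayInputStream, FileOutputStream, InputStream}
import java.nio.file.Files

// Stand-in for an HTTP response stream; a real call would pass the
// connection's InputStream here instead of this 1 MB in-memory buffer.
val body: InputStream = new ByteArrayInputStream(Array.fill[Byte](1024 * 1024)(42))

val tmp = Files.createTempFile("response", ".json").toFile
val out = new FileOutputStream(tmp)
val buf = new Array[Byte](8192) // only 8 KB on the heap at any moment

Iterator
  .continually(body.read(buf))
  .takeWhile(_ != -1)
  .foreach(n => out.write(buf, 0, n))
out.close()

println(s"wrote ${tmp.length()} bytes to ${tmp.getPath}")
```

The file written this way can then be handed to spark.read.json as a path, instead of holding the whole payload in one task's memory.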
Increase spark.sql.shuffle.partitions to 1000 or more; it should resolve the issue. You can also try tuning spark.default.parallelism.
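A sketch of the suggestion above (the keys are real Spark config names; the value 1000 is illustrative, not tuned for this workload):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: higher partition counts spread shuffled data across
// more, smaller tasks, so each task's working set is smaller.
val spark = SparkSession
  .builder()
  .appName("tuned-session")
  .config("spark.sql.shuffle.partitions", "1000") // Dataset/DataFrame shuffles
  .config("spark.default.parallelism", "1000")    // RDD operations
  .master("local[*]")
  .getOrCreate()
```

Keep in mind that a higher partition count only helps when the data is split across many records; a single ~1 GB record still has to fit in one task's memory.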