I am trying to run a batch job in Spark 2 which takes a huge list as input and iterates over the list to perform the processing. The program runs fine for around 8,000 records of the list and then breaks with the following exception:
```
WARN Lost task 0.0 in stage 421079.0 (TID 996338, acusnldlenhww4.cloudapp.net, executor 1): java.io.FileNotFoundException: /data/1/hadoop/yarn/local/usercache/A2159537-MSP01/appcache/application_1497532405817_0072/blockmgr-73dc563c-8ea5-4f2d-adfe-6c60cf3e3968/0d/shuffle_145960_0_0.index.cfb6d5ea-8c7b-41a1-acc3-2c840e7f8998 (Too many open files)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:128)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
(org.apache.spark.scheduler.TaskSetManager)
```
A Neo4j database is used as input: I read 300k nodes from Neo4j and run a for loop over the input RDD.
I tried setting `spark.shuffle.consolidateFiles` to `true` in the SparkConf, but that didn't work.
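For reference, a minimal sketch of that attempt (the app name is hypothetical). Note that `spark.shuffle.consolidateFiles` only applied to the old hash-based shuffle manager; Spark 2.x uses the sort-based shuffle and silently ignores this key, which would explain why it had no effect:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// spark.shuffle.consolidateFiles targeted the old hash-based shuffle
// manager; Spark 2.x ignores unknown/removed configuration keys.
val conf = new SparkConf()
  .setAppName("neo4j-batch-job") // hypothetical app name
  .set("spark.shuffle.consolidateFiles", "true")

val spark = SparkSession.builder.config(conf).getOrCreate()
```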
Increase the ulimit, if possible, to overcome this.
Alternatively, decrease the number of reducers or the number of cores used by each node, though this has some performance impact on your job (see the sketch below).
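A minimal sketch of the reducer knob, assuming an active SparkSession; `pairRdd` and the value 64 are illustrative, not recommendations:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Fewer reducers mean fewer shuffle files opened in parallel per map task.
def limitReducers(spark: SparkSession, pairRdd: RDD[(String, Long)]): RDD[(String, Long)] = {
  spark.conf.set("spark.sql.shuffle.partitions", "64") // DataFrame/SQL shuffles (default: 200)
  pairRdd.reduceByKey(_ + _, 64)                       // RDD shuffles: pass an explicit reducer count
}
```

Fewer concurrent tasks per executor has the same effect on open files; that is controlled by `spark.executor.cores` (or `--executor-cores` at submit time).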
In general, if your cluster has `n` assigned cores and you run a job with `k` reducers, then Spark will open `n * k` files in parallel and start writing. For example, 32 cores and 1,000 reducers means 32,000 files open at once. The default ulimit is 1024, which is too low for large-scale applications.
Use `ulimit -a` to see the current maximum number of open files.
You can raise the limit temporarily with `ulimit -n` in the current shell, or persistently by updating the system configuration files (see the example below):
/etc/sysctl.conf
/etc/security/limits.conf
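As an illustration only — the user name and values here are assumptions to adapt to your cluster:

```
# /etc/security/limits.conf -- per-user open-file limits (example values)
yarn   soft   nofile   65536
yarn   hard   nofile   65536

# /etc/sysctl.conf -- system-wide cap on open file handles
fs.file-max = 2097152
```

The limits.conf change applies to new login sessions; run `sysctl -p` to reload /etc/sysctl.conf.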
I faced the same issue when I applied two `foreachRDD()` calls on the same stream: the first one published events to a Kafka topic and the second one wrote the output to HDFS.
```scala
stream.foreachRDD(rdd => {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  val batchDF = spark.createDataFrame(rdd, batchOutputSchema)

  // Publish to Kafka
  batchDF
    .write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServer)
    .option("topic", "topic_name")
    .save()
})

stream.foreachRDD(rdd => {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  val batchDF = spark.createDataFrame(rdd, batchOutputSchema)

  // Write the output to HDFS
  batchDF
    .write.mode("append")
    .parquet("/path")
})
```
To fix it, I combined the two outputs in the same `foreachRDD()` and applied `cache()` on the resulting DataFrame, so each micro-batch is computed (and shuffled) only once instead of once per output:
```scala
stream.foreachRDD(rdd => {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  val batchDF = spark.createDataFrame(rdd, batchOutputSchema).cache()

  // Write to HDFS
  batchDF
    .write.mode("append")
    .parquet("/path")

  // Publish to Kafka
  batchDF
    .write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServer)
    .option("topic", "topic_name")
    .save()
})
```
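One possible refinement, my assumption rather than part of the original fix: call `batchDF.unpersist()` at the end of the `foreachRDD` block once both writes have finished, so cached batches do not accumulate in executor memory.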