
spark (java) - Too many open files

I am trying to run a batch job in Spark 2 which takes a huge list as input and iterates over the list to perform the processing. The program executes fine for around 8,000 records of the list and then fails with the following exception:

WARN Lost task 0.0 in stage 421079.0 (TID 996338, acusnldlenhww4.cloudapp.net, executor 1): java.io.FileNotFoundException: /data/1/hadoop/yarn/local/usercache/A2159537-MSP01/appcache/application_1497532405817_0072/blockmgr-73dc563c-8ea5-4f2d-adfe-6c60cf3e3968/0d/shuffle_145960_0_0.index.cfb6d5ea-8c7b-41a1-acc3-2c840e7f8998 (Too many open files)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
        at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:128)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
     (org.apache.spark.scheduler.TaskSetManager)

A Neo4j database is used as input. I am reading 300k nodes from Neo4j and running a for loop on the input RDD.

I tried setting spark.shuffle.consolidateFiles to true in SparkConf, but that didn't work.

To overcome this, increase the ulimit if possible.

Alternatively, decrease the number of reducers or the number of cores used by each node, though this has some performance impact on your job (a sketch follows the explanation below).

In general, if your cluster has:

assigned cores = `n`

and you run a job with:

reducers = `k`

then Spark will open `n * k` files in parallel and start writing.
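
As a rough sketch of both knobs in a Spark 2 job (the property values are illustrative placeholders, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-tuning-sketch")     // placeholder app name
      .set("spark.executor.cores", "2")        // fewer cores per executor => fewer concurrent shuffle writers (n)
      .set("spark.default.parallelism", "64")  // fewer reducers (k) by default for RDD shuffles
    val sc = new SparkContext(conf)

    // The reducer count can also be set per shuffle, e.g.:
    // rdd.reduceByKey(_ + _, 64)

Fewer reducers means larger partitions, so this trades open-file pressure for per-task memory and runtime.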

The default ulimit is 1024, which is too low for large-scale applications.

Use `ulimit -a` to see the current maximum number of open files.
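
For example, on a typical Linux shell (the value below is illustrative):

    ulimit -a          # list all limits; look for the "open files" entry
    ulimit -n          # print just the open-files limit (often 1024 by default)
    ulimit -n 65536    # raise the limit for the current shell session only

Note that a non-root user can only raise the soft limit up to the configured hard limit.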

You can temporarily raise the limit for the current shell with `ulimit -n`; to make the change permanent, update the system configuration files.

The relevant files are:

/etc/sysctl.conf
/etc/security/limits.conf
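
As a sketch of the kind of entries involved (the user name and values are placeholders; exact syntax can vary by distribution):

    # /etc/security/limits.conf -- per-user open-file limits
    sparkuser  soft  nofile  65536
    sparkuser  hard  nofile  65536

    # /etc/sysctl.conf -- system-wide cap on open file handles
    fs.file-max = 2097152

Run `sysctl -p` to reload the sysctl settings; the limits.conf entries take effect on the next login session.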

I faced the same issue when I applied two foreachRDD() calls on the same stream. The first one published events to a Kafka topic and the second wrote the output to HDFS.

    stream.foreachRDD(rdd => {
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()

      val batchDF = spark.createDataFrame(rdd, batchOutputSchema)
      // Publish to Kafka
      batchDF
        .write.format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServer)
        .option("topic", "topic_name")
        .save()
    })

    stream.foreachRDD(rdd => {
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()

      val batchDF = spark.createDataFrame(rdd, batchOutputSchema)
      // Write the output into HDFS
      batchDF
        .write.mode("append")
        .parquet("/path")
    })

I combined the two outputs in the same foreachRDD() and applied cache() to the resulting DataFrame, so each batch is computed only once instead of once per output.

    stream.foreachRDD(rdd => {
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()

      val batchDF = spark.createDataFrame(rdd, batchOutputSchema).cache()
      
      // Write into HDFS
      batchDF
        .write.mode("append")
        .parquet("/path")

      // Publish to Kafka
      batchDF
        .write.format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServer)
        .option("topic", "topic_name")
        .save()

    })
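
One possible refinement, which is my assumption rather than part of the original answer: release the cached batch once both writes have finished, so executors do not keep cached blocks around between batches.

    // Hypothetical addition at the end of the foreachRDD body:
    batchDF.unpersist()  // free the cached DataFrame after both sinks are written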
