
Spark - "too many open files" in shuffle

Using Spark 1.1

I have 2 datasets. One is very large and the other was reduced (using some 1:100 filtering) to a much smaller scale. I need to reduce the large dataset to the same scale, by joining only those items from the smaller list with their corresponding counterparts in the larger list (both lists contain elements that share a common join field).

I am doing that using the following code:

  • The "if(joinKeys != null)" part is the relevant part “ if(joinKeys!= null)”部分是相关部分
  • Smaller list is "joinKeys", larger list is "keyedEvents" 较小的列表是“ joinKeys”,较大的列表是“ keyedEvents”

    private static JavaRDD<ObjectNode> createOutputType(JavaRDD jsonsList, final String type, String outputPath, JavaPairRDD<String, String> joinKeys) {

        outputPath = outputPath + "/" + type;
        JavaRDD events = jsonsList.filter(new TypeFilter(type));

        // This is in case we need to narrow the list to match some other list of ids...
        // Recommendation List, for example... :)
        if (joinKeys != null) {
            JavaPairRDD<String, ObjectNode> keyedEvents = events.mapToPair(new KeyAdder("requestId"));
            JavaRDD<ObjectNode> joinedEvents = joinKeys.join(keyedEvents).values().map(new PairToSecond());
            events = joinedEvents;
        }

        JavaPairRDD<String, Iterable<ObjectNode>> groupedEvents = events.mapToPair(new KeyAdder("sliceKey")).groupByKey();

        // Convert the jsons to strings and add "\n" at the end of each
        JavaPairRDD<String, String> groupedStrings = groupedEvents.mapToPair(new JsonsToStrings());
        groupedStrings.saveAsHadoopFile(outputPath, String.class, String.class, KeyBasedMultipleTextOutputFormat.class);
        return events;
    }

The thing is, when running this job, I always get the same error:

Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2757 in stage 13.0 failed 4 times, most recent failure: Lost task 2757.3 in stage 13.0 (TID 47681, hadoop-w-175.c.taboola-qa-01.internal): java.io.FileNotFoundException: /hadoop/spark/tmp/spark-local-20141201184944-ba09/36/shuffle_6_2757_2762 (Too many open files)
    java.io.FileOutputStream.open(Native Method)
    java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
    org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
    org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
    org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

I already increased my ulimits by doing the following on all cluster machines:

echo "* soft nofile 900000" >> /etc/security/limits.conf
echo "root soft nofile 900000" >> /etc/security/limits.conf
echo "* hard nofile 990000" >> /etc/security/limits.conf
echo "root hard nofile 990000" >> /etc/security/limits.conf
echo "session required pam_limits.so" >> /etc/pam.d/common-session
echo "session required pam_limits.so" >> /etc/pam.d/common-session-noninteractive

But that doesn't fix my problem...

The bdutil framework works in such a way that the user "hadoop" is the one running the job. The script that deploys the cluster created a file, /etc/security/limits.d/hadoop.conf, which overrode the ulimit settings for the "hadoop" user and which I wasn't aware of. By deleting this file, or alternatively setting the desired ulimits there, I was able to resolve the problem.
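For reference, a minimal sketch of how to check and fix this, assuming the bdutil deployment described above (the exact contents of the generated hadoop.conf may differ on your cluster):

# Check the limits that actually apply to the "hadoop" user, not just to root or to your interactive shell
sudo su - hadoop -c 'ulimit -Sn; ulimit -Hn'

# Inspect the per-user override created by the deployment script
cat /etc/security/limits.d/hadoop.conf

# Either delete that file, or set the desired values in it, for example:
echo "hadoop soft nofile 900000" | sudo tee /etc/security/limits.d/hadoop.conf
echo "hadoop hard nofile 990000" | sudo tee -a /etc/security/limits.d/hadoop.conf

Note that limits are applied when a session starts, so the Spark worker processes typically need to be restarted before the new values take effect.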
