Spark: sc.wholeTextFiles takes a long time to execute

I have a cluster on which I execute wholeTextFiles, which should pull about a million text files summing to approximately 10GB in total. I have one NameNode and two DataNodes, each with 30GB of RAM and 4 cores. The data is stored in HDFS.

I don't pass any special parameters, and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?

I'm just starting out and I've never needed to optimize a job before.

EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it was programmed.) I'm very interested in understanding the partition parameter, etc.
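For reference, a minimal sketch of the kind of call I mean (the HDFS path below is just a placeholder): wholeTextFiles returns an RDD of (file path, file content) pairs, and the optional second argument is only a suggested minimum number of partitions, not an exact count.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wholeTextFiles-read"))

// hdfs:///data/small-files is a placeholder path.
// Returns RDD[(String, String)]: (full file path, entire file content).
// The second argument is a *minimum* number of partitions, i.e. a hint.
val files = sc.wholeTextFiles("hdfs:///data/small-files", 64)
println(s"partitions = ${files.partitions.length}, files = ${files.count()}")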

EDIT 2: benchmark assessment

So I tried a repartition after the wholeTextFiles call; the problem is the same, because the first read still uses the pre-defined number of partitions, so there is no performance improvement. Once the data is loaded the cluster performs really well... I get the following warning message on the wholeTextFiles stage when dealing with the data (for 200k files):

15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.

Would that be a reason for the bad performance? How do I mitigate that?

Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s...

It seems that by increasing the number of partitions in wholeTextFiles(path, partitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPUs). I'm benchmarking to observe the limit...
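A minimal sketch of such a benchmark, assuming a placeholder HDFS path and arbitrary partition counts (not my exact code):

for (minPartitions <- Seq(8, 32, 128)) {
  val start = System.nanoTime()
  val rdd = sc.wholeTextFiles("hdfs:///data/small-files", minPartitions)  // placeholder path
  val n = rdd.count()                                                     // forces the actual read
  val secs = (System.nanoTime() - start) / 1e9
  println(s"requested=$minPartitions actual=${rdd.partitions.length} files=$n time=${secs}s")
}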

To summarize my recommendations from the comments:

  1. HDFS is not a good fit for storing many small files. First of all, the NameNode stores metadata in memory, so the number of files and blocks you can have is limited (~100M blocks is the maximum for a typical server). Next, each time you read a file you first query the NameNode for the block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really huge.
  2. Default settings should always be reviewed. By default Spark starts on YARN with 2 executors (--num-executors), 1 thread each (--executor-cores) and 512m of RAM (--executor-memory), giving you only 2 threads with 512MB of RAM each, which is really small for real-world tasks.

So my recommendations are:

  1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism - 16 threads in this particular case, which means 16 tasks running in parallel.
  2. Use sc.wholeTextFiles to read the files and then dump them into a compressed sequence file (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/ . This will greatly reduce the time needed to read them on the next iteration (a short sketch follows below).
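A minimal sketch of recommendation 2, assuming placeholder HDFS paths and that the Snappy native libraries are available on the cluster (adjust the paths and codec to your setup):

// Recommendation 1: submit with more resources, e.g.
//   spark-submit --num-executors 4 --executor-memory 12g --executor-cores 4 ...

import org.apache.spark.SparkContext._            // implicits for saveAsSequenceFile (Spark 1.x)
import org.apache.hadoop.io.compress.SnappyCodec

// Read the small files once...
val files = sc.wholeTextFiles("hdfs:///data/small-files", 64)   // placeholder path

// ...and pack the (path, content) pairs into Snappy-compressed SequenceFiles.
files.saveAsSequenceFile("hdfs:///data/packed", Some(classOf[SnappyCodec]))

// Subsequent jobs read the packed data instead of a million small files:
val packed = sc.sequenceFile[String, String]("hdfs:///data/packed")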
