
Spark repartition data for small file

I am pretty new to Spark and I am using a cluster mainly for parallelizing purposes. I have a 100MB file, each line of which is processed by an algorithm that is quite heavy and long-running.

I want to use a 10-node cluster and parallelize the processing. I know the block size is more than 100MB, and I tried to repartition the textFile. If I understand correctly, this repartition method increases the number of partitions:

JavaRDD<String> input = sc.textFile(args[0]);
input.repartition(10);

The issue is that when I deploy to the cluster, only a single node is effectively processing. How can I manage to process the file in parallel?

Update 1: here's my spark-submit command:

/usr/bin/spark-submit --master yarn --class mypackage.myclass \
    --jars myjar.jar \
    gs://mybucket/input.txt outfile
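
For comparison, here is a hedged variant of the same command that asks YARN for parallel resources explicitly. --num-executors and --executor-cores are standard spark-submit options; the values (10 executors, 2 cores each) are illustrative assumptions, not values from the question:

/usr/bin/spark-submit --master yarn \
    --num-executors 10 --executor-cores 2 \
    --class mypackage.myclass --jars myjar.jar \
    gs://mybucket/input.txt outfile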

Update 2: after the partition, there are basically 2 operations:

JavaPairRDD<String, String> int_input = mappingToPair(input);
JavaPairRDD<String, String> output = mappingValues(int_input, option);
output.saveAsTextFile("hdfs://...");

where mappingToPair(...) is:

public JavaPairRDD<String, String> mappingToPair(JavaRDD<String> input){
    return input.mapToPair(new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String line) {
            // split each line into a (key, value) pair on the first tab
            String[] arrayList = line.split("\t", 2);
            return new Tuple2<>(arrayList[0], arrayList[1]);
        }
    });
}

and mappingValues(...) is a method of the following type:

public JavaPairRDD<String, String> mappingValues(JavaPairRDD<String, String> rdd, final String option){
    return rdd.mapValues(
            new Function<String, String>() {
                public String call(String value) {
                    // here the algo processing takes place...
                    return value; // placeholder: the actual algorithm is elided in the question
                }
            }
    );
}

There could be multiple issues here:

  1. The file is only one block big. Reading this with multiple executors is not useful at all, since the HDFS node can serve one node at full speed, or two nodes at half speed (plus overhead), etc. Executor count becomes useful (for the read step) when you have multiple blocks scattered across different HDFS nodes.
  2. It is also possible that you are storing the file in a non-splittable compressed format, so the input step can only read it with one executor, even if it were 100 times as big as the block size.
  3. You do not chain the repartition(10) call into your flow, so it has no effect at all. If you replace this line: input.repartition(10); with this one: input = input.repartition(10); it will be used, and it should split the RDD into multiple partitions before continuing to the next step (see the sketch after this list).
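
Concretely, a minimal sketch of the corrected flow (variable names follow the question; getNumPartitions() assumes Spark 1.6+, and on older versions input.partitions().size() gives the same count):

JavaRDD<String> input = sc.textFile(args[0]);
// repartition returns a new RDD; without the reassignment the call is a no-op
input = input.repartition(10);
System.out.println("partitions: " + input.getNumPartitions());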

Please note that repartitioning can make your process even longer, since the data has to be split and transferred to the other machines, which can easily be bottlenecked by a slow network.

This is especially the case when you use the client deploy mode. This means that the first executor (the driver) is the local Spark instance you submit from. So it will first download all the data to the driver from the cluster, and then upload it back to the other YARN nodes after the partitioning.
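
If the client-mode driver is the bottleneck, one option is to run the driver on the cluster itself. A hedged sketch, reusing the command from Update 1 with the standard --deploy-mode flag added:

/usr/bin/spark-submit --master yarn --deploy-mode cluster \
    --class mypackage.myclass --jars myjar.jar \
    gs://mybucket/input.txt outfile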

I could go on about this, but the main thing I'm trying to say is: if your algorithm is very simple, the process might even run faster on one executor than with partitioning, transferring, and then running the algorithm on all executors in parallel.
