spark repartition data for small file
I am pretty new to Spark, and I am using a cluster mainly for parallelizing purposes. I have a 100MB file, each line of which is processed by some algorithm, and the processing is quite heavy and long.
I want to use a 10-node cluster and parallelize the processing. I know the block size is more than 100MB, and I tried to repartition the textFile. If I understand correctly, this repartition method increases the number of partitions:
JavaRDD<String> input = sc.textFile(args[0]);
input.repartition(10);
The issue is that when I deploy to the cluster, only a single node is effectively processing. How can I manage to process the file in parallel?
Update 1: here's my spark-submit command:
/usr/bin/spark-submit --master yarn --class mypackage.myclass --jars
myjar.jar
gs://mybucket/input.txt outfile
Update 2: After the partition, there are basically 2 operations:
JavaPairRDD<String, String> int_input = mappingToPair(input);
JavaPairRDD<String, String> output = mappingValues(int_input, option);
output.saveAsTextFile("hdfs://...");
where mappingToPair(...) is:
public JavaPairRDD<String, String> mappingToPair(JavaRDD<String> input) {
    return input.mapToPair(new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String line) {
            String[] arrayList = line.split("\t", 2);
            return new Tuple2<String, String>(arrayList[0], arrayList[1]);
        }
    });
}
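The key/value split inside call(...) can be checked in isolation with plain Java: split("\t", 2) uses a limit of 2, so the line is split only at the first tab and any further tabs stay inside the value. A minimal sketch (the sample line is made up for illustration):

```java
public class SplitCheck {
    public static void main(String[] args) {
        // limit of 2: split at the first tab only, keep the rest of the line intact
        String line = "key1\tvalue with\tmore tabs";
        String[] parts = line.split("\t", 2);
        System.out.println(parts.length); // 2
        System.out.println(parts[0]);     // key1
        System.out.println(parts[1]);     // value with	more tabs
    }
}
```

Note that a line without any tab would yield an array of length 1, making arrayList[1] throw an ArrayIndexOutOfBoundsException, so malformed input lines are worth guarding against.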
and mappingValues(...) is a method of the following type:
public JavaPairRDD<String, String> mappingValues(JavaPairRDD<String, String> rdd, final String option) {
    return rdd.mapValues(new Function<String, String>() {
        public String call(String value) {
            // here the algo processing takes place...
        }
    });
}
There could be multiple issues here:

You do not link the result of the repartition(10) call back into your flow, so it is not effective at all. RDDs are immutable; repartition returns a new RDD instead of modifying the original. If you replace this line:

input.repartition(10);

with this one:

input = input.repartition(10);

it will be used, and it should split the RDD into multiple partitions before continuing to the next step. Please note that repartitioning can make your process even longer, since the data has to be split and transferred to the other machines, which can easily be bottlenecked by a slow network.
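The assignment matters because RDD transformations never mutate the receiver: they return a new RDD, and a discarded return value is simply lost. The same rule applies to any immutable Java object, which this Spark-free sketch illustrates with String (the variable names are just for illustration):

```java
public class ImmutableReturnValue {
    public static void main(String[] args) {
        String input = "hello";
        input.toUpperCase();         // like input.repartition(10): result discarded, input unchanged
        System.out.println(input);   // hello
        input = input.toUpperCase(); // like input = input.repartition(10): result kept
        System.out.println(input);   // HELLO
    }
}
```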
This is especially the case when you use the client deploy mode. This means that the first executor (the driver) is the local Spark instance you submit from, so it will first download all the data from the cluster to the driver, and then upload it back to the other YARN nodes after the partitioning. Submitting with --deploy-mode cluster keeps the driver on the cluster and avoids routing the data through your local machine.
I could go on about this, but the main thing I am trying to say is: if your algorithm is very simple, the process might even run faster on a single executor than it would after partitioning the data, transferring it, and then running the algorithm on all executors in parallel.