
Spark Sorting with JavaRDD<String>

Let's say I have a file of strings, one per line, that I import into a JavaRDD<String>. If I want to sort the strings and export them as a new file, how should I do it? The code below is my attempt, and it is not working:

JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
JavaRDD<String> lines = sparkContext.textFile(args[0]);
JavaRDD<String> sorted = lines.sortBy(i->i, true,1);
sorted.saveAsTextFile(args[1]);

What I mean by "not working" is that the output file is not sorted. I think the issue is with my "i -> i" key function; I am not sure how to make it sort using String's compare method, since each "i" is a String (and I'm also not sure how the different "i" values get compared against each other).
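For intuition about what `sortBy(i -> i, true, 1)` does with Strings, here is a minimal local analogue (plain Java, no Spark): `sortBy` extracts a key per element with the key function, then orders elements by the keys' natural `Comparable` ordering. With the identity function `i -> i`, the keys are the Strings themselves, so `String.compareTo` (lexicographic order) is used. The helper name `sortBy` below is just for illustration; it is not Spark's implementation.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SortByAnalogue {
    // Local analogue of RDD.sortBy(keyFunc, ascending, numPartitions):
    // extract a key per element, then sort by the keys' natural ordering.
    static <T, K extends Comparable<K>> List<T> sortBy(List<T> data,
                                                       Function<T, K> keyFunc,
                                                       boolean ascending) {
        Comparator<T> cmp = Comparator.comparing(keyFunc);
        if (!ascending) {
            cmp = cmp.reversed();
        }
        return data.stream().sorted(cmp).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // With the identity key function i -> i, Strings are compared via
        // String.compareTo, i.e. lexicographic order.
        List<String> sorted = sortBy(Arrays.asList("z", "b", "c", "a"), i -> i, true);
        System.out.println(sorted); // [a, b, c, z]
    }
}
```

So the `i -> i` part is fine on its own: String already implements Comparable, and no explicit comparator is needed.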

EDIT: I have modified the code as per the comments; I suspect the file was being read as one giant string.

JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
long start  = System.currentTimeMillis();

List<String> array = buildArrayList(args[0]);
JavaRDD<String> lines = sparkContext.parallelize(array);
JavaRDD<String> sorted = lines.sortBy(i->i, true, 1);
sorted.saveAsTextFile(args[1]);

Still not sorting it :(

I did a little research. Your code is correct. Here are the samples I tested:

Spark initialization

SparkSession spark = SparkSession.builder().appName("test")
        .config("spark.debug.maxToStringFields", 10000)
        .config("spark.sql.tungsten.enabled", true)
        .enableHiveSupport().getOrCreate();

JavaSparkContext jSpark = new JavaSparkContext(spark.sparkContext());

Example with an RDD

// RDD
JavaRDD<String> rdd = jSpark.parallelize(Arrays.asList("z", "b", "c", "a"));
JavaRDD<String> sorted = rdd.sortBy(i -> i, true, 1);
List<String> result = sorted.collect();
result.forEach(System.out::println);

The output is

a
b
c
z

You can also use the Dataset API:

// Dataset
Dataset<String> stringDataset = spark.createDataset(Arrays.asList("z", "b", "c", "a"), Encoders.STRING());
Dataset<String> sortedDataset = stringDataset.sort(stringDataset.col(stringDataset.columns()[0]).desc()); // sort() is ascending by default; .desc() reverses it
result = sortedDataset.collectAsList();
result.forEach(System.out::println);

The output is

z
c
b
a

I think your problem is that your text file has a specific line separator. If so, you can use the flatMap function to split your giant text string into line strings. Here is an example with Dataset:

// flatMap example
Dataset<String> singleLineDS = spark.createDataset(Arrays.asList("z:%b:%c:%a"), Encoders.STRING());
Dataset<String> splitedDS = singleLineDS.flatMap(i -> Arrays.asList(i.split(":%")).iterator(), Encoders.STRING());
Dataset<String> sortedSplitedDs = splitedDS.sort(splitedDS.col(splitedDS.columns()[0]).desc());
result = sortedSplitedDs.collectAsList();
result.forEach(System.out::println);

So you should find out which separator your text file uses and adapt the code above to your task.
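The split-then-sort logic of that flatMap pipeline can be sketched locally in plain Java (no Spark), which is a quick way to verify the separator before wiring it into the cluster job. The helper `splitAndSort` is hypothetical, just for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class SplitAndSort {
    // What the flatMap + sort pipeline does, expressed locally: split one
    // "giant string" on its separator, then sort the resulting lines.
    static List<String> splitAndSort(String giant, String separator) {
        // Pattern.quote treats the separator literally (split takes a regex).
        List<String> lines = new ArrayList<>(Arrays.asList(giant.split(Pattern.quote(separator))));
        Collections.sort(lines); // String natural (lexicographic) order
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(splitAndSort("z:%b:%c:%a", ":%")); // [a, b, c, z]
    }
}
```

Once the separator is confirmed, the same split expression can be dropped into the `flatMap` call above.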
