
How to sort in Java via Spark using the MapReduce functions

Hi, I'm looking for a way to do a simple sort in Spark (using Java) with the MapReduce-style functions. I'm very new to this, so a good explanation of how map/reduce works would be extremely helpful. The explanations I've read so far were okay, but they didn't show any code, which is what helps me most.

I have an input file with millions of ASCII 100-byte records (or, better yet, 100-byte binary records), and I want to sort on the first 10 bytes of each record/line. These files are about 10 TB, so it's a lot of data and I'm not sure what the fastest approach is. How would I go about doing this with map/reduce? Java is not my usual language, so writing out the actual code would be extremely helpful.

All I'm doing now is

SparkConf sparkConf = new SparkConf().setAppName("JavaSparkSimpleSort");
sparkConf.setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = jsc.textFile("hdfs://localhost:19000/hdfsfiles/ASCII500Mill", 10);

// sortBy returns a NEW sorted RDD; the original `lines` is unchanged,
// so the result has to be captured in a variable to be of any use.
JavaRDD<String> sorted = lines.sortBy(new Function<String, String>()
    {
        @Override
        public String call( String value ) {
          return value.substring(0, 10); // sort key: first 10 bytes of each record
        }
    }, true, 1);

jsc.stop();

EDIT: I'm still working on this and really need help. I can perform the map step fine and create the keys, but shouldn't I be able to just call sortByKey at that point? I read a somewhat similar question/answer on SO, and it seemed to say a reduce step is still necessary. I just don't really understand the "why" or "how" of what these calls are doing, or how to make this the simplest "MapReduce sort" I can. I just need map().reduce().sortByKey(), or whatever ordering of calls would make this work. Any help would be greatly appreciated.

EDIT2: I also see that people who use map on a text file are usually (from what I've seen) splitting it into individual words (that's the example given in the Spark programming guide). Since I'm already sorting whole lines (not individual words), maybe I don't need map at all? I know I'm sorting by key, but there's no reason to do more than return a "mapped" RDD of the input keyed by the first 10 bytes of each line. Then again, that way I lose track of the original line's offset/position. Sorry about the ignorance here; I'm not used to programming in Java (especially the anonymous function classes, even though they're similar to C# delegates), and not used to flatMap or Spark, so I'm way, way out of my element here. Again, any help is greatly appreciated. Thanks!
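A minimal sketch of what EDIT2 describes, under the assumptions of the snippet above (local master, the same hypothetical HDFS paths, and every line being at least 10 characters): no reduce step is needed for a plain sort, and with sortBy there is no separate map step either, because sortBy takes the key-extraction function directly and sorts the full records by it.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class JavaSparkSortBy {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JavaSparkSortBy").setMaster("local");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaRDD<String> lines = jsc.textFile("hdfs://localhost:19000/hdfsfiles/ASCII500Mill", 10);

        // sortBy derives the key (first 10 chars), sorts the FULL lines by it,
        // and returns a new RDD. The original records are kept intact, so no
        // offset/position information is lost.
        JavaRDD<String> sorted = lines.sortBy(line -> line.substring(0, 10), true, 10);

        // Transformations are lazy; saveAsTextFile is the action that
        // actually triggers the shuffle and the sort.
        sorted.saveAsTextFile("hdfs://localhost:19000/hdfsfiles/ASCII500MillSorted");
        jsc.stop();
    }
}
```

The key function here is the same substring(0, 10) as in the snippet above; Spark handles the partition-and-merge ("reduce side") of the sort internally, which is why no explicit reduce call appears.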

I found that this sorts, although not by the first ten bytes of the record; instead it builds (line, count) pairs, sums the counts per distinct line, and then sorts by key (the whole line). It's not exactly what I wanted, but it should help anyone who comes here. I pulled the map/reduce part from the Spark documentation.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JavaSparkPi {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi");
        sparkConf.setMaster("local");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        JavaRDD<String> lines = jsc.textFile("hdfs://localhost:19000/hdfsfiles/500mill", 10);
        System.out.println("First element in UNsorted RDD: " + lines.first());

        // Map each line to a (line, 1) pair, sum the counts per distinct line,
        // then sort the pairs by key (the whole line, not its first 10 bytes).
        JavaPairRDD<String, Integer> pairs = lines.mapToPair(p -> new Tuple2<>(p, 1));
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
        JavaPairRDD<String, Integer> sorted = counts.sortByKey();

        // Read from and save the SORTED RDD -- `lines` itself is never changed.
        System.out.println("First element in sorted RDD: " + sorted.first());
        sorted.saveAsTextFile("hdfs://localhost:19000/hdfsfiles/500millOUT4");
        jsc.stop();
    }
}
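For the sort the question originally asked for (by the first 10 bytes, keeping every record rather than counting duplicates), a small variation on the program above should suffice: make the key the 10-byte prefix, keep the whole line as the value, and drop the reduceByKey. This is a sketch under the same assumptions as before (local master, the same hypothetical HDFS paths, lines of at least 10 characters).

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JavaSparkPrefixSort {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPrefixSort");
        sparkConf.setMaster("local");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        JavaRDD<String> lines = jsc.textFile("hdfs://localhost:19000/hdfsfiles/500mill", 10);

        // Map: key = first 10 bytes of the record, value = the whole record,
        // so nothing about the original line is lost.
        JavaPairRDD<String, String> keyed =
            lines.mapToPair(line -> new Tuple2<>(line.substring(0, 10), line));

        // Shuffle/sort: range-partition by key and sort within each partition.
        // No reduceByKey -- duplicate keys are kept in order, not combined.
        JavaRDD<String> sorted = keyed.sortByKey(true).values();

        sorted.saveAsTextFile("hdfs://localhost:19000/hdfsfiles/500millPrefixSortOUT");
        jsc.stop();
    }
}
```

The difference from the counting version is only in what the pair holds: (prefix, full line) instead of (line, 1), so sortByKey orders the records by their 10-byte prefix and values() strips the keys back off before saving.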
