Extract only certain columns in Java Spark

Question

I have a file with 10 columns. What's the most elegant way to extract only the first 3 columns, or a specific set of columns?

For example, this is what my file looks like:

john,smith,84,male,kansas
john,doe,48,male,california
tim,jones,22,male,delaware

And I want to extract into this:

[john, smith, kansas]
[john, doe, california]
[tim, jones, delaware]

What I have is this, but it doesn't specifically choose the columns that I want:

JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(lines -> Arrays.asList(lines.split(",")))
                 .foreach(person -> LOG.info(person));

I read the following two Stack Overflow posts, but I still can't decide how to do this.

EDIT: I ended up doing the following:

JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(lines -> Arrays.asList(new String[]{lines.split(",")[0],
                                                          lines.split(",")[1],
                                                          lines.split(",")[4]}))
                 .foreach(person -> LOG.info(person));

It's not the most elegant solution, so if you have a better way, please post it here. Thanks.

Answer 1

EDIT: Apologies, I just realized you were asking for a Java solution, but I've used Scala. Only the 3rd of my suggestions has an equivalent in Java (added at the bottom of the answer)... Spark is really much nicer in Scala though :-)

One way is to perform the split, then pattern match on the result to select the columns you want:

peopleRDD.cache().map(_.split(",") match { case Array(a,b,_,_,e) => List(a,b,e) })

Another (depending on which combinations of elements you want) is to use take and drop, using a val to avoid splitting repeatedly:

peopleRDD.cache().map{ line => 
    val parts = line.split(",") 
    parts.take(2) ++ parts.drop(4)
}

(You can add a toList after the split if you want a List rather than an Array for each result element in the RDD)
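Since the question asks for Java, here is a rough Java counterpart of the take/drop idea. It is only a sketch: the class `TakeDrop` and the method `takeAndDrop` are hypothetical names, and the logic is kept free of Spark so it can be run on its own. Assuming the usual lambda-friendly `JavaRDD.map`, it could then be plugged in as `peopleRDD.map(line -> TakeDrop.takeAndDrop(line, 2, 4))`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TakeDrop {
    // Hypothetical helper: keeps the first `take` fields plus every field from
    // index `drop` onward, mirroring parts.take(take) ++ parts.drop(drop) in Scala.
    public static List<String> takeAndDrop(String line, int take, int drop) {
        List<String> parts = Arrays.asList(line.split(","));  // split once, reuse
        List<String> result = new ArrayList<>();
        // take(n): the first n fields, or fewer if the line is short
        result.addAll(parts.subList(0, Math.min(take, parts.size())));
        // drop(n): everything after the first n fields, if any remain
        if (drop < parts.size()) {
            result.addAll(parts.subList(drop, parts.size()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [john, smith, kansas]
        System.out.println(takeAndDrop("john,smith,84,male,kansas", 2, 4));
    }
}
```

The result is copied into an `ArrayList` rather than returned as a `subList` view, so the returned list is serializable and does not depend on the backing array.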

In fact, the same approach can be used to simplify your original solution, e.g.:

peopleRDD.cache().map{ line => 
  val parts = line.split(",")
  List(parts(0), parts(1), parts(4))
}

In Java 8, you can probably do the equivalent, which is a slight improvement because it avoids calling split repeatedly. Something like:

peopleRDD.cache().map(line -> {
  String[] parts = line.split(",");  // split once and reuse the result
  return Arrays.asList(parts[0], parts[1], parts[4]);
});
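If the set of columns varies, the index selection can be pulled into a small helper. The sketch below is one possible shape, not part of the original answer: `ColumnSelector` and `select` are hypothetical names, and it assumes a plain comma-separated line with no quoted fields. With Spark it could be used as `peopleRDD.map(line -> ColumnSelector.select(line, 0, 1, 4))`.

```java
import java.util.ArrayList;
import java.util.List;

class ColumnSelector {
    // Hypothetical helper: returns the fields at the given zero-based indices.
    public static List<String> select(String line, int... indices) {
        // A limit of -1 keeps trailing empty fields, so "a,b,,," still has 5 parts.
        String[] parts = line.split(",", -1);
        List<String> selected = new ArrayList<>(indices.length);
        for (int i : indices) {
            // Throws ArrayIndexOutOfBoundsException if the line has too few fields.
            selected.add(parts[i]);
        }
        return selected;
    }

    public static void main(String[] args) {
        // prints [john, smith, kansas]
        System.out.println(select("john,smith,84,male,kansas", 0, 1, 4));
    }
}
```

Note that a plain `split(",")` drops trailing empty strings, so a line ending in an empty last column would otherwise have fewer parts than expected.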

1 answer

solution1
1 ACCEPTED 2016-04-27 19:00:31
