
Filtering null values from an RDD&lt;Vector&gt; in Spark

I have a dataset of doubles in the form of a JavaRDD. I want to remove the rows (vectors) containing null values. I was going to use the filter function to do that, but cannot figure out how. I am pretty new to Spark and MLlib and would really appreciate it if you could help me out. This is what my parsed data looks like:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

String path = "data.txt";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Vector> parsedData = data.map(
  new Function<String, Vector>() {
    public Vector call(String s) {
      // Split the line on spaces and parse each token as a double
      String[] sarray = s.split(" ");
      double[] values = new double[sarray.length];
      for (int i = 0; i < sarray.length; i++)
        values[i] = Double.parseDouble(sarray[i]);
      return Vectors.dense(values);
    }
  }
);

Checking each vector[i] element against null might put you in the clear. You could then perform an operation similar to vector.remove(n), where n is the index of the element to remove from the vector.
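A minimal sketch of this idea, assuming the missing values appear in data.txt as unparseable tokens (the literal string "null" or an empty field): since a dense Vector stores primitive doubles and cannot hold null, the check has to happen at the String stage, before the vectors are built.

// Hedged sketch: reject any row containing a token that looks like a
// missing value. The token checks below are assumptions about how nulls
// show up in data.txt, not something stated in the question.
JavaRDD<String> cleanData = data.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      for (String token : s.split(" ")) {
        if (token.isEmpty() || token.equals("null")) {
          return false; // drop the entire row
        }
      }
      return true; // keep rows where every token is present
    }
  }
);

parsedData can then be built from cleanData exactly as in the question, since every remaining token parses cleanly.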

// Build the all-zero comparison vector once, outside the lambda
Vector emptyVector = Vectors.dense(new double[vector_length]);
parsedData = parsedData.filter((Vector s) -> !s.equals(emptyVector));

As mentioned in the comments, a Vector in an RDD can't contain null, since it stores primitive doubles. However, you might want to get rid of empty (all-zero) vectors using the filter method. This can be done by creating an empty vector and filtering it out, as the snippet above does.
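An end-to-end sketch of this approach, offered as a hedged example: the names vectorLength and emptyVector are illustrative, and mapping unparseable tokens to 0.0 is one assumed way to make "null" rows collapse into the zero vector.

// Parse each line, turning any unparseable token into 0.0, so a bad row
// becomes the all-zero vector; then filter those vectors out.
// vectorLength is a hypothetical value; it must match the width of data.txt.
final int vectorLength = 3;
final Vector emptyVector = Vectors.dense(new double[vectorLength]);

JavaRDD<Vector> parsed = data.map((String s) -> {
  String[] tokens = s.split(" ");
  double[] values = new double[vectorLength];
  for (int i = 0; i < Math.min(tokens.length, vectorLength); i++) {
    try {
      values[i] = Double.parseDouble(tokens[i]);
    } catch (NumberFormatException e) {
      values[i] = 0.0; // null or garbage token -> zero
    }
  }
  return Vectors.dense(values);
});

JavaRDD<Vector> nonEmpty = parsed.filter((Vector v) -> !v.equals(emptyVector));

Note that this also drops rows that are legitimately all zeros, so it only makes sense when a zero vector cannot occur in valid data.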
