
Filtering null values from an RDD&lt;Vector&gt; in Spark

I have a dataset of doubles in the form of a JavaRDD. I want to remove the rows (vectors) containing null values. I was going to use the filter function to do that, but cannot figure out how. I am pretty new to Spark and MLlib and would really appreciate it if you could help me out. This is what my parsed data looks like:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

String path = "data.txt";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Vector> parsedData = data.map(
  new Function<String, Vector>() {
    public Vector call(String s) {
      // Split the line on spaces and parse each token as a double
      String[] sarray = s.split(" ");
      double[] values = new double[sarray.length];
      for (int i = 0; i < sarray.length; i++)
        values[i] = Double.parseDouble(sarray[i]);
      return Vectors.dense(values);
    }
  }
);

Checking each vector[i] element against null might put you in the clear. You could then perform an operation similar to vector.remove(n), where n is the index of the element to remove from the vector.
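A minimal sketch of this idea, assuming the missing values appear in data.txt as unparseable tokens (the literal string "null" or an empty field): since a dense Vector stores primitive doubles and cannot hold null, the check has to happen at the String stage, before the vectors are built.

// Hedged sketch: reject any row containing a token that looks like a
// missing value. The token checks below are assumptions about how nulls
// show up in data.txt, not something stated in the question.
JavaRDD<String> cleanData = data.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      for (String token : s.split(" ")) {
        if (token.isEmpty() || token.equals("null")) {
          return false; // drop the entire row
        }
      }
      return true; // keep rows where every token is present
    }
  }
);

parsedData can then be built from cleanData exactly as in the question, since every remaining token parses cleanly.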

// Build the all-zero comparison vector once, outside the lambda
Vector emptyVector = Vectors.dense(new double[vector_length]);
parsedData = parsedData.filter((Vector s) -> !s.equals(emptyVector));

As mentioned in the comments, a Vector in an RDD can't contain null, since it stores primitive doubles. However, you might want to get rid of empty (all-zero) vectors using the filter method. This can be done by creating an empty vector and filtering it out, as the snippet above does.
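An end-to-end sketch of this approach, offered as a hedged example: the names vectorLength and emptyVector are illustrative, and mapping unparseable tokens to 0.0 is one assumed way to make "null" rows collapse into the zero vector.

// Parse each line, turning any unparseable token into 0.0, so a bad row
// becomes the all-zero vector; then filter those vectors out.
// vectorLength is a hypothetical value; it must match the width of data.txt.
final int vectorLength = 3;
final Vector emptyVector = Vectors.dense(new double[vectorLength]);

JavaRDD<Vector> parsed = data.map((String s) -> {
  String[] tokens = s.split(" ");
  double[] values = new double[vectorLength];
  for (int i = 0; i < Math.min(tokens.length, vectorLength); i++) {
    try {
      values[i] = Double.parseDouble(tokens[i]);
    } catch (NumberFormatException e) {
      values[i] = 0.0; // null or garbage token -> zero
    }
  }
  return Vectors.dense(values);
});

JavaRDD<Vector> nonEmpty = parsed.filter((Vector v) -> !v.equals(emptyVector));

Note that this also drops rows that are legitimately all zeros, so it only makes sense when a zero vector cannot occur in valid data.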
