
Convert Array to DenseVector in Spark DataFrame using Java

I am running Spark 2.3. I want to convert the features column in the following DataFrame from ArrayType to a DenseVector. I am using Spark in Java.

+---+--------------------+
| id|            features|
+---+--------------------+
|  0|[4.191401, -1.793...|
| 10|[-0.5674514, -1.3...|
| 20|[0.735613, -0.026...|
| 30|[-0.030161237, 0....|
| 40|[-0.038345724, -0...|
+---+--------------------+

root
 |-- id: integer (nullable = false)
 |-- features: array (nullable = true)
 |    |-- element: float (containsNull = false)

I have written the following UDF but it doesn't seem to be working:

private static UDF1 toVector = new UDF1<Float[], Vector>() {

    private static final long serialVersionUID = 1L;

    @Override
    public Vector call(Float[] t1) throws Exception {
        double[] doubleArray = new double[t1.length];
        for (int i = 0; i < t1.length; i++) {
            doubleArray[i] = (double) t1[i];
        }
        Vector vector = (org.apache.spark.mllib.linalg.Vector) Vectors.dense(doubleArray);
        return vector;
    }
};

I want to extract these features as a vector so that I can perform clustering on them.

I am also registering the UDF and then calling it as follows:

spark.udf().register("toVector", (UserDefinedAggregateFunction) toVector);
df3 = df3.withColumn("featuresnew", callUDF("toVector", df3.col("features")));
df3.show();  

On running this snippet, I get the following error:

ReadProcessData$1 cannot be cast to org.apache.spark.sql.expressions.UserDefinedAggregateFunction

The problem lies in how you are registering the UDF in Spark. You should not use UserDefinedAggregateFunction, which is not a UDF but a UDAF used for aggregations. Instead, what you should do is:

spark.udf().register("toVector", toVector, new VectorUDT());
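Note that new VectorUDT() here is the mllib vector type (org.apache.spark.mllib.linalg.VectorUDT), matching the cast in the question. If you intend to feed the column into the newer Spark ML clustering API (org.apache.spark.ml.clustering), one option, sketched here as an assumption about your pipeline, is to register the ml vector type via SQLDataTypes instead (the ml-package VectorUDT is not public) and build the vector with org.apache.spark.ml.linalg.Vectors in the UDF:

import org.apache.spark.ml.linalg.SQLDataTypes;

spark.udf().register("toVector", toVector, SQLDataTypes.VectorType());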

To use the registered function, call:

df3.withColumn("featuresnew", callUDF("toVector", df3.col("features")));

The UDF itself should be slightly adjusted as follows:

import java.util.List;

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.api.java.UDF1;

import scala.collection.JavaConversions;
import scala.collection.Seq;

UDF1<Seq<Float>, Vector> toVector = new UDF1<Seq<Float>, Vector>() {

  @Override
  public Vector call(Seq<Float> t1) throws Exception {
    // Spark passes an array column into a Java UDF as a scala Seq, not a Java array
    List<Float> list = JavaConversions.seqAsJavaList(t1);
    double[] doubleArray = new double[list.size()];
    for (int i = 0; i < list.size(); i++) {
      doubleArray[i] = list.get(i);
    }
    return Vectors.dense(doubleArray);
  }
};
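As a side note, scala.collection.JavaConversions is deprecated in newer Scala versions. If you want to avoid it, the same conversion can be written with JavaConverters, roughly like this:

List<Float> list = scala.collection.JavaConverters.seqAsJavaListConverter(t1).asJava();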

Note that in Spark 2.3+ you can create a Scala-style UDF that can be invoked directly. From this answer:

UserDefinedFunction toVector = udf(
  (Seq<Float> array) -> /* udf code or method to call */, new VectorUDT()
);

df3.withColumn("featuresnew", toVector.apply(col("features")));
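For completeness, here is a minimal sketch of what that lambda could look like, assuming the same Seq-to-double[] conversion as above and static imports of org.apache.spark.sql.functions.udf and col (the cast to UDF1 just pins down which udf overload is meant):

UserDefinedFunction toVector = udf(
  (UDF1<Seq<Float>, Vector>) array -> {
    // same element-by-element copy as in the anonymous class above
    List<Float> list = JavaConversions.seqAsJavaList(array);
    double[] doubleArray = new double[list.size()];
    for (int i = 0; i < list.size(); i++) {
      doubleArray[i] = list.get(i);
    }
    return Vectors.dense(doubleArray);
  },
  new VectorUDT()
);

df3 = df3.withColumn("featuresnew", toVector.apply(col("features")));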
