I am running Spark 2.3 and using it from Java. I want to convert the features column in the following DataFrame from ArrayType to a DenseVector.
+---+--------------------+
| id| features|
+---+--------------------+
| 0|[4.191401, -1.793...|
| 10|[-0.5674514, -1.3...|
| 20|[0.735613, -0.026...|
| 30|[-0.030161237, 0....|
| 40|[-0.038345724, -0...|
+---+--------------------+
root
|-- id: integer (nullable = false)
|-- features: array (nullable = true)
| |-- element: float (containsNull = false)
I have written the following UDF
but it doesn't seem to be working:
private static UDF1 toVector = new UDF1<Float[], Vector>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Vector call(Float[] t1) throws Exception {
        double[] DoubleArray = new double[t1.length];
        for (int i = 0; i < t1.length; i++) {
            DoubleArray[i] = (double) t1[i];
        }
        return (org.apache.spark.mllib.linalg.Vector) Vectors.dense(DoubleArray);
    }
};
I want to extract the features column as a vector so that I can perform clustering on it.
I am also registering the UDF and then proceeding on to call it as follows:
spark.udf().register("toVector", (UserDefinedAggregateFunction) toVector);
df3 = df3.withColumn("featuresnew", callUDF("toVector", df3.col("features")));
df3.show();
On running this snippet I am facing the following error:
ReadProcessData$1 cannot be cast to org.apache.spark.sql.expressions.UserDefinedAggregateFunction
The problem lies in how you are registering the UDF in Spark. You should not cast it to UserDefinedAggregateFunction, which is not a UDF but a UDAF, used for aggregations. Instead, register it like this:
spark.udf().register("toVector", toVector, new VectorUDT());
Then to use the registered function, use:
df3.withColumn("featuresnew", callUDF("toVector", df3.col("features")));
The UDF itself needs a slight adjustment: Spark hands an array column to a Java UDF as a Scala Seq (a WrappedArray), not as a Java array, so the input type must be Seq<Float>:
UDF1 toVector = new UDF1<Seq<Float>, Vector>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Vector call(Seq<Float> t1) throws Exception {
        List<Float> list = scala.collection.JavaConversions.seqAsJavaList(t1);
        double[] doubleArray = new double[t1.length()];
        for (int i = 0; i < list.size(); i++) {
            doubleArray[i] = list.get(i);
        }
        return Vectors.dense(doubleArray);
    }
};
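The core of that UDF is just widening the boxed Float elements into a primitive double[]. As a minimal, Spark-free sketch (the class and helper name here are mine, purely for illustration), the conversion logic can be exercised on a plain Java List<Float>:

```java
import java.util.Arrays;
import java.util.List;

public class FloatWidening {
    // Hypothetical helper mirroring the loop inside the UDF:
    // each boxed Float is auto-unboxed and widened to a double slot.
    static double[] toDoubleArray(List<Float> floats) {
        double[] result = new double[floats.size()];
        for (int i = 0; i < floats.size(); i++) {
            result[i] = floats.get(i); // unbox Float, widen float -> double
        }
        return result;
    }

    public static void main(String[] args) {
        List<Float> input = Arrays.asList(1.5f, -2.0f, 0.25f);
        System.out.println(Arrays.toString(toDoubleArray(input)));
    }
}
```

Keeping this loop separate from the Seq-to-List conversion makes the UDF easier to reason about: the only Spark-specific parts left are the Seq input type and the Vectors.dense call.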
Note that in Spark 2.3+ you can create a Scala-style UDF that can be invoked directly, without registering it by name:
UserDefinedFunction toVector = udf(
    (Seq<Float> array) -> /* udf code or method to call */, new VectorUDT()
);

df3.withColumn("featuresnew", toVector.apply(col("features")));