I have a dataframe like this:
+------+-----+-------------------+--------------------+
|    Id|Label|          Timestamp|         Signal_list|
+------+-----+-------------------+--------------------+
|A05439|    1|2014-05-20 05:05:21|[-116, -123, -129...|
|A06392|    1|2013-12-27 04:12:33|[260, 314, 370, 4...|
|A08192|    1|2014-06-03 04:06:15|[334, 465, 628, 8...|
|A08219|    3|2013-12-31 03:12:41|[-114, -140, -157...|
|A02894|    2|2013-10-28 06:10:53|[109, 139, 170, 1...|
+------+-----+-------------------+--------------------+
The Signal_list column of this dataframe holds about 9k elements per row, and I want to convert it into a vector column. I tried the UDF below:
import org.apache.spark.ml.linalg._

val convertUDF = udf((array: Seq[Long]) => {
  Vectors.dense(array.toArray)
})

val afWithVector = afLabel.select("*").withColumn("Signal_list", convertUDF($"Signal_list"))
But it gives this error:
<console>:39: error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.ml.linalg.Vector <and>
  (firstValue: Double, otherValues: Double*)org.apache.spark.ml.linalg.Vector
 cannot be applied to (Array[Long])
       Vectors.dense(array.toArray)
Dataframe schema:

root
 |-- Id: string (nullable = true)
 |-- Label: integer (nullable = true)
 |-- Timestamp: string (nullable = true)
 |-- Signal_list: array (nullable = true)
 |    |-- element: long (containsNull = true)
I'm new to Scala, so an answer using PySpark would be more helpful.
The UDF is nearly correct. The problem is that a vector in Spark can only hold doubles; longs are not accepted. The change would look like this in Scala:
val convertUDF = udf((array: Seq[Long]) => {
  Vectors.dense(array.toArray.map(_.toDouble))
})
In Python I believe it would look like this:
udf(lambda vs: Vectors.dense([float(i) for i in vs]), VectorUDT())