Spark Dataframe of WrappedArray to Dataframe[Vector]
I have a Spark DataFrame df with the following schema:
root
|-- features: array (nullable = true)
| |-- element: double (containsNull = false)
I would like to create a new DataFrame in which each row is a Vector of Doubles, and I expect to get the following schema:
root
|-- features: vector (nullable = true)
So far I have the following piece of code (influenced by this post: Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala), but I fear something is wrong with it because it takes a very long time to compute, even for a reasonable number of rows. Also, if there are too many rows, the application crashes with a heap space exception.
val clustSet = df.rdd.map(r => {
  val arr = r.getAs[mutable.WrappedArray[Double]]("features")
  val features: Vector = Vectors.dense(arr.toArray)
  features
}).map(Tuple1(_)).toDF()
I suspect that calling arr.toArray is not good Spark practice in this case. Any clarification would be very helpful.
Thank you!
It's because .rdd has to deserialize objects from the internal in-memory format, which is very time consuming.
It's OK to use .toArray: you are operating at the row level, not collecting everything to the driver node.
You can do this very easily with UDFs:
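import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.{col, udf}

// Convert each array<double> value into a dense ml Vector.
val convertUDF = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})

// df is the DataFrame from the question; the array column is replaced in place.
val withVector = df
  .withColumn("features", convertUDF(col("features")))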
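For anyone who wants to try the UDF approach end to end, here is a minimal sketch. The local SparkSession, the toy rows, and the names ArrayToVectorExample and arrayToVector are assumptions made for illustration; they are not part of the original question or answer.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object ArrayToVectorExample {
  // UDF that turns an array<double> value into a dense ml Vector.
  private val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))

  // Hypothetical helper: replaces the "features" array column with a vector column.
  def arrayToVector(df: DataFrame): DataFrame =
    df.withColumn("features", toVector(col("features")))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-to-vector")
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for the question's df: a single array<double> column.
    val df = Seq(
      Seq(1.0, 2.0, 3.0),
      Seq(4.0, 5.0, 6.0)
    ).toDF("features")

    val withVector = arrayToVector(df)
    withVector.printSchema()          // the features column should now be of type vector
    withVector.show(truncate = false)

    spark.stop()
  }
}
```

The conversion stays inside the DataFrame API, so there is no round trip through df.rdd and no need to rebuild the DataFrame with toDF() afterwards.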
Code is from this answer: Convert ArrayType(FloatType,false) to VectorUTD
However, the author of that question didn't ask about the differences between the two approaches.
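As a side note that is not part of the original answer: if you are on Spark 3.1 or later, the built-in array_to_vector function in org.apache.spark.ml.functions should make the UDF unnecessary. A minimal sketch, where the object name BuiltinArrayToVector and the method name convert are hypothetical:

```scala
import org.apache.spark.ml.functions.array_to_vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object BuiltinArrayToVector {
  // Assumes Spark 3.1+, where array_to_vector is available.
  // Replaces the "features" array column with a dense vector column.
  def convert(df: DataFrame): DataFrame =
    df.withColumn("features", array_to_vector(col("features")))
}
```

On older Spark versions this function is not available, and the UDF shown above remains the way to go.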