Spark Dataframe of WrappedArray to Dataframe[Vector]

I have a Spark Dataframe df with the following schema:

root
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = false)

I would like to create a new Dataframe where each row is a Vector of Doubles, and I expect to get the following schema:

root
 |-- features: vector (nullable = true)

So far I have the following piece of code (influenced by this post: Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala), but I fear something is wrong with it, because it takes a very long time to compute even a modest number of rows. Also, if there are too many rows, the application crashes with a heap space exception.

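import scala.collection.mutable
// Assumption: the ml (not mllib) linear algebra types; the question does not say which.
import org.apache.spark.ml.linalg.{Vector, Vectors}

val clustSet = df.rdd.map { r =>
  val arr = r.getAs[mutable.WrappedArray[Double]]("features")
  val features: Vector = Vectors.dense(arr.toArray)
  features
}.map(Tuple1(_)).toDF()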

I suspect that calling arr.toArray is not good Spark practice in this case. Any clarification would be very helpful.

Thank you!

It's because .rdd has to deserialize objects from Spark's internal in-memory format, which is very time-consuming.

It's OK to use .toArray: you are operating at the row level, not collecting everything to the driver node.
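
To make the row-level point concrete, here is a minimal sketch in plain Scala (no Spark involved). The `rows` value is a hypothetical stand-in for three values of the "features" column; it only illustrates that `.toArray` copies one row's values at a time rather than gathering the whole dataset in one place.

```scala
// Hypothetical stand-in for three rows of the "features" column.
val rows: Seq[Seq[Double]] = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0), Seq(5.0, 6.0))

// .toArray is applied to each row separately, so each copy is only as large as one row.
val perRow: Seq[Array[Double]] = rows.map(_.toArray)
```

In Spark the same per-row conversion runs on the executors, which is why it is not comparable to calling collect() on the whole Dataframe.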

You can do this very easily with UDFs:

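import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Convert each array<double> value into a dense ml Vector
val convertUDF = udf((array: Seq[Double]) => Vectors.dense(array.toArray))

// Overwrite the "features" column of df (the question's Dataframe) with its Vector form
val withVector = df.withColumn("features", convertUDF(col("features")))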
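
To try the UDF approach end to end, a self-contained sketch could look like the following. The local SparkSession, the application name, and the two sample rows are assumptions made purely for illustration; they are not part of the original question.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session, only for trying this out.
val spark = SparkSession.builder().master("local[*]").appName("array-to-vector-udf").getOrCreate()
import spark.implicits._

// Hypothetical sample data with an array<double> column named "features".
val df = Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0, 6.0)).toDF("features")

// Same conversion as in the answer: one dense Vector per row.
val convertUDF = udf((array: Seq[Double]) => Vectors.dense(array.toArray))
val withVector = df.withColumn("features", convertUDF(col("features")))

// The "features" column should now be reported with type vector.
withVector.printSchema()
```

Because the conversion stays inside the Dataframe API, there is no round trip through .rdd.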

Code is from this answer: Convert ArrayType(FloatType,false) to VectorUTD

However, the author of that question didn't ask about the differences between the approaches.
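
As a side note that is not part of the original answer: Spark 3.1 and later ship a helper, `org.apache.spark.ml.functions.array_to_vector`, that performs the same array-to-Vector conversion without a hand-written UDF. The sketch below assumes such a Spark version; the local session and sample data are hypothetical, and no performance difference is claimed.

```scala
import org.apache.spark.ml.functions.array_to_vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: Spark 3.1+ and a local session, only for trying this out.
val spark = SparkSession.builder().master("local[*]").appName("array-to-vector-builtin").getOrCreate()
import spark.implicits._

// Hypothetical sample data with an array<double> column named "features".
val df = Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0, 6.0)).toDF("features")

// Replace the array column with a dense Vector column.
val withVector = df.withColumn("features", array_to_vector(col("features")))

withVector.printSchema()
```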
