
Convert an org.apache.spark.mllib.linalg.Vector RDD to a DataFrame in Spark using Scala

I have an org.apache.spark.mllib.linalg.Vector RDD whose entries are of the form [Int Int Int]. I am trying to convert it into a DataFrame using this code:

import sqlContext.implicits._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.ArrayData

vectrdd is of type RDD[org.apache.spark.mllib.linalg.Vector].

val vectarr = vectrdd.toArray()
case class RFM(Recency: Integer, Frequency: Integer, Monetary: Integer)
val df = vectarr.map { case Array(p0, p1, p2) => RFM(p0, p1, p2) }.toDF()

I am getting the following error:

warning: fruitless type test: a value of type         
org.apache.spark.mllib.linalg.Vector cannot also be a Array[T]
val df = vectarr.map { case Array(p0, p1, p2) => RFM(p0, p1, p2) }.toDF()

error: pattern type is incompatible with expected type;
found   : Array[T]
required: org.apache.spark.mllib.linalg.Vector
val df = vectarr.map { case Array(p0, p1, p2) => RFM(p0, p1, p2) }.toDF()

The second method I tried is this:

val vectarr=vectrdd.toArray().take(2)
case class RFM(Recency: String, Frequency: String, Monetary: String)
val df = vectrdd.map { case (t0, t1, t2) => RFM(p0, p1, p2) }.toDF()

I got this error:

error: constructor cannot be instantiated to expected type;
found   : (T1, T2, T3)
required: org.apache.spark.mllib.linalg.Vector
val df = vectrdd.map { case (t0, t1, t2) => RFM(p0, p1, p2) }.toDF()

I used this example as a guide: Convert RDD to Dataframe in Spark/Scala

vectarr will have type Array[org.apache.spark.mllib.linalg.Vector], so in the pattern match you cannot match Array(p0, p1, p2), because each element being matched is a Vector, not an Array.
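To make this concrete, here is a minimal sketch (the standalone value vec is hypothetical, not from the question): a single Vector only matches an Array pattern after converting it with toArray.

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val vec: Vector = Vectors.dense(1.0, 2.0, 3.0)

// vec match { case Array(a, b, c) => ... }  // does not compile: a Vector is not an Array[T]
vec.toArray match {
  case Array(p0, p1, p2) => println(s"$p0, $p1, $p2")  // prints 1.0, 2.0, 3.0
}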

Also, you should not do val vectarr = vectrdd.toArray(): this collects the RDD into a local Array, and then the final call to toDF will not work, since toDF only works on RDDs.

The correct line would be (provided you change RFM to take Doubles):

val df = vectrdd.map(_.toArray).map { case Array(p0, p1, p2) => RFM(p0, p1, p2)}.toDF()

Or, equivalently, replace val vectarr = vectrdd.toArray() (which produces Array[Vector]) with val arrayRDD = vectrdd.map(_.toArray) (which produces RDD[Array[Double]]).
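Putting it all together, here is a minimal end-to-end sketch. It assumes a Spark 1.x shell where sc (SparkContext) and sqlContext (SQLContext) already exist, and the sample vectors are made up:

import org.apache.spark.mllib.linalg.Vectors
import sqlContext.implicits._  // brings toDF() into scope for RDDs

// In compiled code, define the case class at the top level so Spark can derive its schema
case class RFM(Recency: Double, Frequency: Double, Monetary: Double)

// Stand-in for the asker's RDD[org.apache.spark.mllib.linalg.Vector]
val vectrdd = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)
))

val df = vectrdd
  .map(_.toArray)                                     // Vector -> Array[Double]
  .map { case Array(p0, p1, p2) => RFM(p0, p1, p2) }  // Array[Double] -> case class
  .toDF()

df.show()  // a DataFrame with columns Recency, Frequency, Monetary

The switch from Integer to Double is not optional: an mllib Vector always stores its values as Double, so toArray yields Array[Double] and the case class fields must match.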
