Convert CassandraRDD to RDD[Array[String]]
I have a Cassandra table and have selected some of its columns to run association rules on. I created a case class for each column to hold its values. The column data has type
com.datastax.spark.connector.rdd.CassandraRDD[SuperStoreSalesRG]
where SuperStoreSalesRG is the case class for a single column. I want to convert it to
RDD[Array[String]]
How can I do that?
Many thanks.
This is what I've tried so far:
val test_spark_rdd = sc.cassandraTable("demo1", "orders4")
case class SuperStoreSalesPC (ProductCategory: String)
case class SuperStoreSalesCS (CustomerSegment: String)
case class SuperStoreSalesRG (Region: String)
val resultPC = test_spark_rdd.select("productcategory").as(SuperStoreSalesPC)
val resultCS = test_spark_rdd.select("customersegment").as(SuperStoreSalesCS)
val resultRG = test_spark_rdd.select("region").as(SuperStoreSalesRG)
I want to convert each of the vals resultPC, resultCS and resultRG into a separate RDD[Array[String]], where each val holds one column.
After you separate the three columns "productcategory", "customersegment" and "region" into the three datasets resultPC, resultCS and resultRG, you can do the following to convert each dataset to RDD[Array[String]].
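The first step is to use the built-in collect_list function. Note that withColumn is a DataFrame method, so this assumes resultPC is a DataFrame with a productcategory column rather than a raw CassandraRDD.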
import org.apache.spark.sql.functions._
val arrayedResultPC = resultPC.withColumn("productcategory", collect_list("productcategory"))
which creates a dataset with the following schema:
root
 |-- productcategory: array (nullable = true)
 |    |-- element: string (containsNull = true)
You can do the same for the other two datasets.
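The collect_list step can be tried on small in-memory data before running it against Cassandra. The sketch below assumes single-column DataFrames. The local SparkSession, the sample values and the collectColumn helper are hypothetical and introduced only for illustration. It uses select with collect_list, which aggregates the whole column into a single row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.collect_list

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("collect-list-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the three single-column datasets.
val resultPC = Seq("Furniture", "Technology").toDF("productcategory")
val resultCS = Seq("Consumer", "Corporate").toDF("customersegment")
val resultRG = Seq("West", "East").toDF("region")

// Hypothetical helper: collapse one column into a single row
// that holds an array of all of the column's values.
def collectColumn(df: DataFrame, column: String): DataFrame =
  df.select(collect_list(column).as(column))

val arrayedResultPC = collectColumn(resultPC, "productcategory")
val arrayedResultCS = collectColumn(resultCS, "customersegment")
val arrayedResultRG = collectColumn(resultRG, "region")

arrayedResultPC.printSchema()
```

Each arrayed DataFrame has exactly one row. collect_list does not guarantee the order of the collected values, and it skips nulls.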
The final step is to convert the collected datasets to RDD[Array[String]]:
// getSeq reads the collected array column; toArray yields Array[String] instead of a WrappedArray
val arrayedRdd = arrayedResultPC.rdd.map(_.getSeq[String](0).toArray)
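If the data is still an RDD of case-class instances, as in the question, a plain map over the RDD may be enough, since CassandraRDD is itself an RDD. This is a minimal sketch of that alternative, not the approach described above. The local SparkSession and the sample values are assumptions standing in for the real Cassandra table.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class SuperStoreSalesRG(Region: String)

// Local session for illustration only; in spark-shell `sc` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("rdd-map-sketch")
  .getOrCreate()

// Hypothetical stand-in for the CassandraRDD[SuperStoreSalesRG] from the question.
val resultRG: RDD[SuperStoreSalesRG] =
  spark.sparkContext.parallelize(Seq(SuperStoreSalesRG("West"), SuperStoreSalesRG("East")))

// Option 1: one single-element array per row.
val perRow: RDD[Array[String]] = resultRG.map(r => Array(r.Region))

// Option 2: one array holding the whole column, wrapped in a single-record RDD.
// collect() pulls every value to the driver, so this only suits small columns.
val wholeColumn: RDD[Array[String]] =
  spark.sparkContext.parallelize(Seq(resultRG.map(_.Region).collect()))
```

Which shape fits depends on what the association-rules step expects as one transaction: option 1 treats every row as its own record, while option 2 treats the whole column as a single record.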
I hope the answer is helpful.