Convert CassandraRDD to RDD[Array[String]]

I have a Cassandra table and have selected some of its columns to run association rules on. I created a case class for each column to hold its values. The column data has type

com.datastax.spark.connector.rdd.CassandraRDD[SuperStoreSalesRG]

where SuperStoreSalesRG is the case class for a single column. I want to convert it to

RDD[Array[String]]

How can I do that?

Many thanks.

This is what I've tried so far:

val test_spark_rdd = sc.cassandraTable("demo1", "orders4")

case class SuperStoreSalesPC (ProductCategory: String)
case class SuperStoreSalesCS (CustomerSegment: String)
case class SuperStoreSalesRG (Region: String)

val resultPC = test_spark_rdd.select("productcategory").as(SuperStoreSalesPC)
val resultCS = test_spark_rdd.select("customersegment").as(SuperStoreSalesCS)
val resultRG = test_spark_rdd.select("region").as(SuperStoreSalesRG)

I want to convert each of the vals resultPC, resultCS and resultRG into a separate RDD[Array[String]], where each val holds one column.

After you separate the three columns "productcategory", "customersegment" and "region" into three datasets resultPC, resultCS and resultRG, you can convert each dataset to RDD[Array[String]] as follows. Note that the code uses the Spark SQL DataFrame API (withColumn), so it assumes each column is available as a DataFrame rather than as a plain RDD.

The first step is to use the built-in collect_list function:

import org.apache.spark.sql.functions._
val arrayedResultPC = resultPC.withColumn("productcategory", collect_list("productcategory"))

which produces a dataset with the following schema:

root
 |-- productcategory: array (nullable = true)
 |    |-- element: string (containsNull = true)

You can do the same for the other two datasets.

The final step is to convert the collected datasets to RDD[Array[String]]:

// Read the array column of each row as Seq[String], then convert it so the result is RDD[Array[String]]
val arrayedRdd = arrayedResultPC.rdd.map(_.getSeq[String](0).toArray)
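
Since a CassandraRDD is itself an RDD, the conversion can also be sketched with plain RDD operations, without going through the DataFrame API. This is an alternative to the collect_list approach, not part of it. The object name ColumnToArrays, the helper names perRow and wholeColumn, and the sample region values are hypothetical; the parallelized collection only stands in for resultRG so the sketch runs without a Cassandra cluster.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

// Same shape as the case class in the question.
case class SuperStoreSalesRG(Region: String)

object ColumnToArrays {
  // One single-element array per row.
  def perRow(rows: RDD[SuperStoreSalesRG]): RDD[Array[String]] =
    rows.map(r => Array(r.Region))

  // One array holding the whole column: move everything into a single
  // partition, then let glom() turn that partition into one array.
  def wholeColumn(rows: RDD[SuperStoreSalesRG]): RDD[Array[String]] =
    rows.map(_.Region).coalesce(1).glom()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("column-to-arrays").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Stand-in data (assumption); with the connector, pass resultRG instead.
    val rows = sc.parallelize(Seq(SuperStoreSalesRG("West"), SuperStoreSalesRG("East")))

    perRow(rows).collect().foreach(a => println(a.mkString(",")))
    wholeColumn(rows).collect().foreach(a => println(a.mkString(",")))

    sc.stop()
  }
}
```

perRow keeps one array per table row, while wholeColumn mirrors the collect_list result by gathering the entire column into a single array. Which shape is appropriate depends on what the association-rule step expects as a transaction.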

I hope this answer is helpful.
