
How to convert RDD[CassandraRow] to DataFrame?

Currently this is how I am transforming a CassandraRow RDD to a DataFrame:

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

val ssc = new StreamingContext(sc, Seconds(15))

// Re-emit the same Cassandra scan on every 15-second batch
val dstream = new ConstantInputDStream(ssc, ssc.cassandraTable("db", "table").select("createdon"))

import sqlContext.implicits._

dstream.foreachRDD { rdd =>
  // Stringify each CassandraRow, then split out the date portion of createdon
  val dataframeJobs = rdd.map(_.dataAsString).map(_.split(":")(1)).map(_.split(" ")(1)).toDF("ondate")
}

As you can see, I first convert the CassandraRow RDD to strings, and then map them into the format I want. I find this approach gets complicated when the RDD contains multiple columns instead of just one (createdon) as in the example.

Is there any other, easier way to convert a CassandraRow RDD to a DataFrame?

My build.sbt is as follows:

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.1",
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.2",
  "org.apache.spark" %% "spark-streaming" % "2.0.2"
)

I figured out an alternative way that works effectively with any number of columns:

rdd.keyBy(row => row.getString("createdon")).map(x => x._1).toDF("ondate")
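
The same idea extends to more than one column by keying on a tuple and naming each column in toDF. A minimal sketch, assuming a hypothetical second text column jobid:

// Sketch only: "jobid" is a hypothetical second column, for illustration
rdd.keyBy(row => (row.getString("createdon"), row.getString("jobid")))
   .map(x => x._1)
   .toDF("ondate", "jobid")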

Quoting the scaladoc of SparkContextFunctions (removing the implicit params):

cassandraTable[T](keyspace: String, table: String): CassandraTableScanRDD[T] Returns a view of a Cassandra table as CassandraRDD. This method is made available on SparkContext by importing com.datastax.spark.connector._

Depending on the type parameter passed to cassandraTable, every row is converted to one of the following:

  • a CassandraRow object (default, if no type given)
  • a tuple containing column values in the same order as columns selected by CassandraRDD#select
  • an object of a user-defined class, populated by the appropriate ColumnMapper

So, I'd recommend using the following:

ssc.cassandraTable[String]("db", "table").select("createdon")

That should give you the easiest possible way to access createdon per the docs.
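
If more than one column is needed, the type parameter can, per the scaladoc quoted above, be a tuple or a mapped case class instead of String. A minimal sketch, assuming a hypothetical second column jobid:

// Sketch only: "jobid" and the Job case class are hypothetical, for illustration
case class Job(createdon: String, jobid: String)

// Tuple variant: column values arrive in the same order as select(...)
val asTuples = ssc.cassandraTable[(String, String)]("db", "table").select("createdon", "jobid")

// Case-class variant: populated by the connector's ColumnMapper
val asJobs = ssc.cassandraTable[Job]("db", "table").select("createdon", "jobid")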


I'm also wondering why you don't use the DataFrame support that spark-cassandra-connector provides, as described in Datasets. With that your code might get slightly simpler.
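
For reference, a minimal sketch of reading the same table straight into a DataFrame through the connector's org.apache.spark.sql.cassandra data source (assuming the same keyspace and table names as above):

// Sketch only: load the Cassandra table as a DataFrame and keep only createdon
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "db", "table" -> "table"))
  .load()
  .select("createdon")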

You could try to replace Spark Streaming (almost officially obsolete) with Spark SQL's Structured Streaming:

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.

I'm not sure however if Cassandra Spark Connector supports it.
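
For illustration only, here is what a minimal Structured Streaming query looks like with one of the built-in test sources (a socket, not Cassandra), just to show the shape of the API:

// Sketch only: reads lines from a local socket and prints each micro-batch to the console
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()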
