Converting CassandraTableScanRDD to org.apache.spark.rdd.RDD
I have the following situation. I have a large Cassandra table (with a large number of columns) that I would like to process with Spark. I want only selected columns to be loaded into Spark (i.e., apply the select and filtering on the Cassandra server itself):
val eptable = sc.cassandraTable("test", "devices")
  .select("device_ccompany", "device_model", "device_type")
The statement above gives a CassandraTableScanRDD, but how do I convert this into a Dataset/DataFrame?
Is there any other way I can do server-side filtering of columns and get DataFrames?
With the DataStax Spark Cassandra Connector, you would read Cassandra data as a Dataset and prune columns on the server side as follows:
val df = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "devices", "keyspace" -> "test" ))
.load()
val dfWithColumnPruned = df
.select("device_ccompany","device_model","device_type")
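Row-level predicates can be pushed down in the same way as the column pruning above. A minimal sketch (the `device_type` value `"phone"` is an assumption for illustration, not from the original question):

```scala
// Sketch: filters on the DataFrame are also pushed down to Cassandra
// where the connector supports them. The filter value is hypothetical.
val phones = dfWithColumnPruned
  .filter("device_type = 'phone'")

// Inspect the physical plan to confirm the pushdown: pushed predicates
// appear in the Cassandra scan node of the plan output.
phones.explain()
```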
Note that the selection operation I do after reading is pushed to the server side using Catalyst optimizations. Refer to this document for further information.
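If you already have the CassandraTableScanRDD from `sc.cassandraTable`, one way to get a DataFrame is to map its `CassandraRow` values into a case class and call `toDF()`. This is a sketch assuming the three columns are text; adjust the case class fields and getters to your actual schema:

```scala
import spark.implicits._

// Hypothetical case class matching the selected columns;
// field types must match your Cassandra schema.
case class Device(company: String, model: String, deviceType: String)

val deviceDF = eptable
  .map(row => Device(
    row.getString("device_ccompany"),
    row.getString("device_model"),
    row.getString("device_type")))
  .toDF()
```

That said, the `spark.read.format("org.apache.spark.sql.cassandra")` approach above is generally preferable, since it goes through Catalyst and handles pushdown for you.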