Cassandra to Hive using Spark

I have a Cassandra table like the one below and want to fetch records from Cassandra using some conditions and put them into a Hive table.

Cassandra Table (Employee) entries:

 Id   Name  Amount  Time
 1    abc   1000    2017041801
 2    def   1000    2017041802
 3    ghi   1000    2017041803
 4    jkl   1000    2017041804
 5    mno   1000    2017041805
 6    pqr   1000    2017041806
 7    stu   1000    2017041807

Assume that these table columns are all of type string. We have the same schema in Hive as well.

Now I want to import the Cassandra records between 2017041801 and 2017041804 into Hive or HDFS. In the second run I will pull the incremental records based on the previous run.

I am able to load the Cassandra data into an RDD using the syntax below.

import com.datastax.spark.connector._  // brings cassandraTable() into scope

val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("mydb", "Employee")

Now my problem is: how can I filter these records according to the between condition and persist the filtered records in Hive, or at a Hive external table path?

Unfortunately my Time column is not a clustering key in the Cassandra table, so I am not able to use the .where() clause.

I am new to Scala and Spark, so please kindly help with this filter logic, or let me know any better way of implementing it using DataFrames.

Thanks in advance.

  1. I recommend using the Connector DataFrame API for loading from C*: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
  2. Use a df.filter() call for the predicates.
  3. Use the saveAsTable() method to store the data in Hive.

Here is a Spark 2.0 example for your case:

val df = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "Employee", "keyspace" -> "mydb"))
  .load()

// time is a string column, so compare against string literals
df.filter("time between '2017041801' and '2017041804'")
  .write.mode("overwrite").saveAsTable("hivedb.employee")
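
If you want to stay with the RDD you already loaded, the equivalent is a client-side filter: since Time is not a clustering key, the predicate cannot be pushed down to Cassandra, so every row is read and then filtered in Spark. A minimal sketch, assuming the Cassandra column is stored in lower case as time:

// Client-side filter on the RDD: all rows are read from Cassandra and
// filtered in Spark. Lexicographic comparison of fixed-width yyyyMMddHH
// strings matches numeric order, so string bounds are safe here.
val filtered = rdd.filter { row =>
  val t = row.getString("time")
  t >= "2017041801" && t <= "2017041804"
}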
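
If you would rather land the filtered data at a Hive external table's location instead of a managed table, you can write directly to the path. This is a sketch; the HDFS path below is hypothetical and should point at your external table's LOCATION:

// Write the filtered rows as Parquet files to a hypothetical
// external-table path on HDFS
df.filter("time between '2017041801' and '2017041804'")
  .write
  .mode("overwrite")
  .parquet("hdfs:///user/hive/external/employee")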
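
For the incremental second run mentioned in the question, one possible approach (a sketch, assuming hivedb.employee already holds the previous run's data) is to use the maximum time value already loaded as a high-water mark:

import org.apache.spark.sql.functions.max

// High-water mark: the largest time value loaded by the previous run
val lastTime = spark.table("hivedb.employee")
  .agg(max("time"))
  .head.getString(0)

// Pull only rows newer than the high-water mark and append them
df.filter(s"time > '$lastTime'")
  .write.mode("append").saveAsTable("hivedb.employee")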
