
Read from Cassandra with Cloudera Hadoop using Spark

The scope is to read from HDFS, filter in Spark and write the results to Cassandra. I am packaging and running with SBT.

Here is the problem: reading from HDFS into Spark requires the following line in my sbt build file.

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.5.0"

However, reading and writing to Cassandra via

import java.nio.ByteBuffer
import java.util.SortedMap
import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat
// `job` is a Hadoop Job whose configuration carries the Cassandra connection settings
val casRdd = sc.newAPIHadoopRDD(
  job.getConfiguration(),
  classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer],
  classOf[SortedMap[ByteBuffer, IColumn]])

only works if the hadoop-client library dependency is either left out or changed to 0.1, 1.2.0, or 2.2.0 (non-CDH); unfortunately, the HDFS read is then not possible. If the hadoop-client line is added, the following error is thrown when trying to read from Cassandra:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

I therefore conclude that the Cassandra read/write problem seems to be Cloudera-related. Please note that the Cassandra read/write works simply by deleting the libraryDependencies line.

Since the HDFS and Cassandra reads need to work in the same project, how can this issue be resolved?

It seems you are trying to use the Apache Hadoop distribution from a Spark build targeting CDH.

Your project should never have to depend on hadoop-client, as Spark already does. In our Spark + Cassandra integration library, Calliope, we have a dependency on Spark:

"org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided"

We have been using this library with Apache Hadoop HDFS, CDH HDFS and our own SnackFS. All you need to ensure is that you deploy on the correct build of Spark.
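For illustration, a minimal build.sbt along these lines might look as follows; the Spark, Cassandra, and Scala versions are assumptions and should match the Spark build deployed on your cluster:

name := "hdfs-cassandra-spark-job"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  // Spark is "provided": it (and the hadoop-client it was built against) comes from
  // the Spark build on the cluster, so no explicit hadoop-client dependency is declared here.
  "org.apache.spark" %% "spark-core" % "0.9.1" % "provided",
  // Cassandra classes (ColumnFamilyInputFormat, IColumn, ...) used by the newAPIHadoopRDD call;
  // the artifact versions are placeholders.
  "org.apache.cassandra" % "cassandra-all" % "1.2.12",
  "org.apache.cassandra" % "cassandra-thrift" % "1.2.12"
)

Since the Spark dependency is provided, the hadoop-client actually used at runtime is whichever one ships with the Spark build you deploy on, which keeps the HDFS and Cassandra code paths consistent.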
