
Read from Cassandra with Cloudera Hadoop using Spark

The goal is to read from HDFS, filter in Spark, and write the results to Cassandra. I am packaging and running everything with SBT.

Here is the problem: reading from HDFS into Spark requires the following line in my sbt build file:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.5.0"
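For reference, since the CDH artifacts live in Cloudera's own repository rather than Maven Central, the build also needs a resolver along these lines (this is the usual Cloudera repository URL; adjust it to your setup if needed):

resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"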

However, reading and writing to Cassandra via

import java.nio.ByteBuffer
import java.util.SortedMap
import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat

val casRdd = sc.newAPIHadoopRDD(
  job.getConfiguration(),
  classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer],
  classOf[SortedMap[ByteBuffer, IColumn]])

only works if the hadoop-client library dependency is either left out or changed to 0.1, 1.2.0, or 2.2.0 (non-CDH) - but then the HDFS read is no longer possible. With the hadoop-client line present, the following error is thrown when trying to read from Cassandra:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

I therefore suspect that the Cassandra read/write problem is Cloudera-related. Note that the Cassandra read/write works again if the libraryDependencies line is simply deleted.
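For completeness, the job passed to newAPIHadoopRDD above is configured through Cassandra's ConfigHelper, roughly as follows (the host, port, partitioner, keyspace, and column family names are placeholders for my actual values):

import org.apache.hadoop.mapreduce.Job
import org.apache.cassandra.hadoop.ConfigHelper
import org.apache.cassandra.thrift.{SlicePredicate, SliceRange}

val job = new Job()
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost")
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160")
ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner")
ConfigHelper.setInputColumnFamily(job.getConfiguration(), "myKeyspace", "myColumnFamily")

// read all columns of every row
val predicate = new SlicePredicate()
predicate.setSlice_range(new SliceRange().setStart(Array.empty[Byte]).setFinish(Array.empty[Byte]))
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate)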

Since the HDFS read and the Cassandra read need to work in the same project, how can this issue be resolved?

It seems you are trying to use the Apache Hadoop distribution's client from a Spark build that targets CDH. The IncompatibleClassChangeError is the telltale sign of mixing Hadoop generations: org.apache.hadoop.mapreduce.JobContext is a class in Hadoop 1.x but became an interface in Hadoop 2.x, so Cassandra's input format, compiled against one, fails when the other is on the classpath.

Your project should never have to depend on hadoop-client directly, as Spark already does. In our Spark + Cassandra integration library Calliope, we only have a dependency on Spark -

"org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided"

We have been using this library with Apache Hadoop HDFS, CDH HDFS, and our own SnackFS. All you need to ensure is that you deploy on the correct build of Spark, i.e. one built against your cluster's Hadoop version.
