
How to connect spark with cassandra using spark-cassandra-connector?

You must forgive my noobness, but I'm trying to set up a Spark cluster that connects to Cassandra and runs a Python script; currently I am using DataStax Enterprise to run Cassandra in Solr search mode. I understand that, in order to use the spark-cassandra connector that DataStax provides, you must run Cassandra in analytics mode (using the -k option). Currently I have got it to work only using the DSE Spark version, for which I followed these steps:

  1. Start DSE Cassandra in analytics mode.
  2. Change the $PYTHONPATH env variable to /path/to/spark/dse/python:/path/to/spark/dse/python/lib/py4j-*.zip:$PYTHONPATH.
  3. Run the standalone script as root with python test-script.py (a minimal sketch of such a script follows this list).
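
For reference, even a minimal standalone script along these lines exercises that setup; the app name is a placeholder and the script only checks that the pyspark/py4j setup from step 2 actually works:

from pyspark import SparkConf, SparkContext

# Minimal sanity check: if the context starts, the DSE pyspark libraries
# on PYTHONPATH are being picked up correctly.
conf = SparkConf().setAppName('dse-pyspark-test')
sc = SparkContext(conf=conf)
print(sc.version)
sc.stop()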

Besides, I made another test using Spark alone (not the DSE version), trying to include the Java packages that make the driver classes accessible. I did:

  1. Add spark.driver.extraClassPath = /path/to/spark-cassandra-connector-SNAPSHOT.jar to the file spark-defaults.conf
  2. Execute $SPARK_HOME/bin/spark-submit --packages com.datastax.spark:spark-cassandra...
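
The same settings can also be supplied programmatically instead of through spark-defaults.conf; a sketch of what I mean (the connector coordinates below are an assumption, pick the artifact matching your Spark and Scala versions):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Pull the connector via spark.jars.packages instead of editing spark-defaults.conf;
# the version string is a placeholder for whatever matches your Spark build.
conf = (SparkConf()
        .setAppName('cassandra-test')
        .set('spark.jars.packages',
             'com.datastax.spark:spark-cassandra-connector_2.11:2.0.5')
        .set('spark.cassandra.connection.host', '127.0.0.1'))

spark = SparkSession.builder.config(conf=conf).getOrCreate()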

I also tried running the pyspark shell and testing whether sc had the cassandraTable method, to see if the driver was loaded, but that didn't work out. In both cases I get the following error message:

AttributeError: 'SparkContext' object has no attribute 'cassandraTable'

My goal is to understand what I must do to make the non-DSE Spark version connect with Cassandra and have the methods from the driver available.

I also want to know if it is possible to use the DSE spark-cassandra connector with a Cassandra node that is NOT running DSE.

Thanks for your help

Here is how to connect spark-shell to Cassandra in a non-DSE version.

Copy the spark-cassandra-connector jar to spark/spark-hadoop-directory/jars/

spark-shell --jars ~/spark/spark-hadoop-directory/jars/spark-cassandra-connector-*.jar

In the spark shell, execute these commands:

// stop the SparkContext that spark-shell created so it can be rebuilt with Cassandra settings
sc.stop
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
// point the connector at your Cassandra node, then recreate the context and the Cassandra SQL context
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val csc = new CassandraSQLContext(sc)

You will have to provide more parameters if your Cassandra has password authentication set up, etc. :)
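
For example, if authentication is enabled, the connector's auth properties can be passed the same way as the host; since the question is about Python, here is a hedged pyspark-style sketch (the credentials are placeholders):

from pyspark.sql import SparkSession

# Same connection settings as above, plus the connector's auth properties;
# replace the placeholder credentials with whatever your cluster uses.
spark = (SparkSession.builder
         .appName('cassandra-auth-test')
         .config('spark.cassandra.connection.host', 'localhost')
         .config('spark.cassandra.auth.username', 'cassandra')
         .config('spark.cassandra.auth.password', 'cassandra')
         .getOrCreate())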

I have used pyspark in a standalone Python script. I don't use DSE; I cloned cassandra-spark-connector from DataStax's GitHub repository and compiled it following the DataStax instructions.

In order to get access to the Spark connector within Spark, I copied it to the jars folder inside the Spark installation.

I think that it would be good for you as well:

 cp ~/spark-cassandra-connector/spark-cassandra-connector/target/full/scala-2.11/spark-cassandra-connector-assembly-2.0.5-86-ge36c048.jar $SPARK_HOME/jars/

You could visit this post where I explain my own experience setting up the environment.

Once Spark has access to the Cassandra connector, you can use the pyspark library as a wrapper:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession

# Build a session that points the connector at the local Cassandra node
spark = SparkSession.builder \
  .appName('SparkCassandraApp') \
  .config('spark.cassandra.connection.host', 'localhost') \
  .config('spark.cassandra.connection.port', '9042') \
  .config('spark.cassandra.output.consistency.level', 'ONE') \
  .master('local[2]') \
  .getOrCreate()

# Read a Cassandra table through the connector's DataFrame source
ds = spark \
  .read \
  .format('org.apache.spark.sql.cassandra') \
  .options(table='tablename', keyspace='keyspace_name') \
  .load()

ds.show(10)

In this example you can see the whole script.
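
Writing back to Cassandra goes through the same data source (which is where the spark.cassandra.output.consistency.level setting above comes into play); a hedged sketch reusing the placeholder keyspace/table names:

# Append the DataFrame back to the same (placeholder) table; the table must
# already exist in Cassandra with a matching schema.
ds.write \
  .format('org.apache.spark.sql.cassandra') \
  .options(table='tablename', keyspace='keyspace_name') \
  .mode('append') \
  .save()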
