简体   繁体   中英

spark, cassandra, streaming, python, error, database, kafka

im trying to save my streaming data from spark to cassandra, spark is conected to kafka and its working ok, but saving to cassandra its making me become crazy. Im using spark 2.0.2, kafka 0.10 and cassandra 2.23,

this is how im submiting to spark

spark-submit --verbose --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 --jars /tmp/pyspark-cassandra-0.3.5.jar --driver-class-path /tmp/pyspark-cassandra-0.3.5.jar --py-files /tmp/pyspark-cassandra-0.3.5.jar --conf spark.cassandra.connection.host=localhost /tmp/direct_kafka_wordcount5.py localhost:9092 testing

and this is my code it just a little modification from the spark examples, its works but i cant save this data to cassandra....

and this what im trying to do but just with the count result http://rustyrazorblade.com/2015/05/spark-streaming-with-python-and-kafka/

    from __future__ import print_function
import sys
import os
import time
import pyspark_cassandra
import pyspark_cassandra.streaming
from pyspark_cassandra import CassandraSparkContext
import urllib
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql.functions import from_unixtime, unix_timestamp, min, max
from pyspark.sql.types import FloatType
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    sqlContext = SQLContext(sc)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts=lines.count()
    counts.saveToCassandra("spark", "count")
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()

i got this error,

Traceback (most recent call last): File "/tmp/direct_kafka_wordcount5.py", line 88, in counts.saveToCassandra("spark", "count")

Pyspark Casasndra stopped being updated a while ago and the latest version only supports up to Spark 1.6 https://github.com/TargetHolding/pyspark-cassandra

Additionally

counts=lines.count() // Returns data to the driver (not an RDD)

counts is now an Integer. This means the function saveToCassandra doesn't apply since that is a function for RDDs

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM