
How to enable streaming from Cassandra to Spark?

I have the following Spark job:

from __future__ import print_function

import os
import sys
import time
from random import random
from operator import add
from pyspark.streaming import StreamingContext
from pyspark import SparkContext,SparkConf
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark_cassandra import streaming,CassandraSparkContext

if __name__ == "__main__":

    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)

    rdd=sc.cassandraTable("keyspace2","users").collect()
    #print rdd
    stream.start()
    stream.awaitTermination()
    sc.stop() 

When I run this, it gives me the following error:

ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: \
No output operations registered, so nothing to execute

The shell script I run with:

./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 examples/src/main/python/test/reading-cassandra.py

Comparing Spark Streaming with Kafka, I notice this line is missing from the code above:

kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1})

where I'm actually using createStream, but for Cassandra I can't see anything like this in the docs. How do I start streaming between Spark Streaming and Cassandra?

Versions:

Cassandra v2.1.12
Spark v1.4.1
Scala 2.10

To create a DStream out of a Cassandra table, you can use a ConstantInputDStream, providing the RDD created from the Cassandra table as its input. This will result in the RDD being materialized on each DStream batch interval.
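A minimal sketch of that pattern in Scala (ConstantInputDStream is part of the Spark Streaming Scala API and is not exposed by pyspark-cassandra), assuming the DataStax spark-cassandra-connector is on the classpath and that the keyspace/table names from the question exist:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import com.datastax.spark.connector._ // adds cassandraTable() to SparkContext

object CassandraStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Cassandra Streaming Test")
    val sc   = new SparkContext(conf)
    val ssc  = new StreamingContext(sc, Seconds(2))

    // RDD backed by the Cassandra table; it is re-materialized every batch
    val cassandraRDD = sc.cassandraTable("keyspace2", "users")
    val dstream = new ConstantInputDStream(ssc, cassandraRDD)

    // Registering an output operation such as foreachRDD is what avoids the
    // "No output operations registered, so nothing to execute" error
    dstream.foreachRDD { rdd =>
      println(s"Rows in keyspace2.users: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that the original PySpark job fails with that same error because it calls stream.start() without ever registering an output operation on a DStream; collect() on a plain RDD does not count.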

Be warned that large tables, or tables that continuously grow in size, will negatively impact the performance of your Streaming job.

See also: Reading from Cassandra using Spark Streaming for an example.
