How to save data in cassandra table using spark python?

I'm trying to create a consumer-producer application.

The producer will publish data to a particular topic. The consumer will consume this data from the same topic, process it using the Spark API, and store it in a Cassandra table.

The incoming data arrives as strings in the following format:

100=NO|101=III|102=0.0771387731911|103=-0.7076915761
100=NO|101=AAA|102=0.8961325446464|103=-0.5465463154
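Each record is a pipe-delimited list of key=value pairs, so it can be parsed into a dict before being mapped to table columns. A minimal sketch in plain Python (the field tags are taken from the sample records above):

```python
def parse_record(line):
    """Split a pipe-delimited record of key=value pairs into a dict."""
    fields = {}
    for pair in line.split("|"):
        key, _, value = pair.partition("=")
        fields[key.strip()] = value.strip()
    return fields

record = parse_record("100=NO|101=III|102=0.0771387731911|103=-0.7076915761")
# record["100"] is "NO", record["103"] is "-0.7076915761"
```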

I created the consumer as follows:

import uuid

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def main():

    sc = SparkContext(appName="StreamingContext")
    ssc = StreamingContext(sc, 3)

    kafka_stream = KafkaUtils.createStream(ssc, "localhost:2181", "sample-kafka-app", {"NO-topic": 1})
    raw = kafka_stream.flatMap(lambda kafkaS: [kafkaS])
    clean = raw.map(lambda xs: xs[1].split("|"))
    my_row = clean.map(lambda x: {
       "pk": uuid.uuid4(),  # generate a real UUID; the literal string "uuid()" would not match the uuid column type
       "a": x[0],
       "b": x[1],
       "c": x[2],
       "d": x[3],
    })

    my_row.saveToCassandra("users", "data")
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()

if __name__ == "__main__":
    main()

Cassandra table structure:

CREATE TABLE users.data (
    pk uuid PRIMARY KEY,
    a text,
    b text,
    c text,
    d text
)

cqlsh:users> select * from data;

 pk | a | b | c | d
----+---+---+---+---

I'm facing the error below:

Traceback (most recent call last):
  File "consumer_no.py", line 84, in <module>
    main()
  File "consumer_no.py", line 53, in main
    my_row.saveToCassandra("users", "data")
AttributeError: 'TransformedDStream' object has no attribute 'saveToCassandra'
17/04/04 14:29:22 INFO SparkContext: Invoking stop() from shutdown hook

Am I going about this the right way? If not, please suggest how to achieve it; if so, what is wrong or missing in the code above?

Rather than trying to save the TransformedDStream to Cassandra directly, you should save each RDD from that DStream to Cassandra.

Your code should work if you do something like this:

my_row.foreachRDD(lambda x: x.saveToCassandra("users", "data"))
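If `saveToCassandra` is also unavailable on RDDs in your PySpark build, another option is to write each partition yourself with the DataStax Python driver (`cassandra-driver`) inside `foreachRDD`. This is a sketch under assumptions, not the connector's official path: it assumes Cassandra is reachable at 127.0.0.1 with the `users` keyspace and `data` table from the question, and `my_row` is the DStream of dicts built above.

```python
import uuid

def build_insert_args(row):
    """Map a row dict with keys a, b, c, d to the (pk, a, b, c, d) tuple
    bound to the prepared INSERT statement; pk is freshly generated."""
    return (uuid.uuid4(), row["a"], row["b"], row["c"], row["d"])

def save_partition(rows):
    # Imported inside the function so the driver is loaded on each worker
    # process rather than pickled from the driver program.
    from cassandra.cluster import Cluster
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("users")
    insert = session.prepare(
        "INSERT INTO data (pk, a, b, c, d) VALUES (?, ?, ?, ?, ?)")
    for row in rows:
        session.execute(insert, build_insert_args(row))
    cluster.shutdown()

# Wiring into the streaming job (my_row is the DStream from the question):
# my_row.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))
```

Opening one session per partition (rather than per record) keeps connection overhead down; a prepared statement lets Cassandra parse the INSERT once and reuse it for every row.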
