
How to integrate Spark Streaming with Tensorflow?

Objective: Continuously feed sniffed network packets into a Kafka producer, connect it to Spark Streaming to process the packet data, and then use the preprocessed data in TensorFlow or Keras.

I'm processing continuous data from Kafka in Spark Streaming (PySpark), and now I want to send the processed data to TensorFlow. How can I use these transformed DStreams in TensorFlow with Python? Thanks.

Currently no processing is applied in Spark Streaming, but it will be added later. Here's the Python code:

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.conf import SparkConf
from datetime import datetime

if __name__ == '__main__':
    sc = SparkContext(appName='Kafkas')
    ssc = StreamingContext(sc, 2)  # 2-second micro-batches
    brokers, topic = sys.argv[1:]
    # Direct (receiver-less) Kafka stream; each record is a (key, value) pair
    kvs = KafkaUtils.createDirectStream(ssc, [topic],
                                        {'metadata.broker.list': brokers})
    lines = kvs.map(lambda x: x[1])  # keep only the message value
    lines.pprint()
    ssc.start()
    ssc.awaitTermination()

I start the Spark Streaming job with:

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 \
  spark-kafka.py localhost:9092 topic
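For context on the eventual bridge to TensorFlow: the usual pattern is `lines.foreachRDD(lambda rdd: process_batch(rdd.collect()))`, where a handler parses each micro-batch and hands it to the model. A minimal sketch of such a handler (plain Python, no Spark required; the JSON record format, the field names, and `feature_buffer` are assumptions for illustration):

```python
import json

# Hypothetical per-micro-batch handler. In Spark you would wire it up with:
#   lines.foreachRDD(lambda rdd: process_batch(rdd.collect()))
# Assumes each Kafka record value is a JSON object with numeric packet fields.

feature_buffer = []  # accumulates preprocessed rows until the model consumes them

def process_batch(records):
    """Parse one micro-batch of Kafka values and buffer the feature vectors."""
    rows = []
    for rec in records:
        msg = json.loads(rec)
        # Keep a fixed, ordered subset of fields as the feature vector.
        rows.append([msg["packet_len"], msg["ttl"], msg["port"]])
    feature_buffer.extend(rows)
    return len(rows)

# Example micro-batch, as it might arrive from kvs.map(lambda x: x[1]):
batch = [
    '{"packet_len": 60, "ttl": 64, "port": 443}',
    '{"packet_len": 1500, "ttl": 128, "port": 80}',
]
print(process_batch(batch))  # → 2
print(feature_buffer)
```

Note that `rdd.collect()` pulls the whole micro-batch to the driver, which is fine for small batches but won't scale; for larger volumes, write the batch out (see the answer below for Parquet) and let the training job read it.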

You have two ways to solve your problem:

  1. Once you've processed your data, you can save it and then independently run your model (in Keras?). Just create a Parquet file, or append to it if it already exists:

     if os.path.isdir(DATA_TREATED_PATH):
         data.write.mode('append').parquet(DATA_TREATED_PATH)
     else:
         data.write.parquet(DATA_TREATED_PATH)

Then you create your model with Keras/TensorFlow and run it every hour, say, or as often as you want it to be updated. So it is retrained from scratch every time.

  2. You process your data and save it as before, but after that you load your model, train it on the new data / new batch, and then save the model. This is called online learning, because you don't retrain your model from scratch.
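A minimal sketch of that online-learning loop, using a tiny plain-Python SGD linear model as a hypothetical stand-in for a Keras model (with Keras you would use `load_model`, `train_on_batch`, and `save` instead; `MODEL_PATH`, the learning rate, and the toy data are assumptions):

```python
import os
import pickle

MODEL_PATH = "model.pkl"  # hypothetical location for the persisted model

def load_or_init_model():
    """Load the previously saved model, or start fresh on the first run."""
    if os.path.isfile(MODEL_PATH):
        with open(MODEL_PATH, "rb") as f:
            return pickle.load(f)
    return {"w": 0.0, "b": 0.0}  # tiny linear model: y = w*x + b

def train_on_batch(model, xs, ys, lr=0.01):
    """One SGD pass over the new micro-batch only (no retraining from scratch)."""
    for x, y in zip(xs, ys):
        err = (model["w"] * x + model["b"]) - y
        model["w"] -= lr * err * x
        model["b"] -= lr * err
    return model

def save_model(model):
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(model, f)

# One online-learning step: load -> train on the new batch only -> save.
model = load_or_init_model()
model = train_on_batch(model, xs=[1.0, 2.0, 3.0], ys=[2.0, 4.0, 6.0])
save_model(model)
print(model["w"])  # weight has moved toward the true slope (2.0)
```

Each new batch from the stream repeats the load/train/save cycle, so the model keeps improving incrementally instead of being rebuilt every run.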
