
How to evaluate Spark DStream objects with a Spark DataFrame

I am writing a Spark app where I need to evaluate the streaming data against historical data, which sits in a SQL Server database.

Now the idea is that Spark will fetch the historical data from the database, persist it in memory, and evaluate the streaming data against it.

Now I am getting the streaming data as follows:

import re
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext,functions as func,Row


sc = SparkContext("local[2]", "realtimeApp")
ssc = StreamingContext(sc,10)
files = ssc.textFileStream("hdfs://RealTimeInputFolder/")

######## Let's get the data from the db which is relevant for streaming ###

driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
dataurl = "jdbc:sqlserver://myserver:1433"
db = "mydb"
table = "stream_helper"
credential = "my_credentials"

########basic data for evaluation purpose ########



files_count = files.flatMap(lambda file: file.split())

pattern = '(TranAmount=Decimal.{2})(.[0-9]*.[0-9]*)(\\S+ )(TranDescription=u.)([a-zA-Z\\s]+)([\\S\\s]+ )(dSc=u.)([A-Z]{2}.[0-9]+)'


tranfiles = "wasb://myserver.blob.core.windows.net/RealTimeInputFolder01/"

def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']


def pre_parse(logline):
    """
    to read files as rows of sql in pyspark streaming using the pattern.
    a 0/1 flag is attached so that lines this pattern fails to parse
    can be filtered out later
    """
    match = re.search(pattern, logline)
    if match is None:
        return (logline, 0)
    else:
        return (
            Row(
                customer_id=match.group(8),
                trantype=match.group(5),
                amount=float(match.group(2))
            ), 1)


def log_failures(rdd):
    # DStream.count() returns another DStream, not a number, so the
    # non-parsed records are counted per micro-batch here instead
    print "no of non parsed records : %d" % rdd.count()


def parse():
    """
    actual processing is happening here
    """
    parsed_tran = ssc.textFileStream(tranfiles).map(pre_parse)
    success = parsed_tran.filter(lambda s: s[1] == 1).map(lambda x: x[0])
    fail = parsed_tran.filter(lambda s: s[1] == 0).map(lambda x: x[0])
    fail.foreachRDD(log_failures)

    return success, fail

success, fail = parse()

Now I want to evaluate it against the DataFrame that I get from the historical data:

base_data = sqlContext.read.format("jdbc").options(driver=driver,url=dataurl,database=db,user=credential,password=credential,dbtable=table).load()
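
What I have in mind for keeping that historical data in memory is roughly the following (just a sketch, assuming the JDBC load above works; the registerTempTable call is only there in case SQL-style lookups are needed later):

# keep the historical data in executor memory so that every micro-batch
# can be evaluated against it without re-reading SQL Server
base_data.cache()
base_data.registerTempTable("stream_helper")  # optional, for SQL-style queries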

Now, since this is returned as a DataFrame, how do I use it for my purpose? The streaming programming guide says:
"You have to create a SQLContext using the SparkContext that the StreamingContext is using."

This makes me even more confused about how to use the existing DataFrame together with the streaming object. Any help is highly appreciated.

To manipulate DataFrames, you always need a SQLContext, so you can instantiate it like:

sc = SparkContext("local[2]", "realtimeApp")
sqlc = SQLContext(sc)
ssc = StreamingContext(sc, 10)

These two contexts (SQLContext and StreamingContext) can coexist in the same job because they are associated with the same SparkContext. But keep in mind that you cannot instantiate two different SparkContexts in the same job.

Once you have created a DataFrame from your DStream, you can join your historical DataFrame with the DataFrame created from the stream. To do that, I would do something like:

yourDStream.foreachRDD(lambda rdd: sqlContext
    .createDataFrame(rdd)
    .join(historicalDF, ...)
    ...
)
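
Fleshed out a little, and reusing the getSqlContextInstance helper and the success DStream of Row objects from the question (base_data being the historical DataFrame loaded over JDBC), a per-batch join could look roughly like the sketch below; the customer_id join key is only an assumption about what the stream_helper table contains, so use whatever column actually links the two datasets:

def process_batch(time, rdd):
    # called on the driver once per micro-batch
    if rdd.isEmpty():
        return
    sqlc = getSqlContextInstance(rdd.context)   # singleton SQLContext from the question
    stream_df = sqlc.createDataFrame(rdd)       # the rdd already holds Row objects
    # hypothetical join key: assumes both sides carry a customer_id column
    evaluated = stream_df.join(base_data, "customer_id")
    evaluated.show()

success.foreachRDD(process_batch)

ssc.start()
ssc.awaitTermination()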

Think about the amount of streamed data you need for the join when you manipulate streams; you may be interested in the windowed operations.
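
For example, a windowed variant of the same join (a sketch, assuming the 10-second batch interval from the question and the process_batch helper above) would evaluate the last 60 seconds of parsed records every 20 seconds instead of each individual batch:

# window and slide durations must be multiples of the 10-second batch interval
windowed = success.window(60, 20)
windowed.foreachRDD(process_batch)   # same per-batch join as sketched above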
