
Spark Streaming historical state

I am building real-time processing for detecting fraudulent ATM card transactions. To detect fraud efficiently, the logic requires the last transaction date per card and the sum of transaction amounts per day (or over the last 24 hours).

One use case: if a card transaction occurs outside the card's native country more than 30 days after the last transaction in that country, send an alert as possible fraud.

So I tried to look at Spark Streaming as a solution. To achieve this (I am probably missing some idea about functional programming), below is my pseudocode:

stream = ssc.receiverStream(...)   // input receiver
s1 = stream.mapToPair(...)         // creates (card number, transaction date) pairs
s2 = s1.reduceByKey(...)           // keeps the latest transaction date per card
s2.checkpoint(new Duration(1000));
s2.persist();
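As a side note on the "sum of transaction amount by day (or last 24 hours)" requirement, Spark Streaming's windowed reductions are the usual fit. A minimal Java sketch, assuming a hypothetical amounts stream of (card number, amount) pairs (not part of the original pseudocode):

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// amounts: JavaPairDStream<String, Double> of (card number, transaction amount).
// Sum the amounts per card over a sliding 24-hour window, recomputed every
// 5 minutes (the slide interval here is an arbitrary choice).
JavaPairDStream<String, Double> sumLast24h = amounts.reduceByKeyAndWindow(
    (a, b) -> a + b,                 // associative reduce: add the amounts
    Durations.minutes(60 * 24),      // window length: 24 hours
    Durations.minutes(5));           // slide interval

Note that both the window length and the slide interval must be multiples of the batch interval of the streaming context.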

I am facing a few problems here:

1) How can I use this last transaction date later, for comparison against future transactions from the same card?
2) How can I persist the data, so that even if the driver program restarts, the old values of s2 are restored?
3) Can updateStateByKey be used to maintain historical state?

I think I am missing a key point of Spark Streaming / functional programming about how to implement this kind of logic.

If you are using Spark Streaming you shouldn't really save your state in a file, especially if you plan to run your application 24/7. If that is not your intention, you will probably be fine with a plain Spark application, since you would face only a big-data computation and not computation over batches arriving in real time.

Yes, updateStateByKey can be used to maintain state across batches, but it has a particular signature that you can see in the docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions
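To make that concrete, here is a minimal Java sketch of how updateStateByKey could keep the latest transaction timestamp per card. The s1 stream of (card number, timestamp) pairs and its types are assumptions for illustration; also note that updateStateByKey requires a checkpoint directory to be set on the streaming context.

import java.util.List;
import org.apache.spark.api.java.Optional;    // Spark 2.x; Spark 1.x used Guava's Optional
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// s1: JavaPairDStream<String, Long> of (card number, transaction timestamp).
// For each card, merge the timestamps seen in the current batch with the
// previously stored state and keep the maximum, i.e. the latest date.
Function2<List<Long>, Optional<Long>, Optional<Long>> updateLastTxnDate =
    (newTimestamps, state) -> {
        long last = state.isPresent() ? state.get() : 0L;
        for (Long ts : newTimestamps) {
            last = Math.max(last, ts);
        }
        return Optional.of(last);
    };

JavaPairDStream<String, Long> lastTxnDateByCard = s1.updateStateByKey(updateLastTxnDate);

The resulting state DStream carries, in every batch, the most recent transaction date for every card seen so far, which can then be joined against the incoming batch for the 30-day comparison in the question.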

Also, persist() is just a form of caching; it doesn't actually persist your data on disk (like in a file).
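Regarding question 2, what survives a driver restart is the checkpoint data, not persist()'s cache. A minimal sketch of the recovery pattern, assuming a hypothetical HDFS checkpoint path and app name:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FraudDetectionApp {  // hypothetical class name
    public static void main(String[] args) throws Exception {
        String checkpointDir = "hdfs:///checkpoints/fraud-detection";  // assumed path

        // On a clean start the factory below builds a fresh context; after a
        // driver restart, getOrCreate instead rebuilds the context, including
        // updateStateByKey state, from the checkpoint data.
        JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
            SparkConf conf = new SparkConf().setAppName("fraud-detection");
            JavaStreamingContext newSsc = new JavaStreamingContext(conf, Durations.seconds(10));
            newSsc.checkpoint(checkpointDir);
            // ... define the receiver stream, mapToPair, updateStateByKey,
            //     and output operations here ...
            return newSsc;
        });

        ssc.start();
        ssc.awaitTermination();
    }
}

The important detail is that the whole DStream graph must be defined inside the factory function, so that it can be restored as a unit from the checkpoint.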

I hope this clarifies some of your doubts.
