
Dynamic filters with spark-streaming

I'm using spark-streaming for the following use case:

  1. I have a Kafka topic, data. From this topic I stream real-time data using Spark Structured Streaming and apply some filters to it. If the number of rows returned after applying the filters is greater than 1, the output is 1; otherwise the output is 0, along with some other data from the query.

    In simple terms, suppose I'm filtering the stream using:

     df.filter($A < 10) 

    where "A", "<" and "10" are dynamic and come from a database. In practice, these values arrive on a Kafka topic that I consume, and I update them in the db. So the query is not static and will be updated from time to time.

  2. Further, I'll have to apply some Boolean algebra operators to the results of the streams. For example:

     df.filter($A < 10) AND df.filter($B = 1) OR df.filter($C > 1)... and so on 

    Here, each atomic operation (like df.filter($A < 10)) returns either 0 or 1 as described above. The final result is saved to MongoDB.

I want to know whether both problems can be solved using Spark Structured Streaming. If not, can they be solved using RDDs?

Otherwise, can someone suggest a way to do this?

For the first case you can use a broadcast-variable-based approach as described in this answer. I've also had good luck using a per-executor transient value that is periodically refetched in each executor, as described in the second part of this answer.
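The periodically refetched value could look roughly like the sketch below. It is plain Scala with no Spark dependency, and every name in it (`Rule`, `RefreshingRule`, `matches`, the `load` function, the TTL) is a hypothetical illustration rather than anything from the linked answers. The assumption is that an instance would live in a Scala `object`, which is initialised once per JVM, so each executor would hold and refresh its own copy.

```scala
// Hypothetical shape of one dynamic filter, e.g. Rule("A", "<", 10.0).
final case class Rule(column: String, op: String, value: Double)

// Caches a Rule and reloads it once `ttlMillis` has elapsed.
// `load` stands in for whatever reads the current rule from the database;
// `clock` is injectable only so the refresh logic can be exercised in tests.
class RefreshingRule(
    load: () => Rule,
    ttlMillis: Long,
    clock: () => Long = () => System.currentTimeMillis()
) {
  @volatile private var cached: Rule = load()
  @volatile private var loadedAt: Long = clock()

  def current(): Rule = {
    val now = clock()
    if (now - loadedAt >= ttlMillis) synchronized {
      // Re-check inside the lock so concurrent tasks trigger only one reload.
      if (now - loadedAt >= ttlMillis) {
        cached = load()
        loadedAt = now
      }
    }
    cached
  }
}

object RuleEval {
  // Applies a rule to a single numeric field value.
  def matches(rule: Rule, x: Double): Boolean = rule.op match {
    case "<"   => x < rule.value
    case ">"   => x > rule.value
    case "="   => x == rule.value
    case other => throw new IllegalArgumentException(s"unsupported operator: $other")
  }
}
```

Inside a typed filter function, each record would then be checked against `current()`, so a rule change in the database is picked up at most one TTL later without restarting the query.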

For the second case you would use a single filter() call that implements the complete set of conditions that cause a message to be included in the output stream.
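One way to get to a single filter() call is to fold the dynamic conditions into one predicate string first. The sketch below is plain Scala and only builds that string; the types `Pred`, `Cmp`, `And`, `Or` and the `PredSql.render` helper are hypothetical names, and it assumes numeric comparison values only. Because the pieces come from a database, column names, operators and values are validated before being spliced into the expression.

```scala
// Hypothetical predicate tree: atomic comparisons combined with AND / OR.
sealed trait Pred
final case class Cmp(column: String, op: String, value: String) extends Pred
final case class And(left: Pred, right: Pred) extends Pred
final case class Or(left: Pred, right: Pred) extends Pred

object PredSql {
  private val allowedOps = Set("<", "<=", "=", ">=", ">", "!=")
  private val identifier = "[A-Za-z_][A-Za-z0-9_]*".r
  private val number     = "-?[0-9]+(\\.[0-9]+)?".r

  // Renders the tree as one fully parenthesised SQL-style boolean expression.
  def render(p: Pred): String = p match {
    case Cmp(c, op, v) =>
      require(identifier.pattern.matcher(c).matches(), s"bad column: $c")
      require(allowedOps(op), s"bad operator: $op")
      require(number.pattern.matcher(v).matches(), s"bad value: $v")
      s"($c $op $v)"
    case And(l, r) => s"(${render(l)} AND ${render(r)})"
    case Or(l, r)  => s"(${render(l)} OR ${render(r)})"
  }
}
```

For the example in the question, `PredSql.render(Or(And(Cmp("A", "<", "10"), Cmp("B", "=", "1")), Cmp("C", ">", "1")))` yields `(((A < 10) AND (B = 1)) OR (C > 1))`, which could then be passed to the string overload of `df.filter(...)`. Explicit parentheses avoid relying on AND/OR precedence when the rule set changes.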
