
Spark-Streaming Comparison of records

How do I compare a received record with the previous record for the same key in Spark Structured Streaming? Can this be done using groupByKey and mapGroupsWithState?

// Sample code from the Spark Definitive Guide, chained on a keyed Dataset
groupByKey(user)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateAcrossEvents)

One more question arises when we perform the above operations: I don't think the sequence of records will be maintained. As records are received they are partitioned and stored across worker nodes; when we apply groupByKey a shuffle happens, and all records with the same key end up on the same worker node, but the sequence is not preserved.
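If ordering matters, one common workaround is to carry an event timestamp on each record and sort a key's events by that timestamp before comparing consecutive ones, since the shuffle gives no arrival-order guarantee. A minimal standalone sketch of that idea, with an assumed record shape (`user`, `eventTime`, `value` are illustrative names, not from the original code):

```scala
// Assumed record shape; field names are illustrative.
case class Event(user: String, eventTime: Long, value: Int)

object OrderedCompare {
  // Sort a key's events by timestamp, then pair each event with its
  // predecessor so consecutive records are compared deterministically,
  // regardless of the order the shuffle delivered them in.
  def changes(events: Seq[Event]): Seq[(Event, Event)] = {
    val ordered = events.sortBy(_.eventTime)
    ordered.zip(ordered.drop(1)).filter { case (prev, curr) => prev.value != curr.value }
  }

  def main(args: Array[String]): Unit = {
    val out = changes(Seq(
      Event("u1", 3, 10), Event("u1", 1, 5), Event("u1", 2, 5)
    ))
    println(out) // only the consecutive pair where the value changed
  }
}
```

Event-time sorting only helps within whatever batch of events the group function sees at once; across triggers you still need state, as the answer below describes.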

You can use mapGroupsWithState for this. You will have to save the previous record in the group state and compare it with the incoming record.
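A runnable sketch of that pattern, with Spark's `GroupState[S]` modeled as a plain mutable holder so the logic can be shown standalone. The record shape and the comparison are assumptions; in a real job this function body would sit inside the function passed to mapGroupsWithState, using the real `org.apache.spark.sql.streaming.GroupState`:

```scala
// Assumed record shape; in a real job this is your input case class.
case class Record(user: String, value: Int)

// Tiny stand-in for Spark's GroupState[S]: just enough to show the pattern.
class FakeState[S](private var opt: Option[S] = None) {
  def exists: Boolean = opt.isDefined
  def get: S = opt.get
  def update(s: S): Unit = opt = Some(s)
}

object CompareWithPrevious {
  // Compare the incoming record with the previously stored one for this key,
  // then remember the incoming record for the next invocation.
  def compare(incoming: Record, state: FakeState[Record]): Option[Boolean] = {
    val changed =
      if (state.exists) Some(state.get.value != incoming.value) else None
    state.update(incoming) // the latest record becomes the new "previous"
    changed
  }

  def main(args: Array[String]): Unit = {
    val state = new FakeState[Record]()
    println(compare(Record("u1", 5), state)) // None: nothing to compare yet
    println(compare(Record("u1", 5), state)) // Some(false): unchanged
    println(compare(Record("u1", 9), state)) // Some(true): value changed
  }
}
```

With `GroupStateTimeout.NoTimeout`, as in the snippet above, the stored "previous" record stays in the state store indefinitely; consider a timeout if the key space is unbounded.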

What do you use as your source? If the source is Kafka, you will have to partition the Kafka topic by the key that you are using.
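The reason keying the topic matters: Kafka's default partitioner maps every non-null key deterministically to one partition (it hashes the key bytes with murmur2), so all records for a given user land in the same partition and are consumed in order. A simplified stand-in for that mapping, using `hashCode` instead of Kafka's actual murmur2 purely to illustrate the determinism:

```scala
object KeyPartitioning {
  // Simplified model of Kafka's key-based partitioning: a stable hash of the
  // key modulo the partition count. Kafka itself uses murmur2(keyBytes);
  // hashCode here only illustrates that the mapping is deterministic.
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val p1 = partitionFor("user-42", 6)
    val p2 = partitionFor("user-42", 6)
    println(p1 == p2) // true: the same key always maps to the same partition
  }
}
```

Note that this per-key ordering only holds within one partition; once Spark shuffles the data with groupByKey, you are back to relying on event timestamps or group state, as discussed above.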
