
Struggling to handle deduplication after aggregation in Spark Streaming

1. Streaming data is coming from Kafka.
2. It is consumed through Spark Streaming.
3. Each record has firstname, lastname, userid and membername. I use the member names to get the member count, e.g. for mark,tyson,2,chris,lisa,iwanka the member count is 3.

I have to do the count; it is a requirement. But how can I remove duplicates after the aggregation? That is my concern.

  // Select the aggregated columns and print each micro-batch to the console
  val df2 = df.select("firstname", "lastname", "membercount", "userid")
  df2.writeStream.format("console").start().awaitTermination()

  or

  df3.select("*").where("membercount >= 3").dropDuplicates("userid")

  // This one is not working, but I need to deduplicate after the count,
  // so that the same userid does not come again in later batches.
  // I want only the first entry for each userid.

Batch 1 output

  firstname    lastname    member-count    userid

  john         smith       5               1
  mark         boucher     8               2
  shawn        pollock     3               3

Batch 2 output

  firstname    lastname    member-count          userid

  john         smith       7  (prev. count 5)    1
  shawn        pollock     12 (prev. count 8)    3
  chris        jordan      6                     4

But this is the batch 2 output I want:

  firstname    lastname    member-count    userid

  chris        jordan      6               4

The counts for john smith and shawn pollock may increase again in later batches, but I don't want to show or keep them in the output of those batches. That is, based on userid, I want each user to appear only once in the batch output, and the same user should be ignored in every later batch.
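
The requirement (emit a userid only in the first batch where it appears, and ignore it afterwards) can be illustrated with a minimal plain-Scala sketch that does not use Spark at all. The names `UserCount`, `FirstSeenFilter` and `firstTimeOnly` are hypothetical and exist only for this illustration; the sample rows are the ones from the two batches in the question.

```scala
import scala.collection.mutable

// Hypothetical record type mirroring the columns in the question.
final case class UserCount(firstname: String, lastname: String, memberCount: Int, userid: Int)

// Hypothetical helper: remembers which userids were already emitted.
final class FirstSeenFilter {
  private val seen = mutable.Set.empty[Int]

  // Keeps only rows whose userid was never emitted before.
  // mutable.Set.add returns true only when the element was not yet in the set,
  // so the same call both tests and records the userid.
  def firstTimeOnly(batch: Seq[UserCount]): Seq[UserCount] =
    batch.filter(row => seen.add(row.userid))
}

object FirstSeenDemo {
  def main(args: Array[String]): Unit = {
    val filter = new FirstSeenFilter

    val batch1 = Seq(
      UserCount("john", "smith", 5, 1),
      UserCount("mark", "boucher", 8, 2),
      UserCount("shawn", "pollock", 3, 3)
    )
    val batch2 = Seq(
      UserCount("john", "smith", 7, 1),
      UserCount("shawn", "pollock", 12, 3),
      UserCount("chris", "jordan", 6, 4)
    )

    // Batch 1: all three users are new, so all three rows are emitted.
    filter.firstTimeOnly(batch1).foreach(println)
    // Batch 2: userids 1 and 3 were already emitted, so only chris jordan remains.
    filter.firstTimeOnly(batch2).foreach(println)
  }
}
```

In Spark Structured Streaming, comparable logic could presumably run per micro-batch inside a `foreachBatch` sink. Note that a set held in driver memory, as in this sketch, would be lost on restart, so a real job would likely need the seen userids stored somewhere durable.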

Your question is hard to read, but as I understand it, you want a while loop with a condition?

// Count from 10 up to (but not including) 20
var a = 10
while (a < 20) {
  println("Value of a: " + a)
  a = a + 1
}

For example, this will print:

Value of a: 10
Value of a: 11
Value of a: 12
Value of a: 13
Value of a: 14
Value of a: 15
Value of a: 16
Value of a: 17
Value of a: 18
Value of a: 19
