
Two big files joined as a one-to-many relationship in Java Spark

I have two big files:

  1. an email file
  2. an attachment file

For simplicity, say the email file contains:
eId  emailcontent 
e1     xxxxxxxx
e2     yyyyyyyy
e3     zzzzzzzz

and the attachment file contains:
aid   attachmentcontent   eid
a1       att1             e1  
a2       att2             e1  
a3       att3             e2
a4       att4             e3
a5       att5             e3
a6       att6             e3

NOTE: A broadcast-variable join has already been performed between the email file and another, smaller file. Both of these files are too big for a broadcast variable to be used again.

I want to join these two files using JavaPairRDD with eid as the join column, but I can't see how to build a pair RDD keyed on eid, because multiple attachments are linked to the same eid.

I tried converting the JavaRDD<Email> and JavaRDD<Attachment> to Datasets and performing the join there, but Email is a complex class (its fields include lists of other classes), and the resulting Dataset contains no records.

Neither approach solves my problem, so I am looking for a solution I have not considered here, or for whatever I am missing in the two approaches above.

The problem was solved using JavaPairRDD.

For the email file I created a JavaPairRDD<eId, Email>, since eId is unique for each email. For the attachment file I created a JavaPairRDD<eId, Iterable<Attachment>>, since one eId can have multiple attachments (groupByKey returns an Iterable of values per key, not an Iterator). Here eId stands for the key, both as a type placeholder and as the value extracted from each record.

Then I created the JavaPairRDD for email:

    JavaPairRDD<eId, Email> rddEmail = emailRdd.mapToPair(record -> new Tuple2<>(eId, email));

and the JavaPairRDD for attachment:

    JavaPairRDD<eId, Iterable<Attachment>> rddAttachment = attachmentRdd.mapToPair(record -> new Tuple2<>(eId, attachment)).groupByKey();

Finally, I performed rddEmail.join(rddAttachment) and the remaining logic as required.
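
To make the shape of the result concrete, here is a minimal Spark-free sketch that models the same two steps (group the attachments by eId, then inner-join them with the emails) on the sample data from the question, using only java.util collections. It is an illustration of the semantics, not the Spark code itself: the Email, Attachment and EmailWithAttachments classes, their fields, and all method names are hypothetical stand-ins, since the real classes are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OneToManyJoinSketch {

    // Hypothetical stand-in for the real (more complex) Email class.
    static final class Email {
        final String eId;
        final String content;
        Email(String eId, String content) { this.eId = eId; this.content = content; }
    }

    // Hypothetical stand-in for the real Attachment class.
    static final class Attachment {
        final String aId;
        final String content;
        final String eId;
        Attachment(String aId, String content, String eId) {
            this.aId = aId; this.content = content; this.eId = eId;
        }
    }

    // One joined record: an email together with all of its attachments.
    static final class EmailWithAttachments {
        final Email email;
        final List<Attachment> attachments;
        EmailWithAttachments(Email email, List<Attachment> attachments) {
            this.email = email; this.attachments = attachments;
        }
    }

    // Models the groupByKey() step: collect every attachment under its eId.
    static Map<String, List<Attachment>> groupByEmailId(List<Attachment> attachments) {
        Map<String, List<Attachment>> grouped = new LinkedHashMap<>();
        for (Attachment a : attachments) {
            grouped.computeIfAbsent(a.eId, k -> new ArrayList<>()).add(a);
        }
        return grouped;
    }

    // Models the join() step: an inner join, so an email with no
    // attachments does not appear in the result.
    static Map<String, EmailWithAttachments> join(
            List<Email> emails, Map<String, List<Attachment>> grouped) {
        Map<String, EmailWithAttachments> joined = new LinkedHashMap<>();
        for (Email e : emails) {
            List<Attachment> atts = grouped.get(e.eId);
            if (atts != null) {
                joined.put(e.eId, new EmailWithAttachments(e, atts));
            }
        }
        return joined;
    }

    // Sample rows copied from the question.
    static List<Email> sampleEmails() {
        return Arrays.asList(
                new Email("e1", "xxxxxxxx"),
                new Email("e2", "yyyyyyyy"),
                new Email("e3", "zzzzzzzz"));
    }

    static List<Attachment> sampleAttachments() {
        return Arrays.asList(
                new Attachment("a1", "att1", "e1"),
                new Attachment("a2", "att2", "e1"),
                new Attachment("a3", "att3", "e2"),
                new Attachment("a4", "att4", "e3"),
                new Attachment("a5", "att5", "e3"),
                new Attachment("a6", "att6", "e3"));
    }

    static Map<String, EmailWithAttachments> joinSample() {
        return join(sampleEmails(), groupByEmailId(sampleAttachments()));
    }

    public static void main(String[] args) {
        // Prints one line per email: e1 has 2 attachments, e2 has 1, e3 has 3.
        joinSample().forEach((eId, row) ->
                System.out.println(eId + " -> " + row.attachments.size() + " attachment(s)"));
    }
}
```

Each joined record holds one email and the full list of its attachments, which is the same shape that rddEmail.join(rddAttachment) yields as a Tuple2 of Email and Iterable<Attachment>. Because this is an inner join, an email without any attachment would be dropped; if such emails must be kept, JavaPairRDD's leftOuterJoin is probably the closer fit.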
