Joining two big files in a one-to-many relationship in Java Spark
I have two big files.
For simplicity, say the email file contains:
eId emailcontent
e1 xxxxxxxx
e2 yyyyyyyy
e3 zzzzzzzz
The attachment file contains:
aid attachmentcontent eid
a1 att1 e1
a2 att2 e1
a3 att3 e2
a4 att4 e3
a5 att5 e3
a6 att6 e3
NOTE: A broadcast-variable join has already been performed between the email file and another, small file. Both of the files here are too big for a broadcast variable to be used again.
I want to join these two files using JavaPairRDD with eid as the join column, but I can't make a pair RDD keyed on eid, because multiple attachments are linked to the same eid key.
I tried to convert the JavaRDD<Email> and JavaRDD<Attachment> to Datasets and perform the join there, but Email is a complex class (it contains lists of other classes as fields), so the converted Dataset does not contain any records.
Neither of these two approaches solves my problem, so I am looking for any solution not considered here, or for anything I am missing in the approaches above.
The above problem was solved using JavaPairRDD.
For the email file I created a JavaPairRDD<eId, Email>, as eId is unique for each email, and for the attachment file a JavaPairRDD<eId, Iterable<Attachment>>, as each eId can have multiple attachments. (Here eId stands for the key's type, for example String; groupByKey returns an Iterable, not an Iterator.)
Then I created the JavaPairRDD for email, where getEId() stands for however the id is read from a record: JavaPairRDD<eId, Email> rddEmail = emailRdd.mapToPair(record -> new Tuple2<>(record.getEId(), record));
and the JavaPairRDD for attachment, grouping all attachments of one email under its key (getEId() is again a placeholder accessor): JavaPairRDD<eId, Iterable<Attachment>> rddAttachment = attachmentRdd.mapToPair(record -> new Tuple2<>(record.getEId(), record)).groupByKey();
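To make the grouping step concrete, here is a minimal plain-Java sketch of what mapToPair(...).groupByKey() does to the sample attachment rows. It is a local stand-in that needs no Spark, not the Spark implementation itself. The Attachment class, its field names, and the helper names groupByEid and sampleAttachments are assumptions made for illustration; the real classes are not shown in the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for mapToPair(...).groupByKey(); runs without Spark.
class GroupByKeySketch {

    // Hypothetical minimal Attachment; the real class is not shown in the post.
    static final class Attachment {
        final String aid;
        final String content;
        final String eid;

        Attachment(String aid, String content, String eid) {
            this.aid = aid;
            this.content = content;
            this.eid = eid;
        }
    }

    // Collects all attachments that share an eid under that single key.
    static Map<String, List<Attachment>> groupByEid(List<Attachment> attachments) {
        Map<String, List<Attachment>> grouped = new LinkedHashMap<>();
        for (Attachment a : attachments) {
            grouped.computeIfAbsent(a.eid, k -> new ArrayList<>()).add(a);
        }
        return grouped;
    }

    // The sample attachment rows from the question.
    static List<Attachment> sampleAttachments() {
        return Arrays.asList(
            new Attachment("a1", "att1", "e1"),
            new Attachment("a2", "att2", "e1"),
            new Attachment("a3", "att3", "e2"),
            new Attachment("a4", "att4", "e3"),
            new Attachment("a5", "att5", "e3"),
            new Attachment("a6", "att6", "e3"));
    }

    public static void main(String[] args) {
        Map<String, List<Attachment>> grouped = groupByEid(sampleAttachments());
        for (Map.Entry<String, List<Attachment>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " attachment(s)");
        }
    }
}
```

On the sample data this leaves three keys: e1 with two attachments, e2 with one, and e3 with three. In Spark, groupByKey shuffles every value across the cluster, so on big inputs it may be worth checking whether the grouping is needed at all before paying for it.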
Finally, I performed rddEmail.join(rddAttachment) and the remaining logic as required.
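The result of the join can be sketched locally in plain Java, under the assumption that emails and attachments are reduced to their content strings. JavaPairRDD.join is an inner join, so a key present on only one side is dropped. The row e4 below is a hypothetical email with no attachments, added only to show that behaviour; the names innerJoin, sampleEmails, and sampleAttachmentsByEid are likewise illustrative and not part of any Spark API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for rddEmail.join(rddAttachment); runs without Spark.
class OneToManyJoinSketch {

    // Inner join on the key: only keys present on both sides survive.
    static <E, A> Map<String, Map.Entry<E, List<A>>> innerJoin(
            Map<String, E> emails, Map<String, List<A>> attachmentsByEid) {
        Map<String, Map.Entry<E, List<A>>> joined = new LinkedHashMap<>();
        for (Map.Entry<String, E> e : emails.entrySet()) {
            List<A> atts = attachmentsByEid.get(e.getKey());
            if (atts != null) {
                joined.put(e.getKey(), new SimpleEntry<>(e.getValue(), atts));
            }
        }
        return joined;
    }

    // Sample emails from the question, plus a hypothetical e4 without attachments.
    static Map<String, String> sampleEmails() {
        Map<String, String> emails = new LinkedHashMap<>();
        emails.put("e1", "xxxxxxxx");
        emails.put("e2", "yyyyyyyy");
        emails.put("e3", "zzzzzzzz");
        emails.put("e4", "wwwwwwww"); // assumption: not in the original data
        return emails;
    }

    // Sample attachments from the question, already grouped by eid.
    static Map<String, List<String>> sampleAttachmentsByEid() {
        Map<String, List<String>> atts = new LinkedHashMap<>();
        atts.put("e1", Arrays.asList("att1", "att2"));
        atts.put("e2", Arrays.asList("att3"));
        atts.put("e3", Arrays.asList("att4", "att5", "att6"));
        return atts;
    }

    public static void main(String[] args) {
        Map<String, Map.Entry<String, List<String>>> joined =
            innerJoin(sampleEmails(), sampleAttachmentsByEid());
        for (Map.Entry<String, Map.Entry<String, List<String>>> row : joined.entrySet()) {
            System.out.println(row.getKey() + ": " + row.getValue().getKey()
                + " with " + row.getValue().getValue());
        }
    }
}
```

If emails without attachments must be kept, Spark's leftOuterJoin is the usual alternative to join. Note also that a pair RDD may hold the same key many times, so joining the ungrouped attachment pairs directly would work too; it would yield one (email, attachment) row per attachment instead of one row per email.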