
Updating an RDD based on the values of another RDD

I want to update an RDD based on the values of another RDD. I've tried these three approaches:
1. a left join
2. subtractByKey followed by union
3. a map with if conditions inside it
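For reference, the second approach listed above might look roughly like the following. This is only a sketch on toy data: the local SparkContext, the names `base`, `updates`, and `merged`, and the choice to key each record by (userId, productId) are assumptions made for illustration, not details from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: a local Spark context, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("subtract-union").setMaster("local[*]"))

// Toy stand-ins, keyed by (userId, productId) so the pair-RDD operations apply.
val base: RDD[((Int, Int), Double)] = sc.parallelize(Seq(
  ((1, 1), 0.0), ((1, 2), 0.0), ((2, 1), 0.0), ((2, 2), 0.0)))
val updates: RDD[((Int, Int), Double)] = sc.parallelize(Seq(
  ((1, 1), 3.0), ((2, 2), 4.0)))

// Drop every key that has an update, then append the updated records.
val merged: RDD[((Int, Int), Double)] =
  base.subtractByKey(updates).union(updates)
```

Each key appears exactly once in `merged`, carrying the rating from `updates` when one exists and the original 0.0 otherwise.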

However, all three approaches are very slow.

Here's an example. rdd1 is built from the distinct userIDs and productIDs I have. For example, if I have user IDs from 0 to 100 and product IDs from 0 to 100, every pair initially gets a rating of 0:

rdd1 = [(1,1,0.0),(1,2,0.0),(1,3,0.0),...,(100,100,0.0)]
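One way such a zero-filled RDD could be built is with `cartesian` over the distinct IDs. The sketch below assumes a local SparkContext and hypothetical ID ranges of 1 to 100, and it keys each record by (userId, productId) so that later joins are possible; none of these details are confirmed by the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: a local Spark context, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("zero-grid").setMaster("local[*]"))

// Hypothetical ID ranges; in practice these would be the distinct IDs in the data.
val userIds: RDD[Int] = sc.parallelize(1 to 100)
val productIds: RDD[Int] = sc.parallelize(1 to 100)

// Every (user, product) pair, keyed for later joins, with a default rating of 0.0.
val rdd1: RDD[((Int, Int), Double)] =
  userIds.cartesian(productIds).map { case (u, p) => ((u, p), 0.0) }
```

Note that the size of this RDD is the product of the two ID counts, so it grows quickly as the number of users and products increases.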

rdd2 then contains the ratings for specific userIDs and productIDs:

rdd2 = [(1,1,3.0),(100,100,4.0)]

What I want is to include all userIDs and productIDs in the matrix for collaborative filtering, even where there is no corresponding rating. I need this in order to use explicit ALS in Spark MLlib. If I don't fill in the zeros, I get nonsensical results, because the explicit-feedback code does not handle unobserved values: they are treated as missing rather than as zero.
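For context, explicit ALS in MLlib takes an `RDD[Rating]`, so the (userId, productId, rating) triples need converting first. The following is a minimal sketch on toy data; the local SparkContext, the name `triples`, and the hyperparameter values are assumptions picked only so the example runs quickly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Assumption: a local Spark context, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("als-input").setMaster("local[*]"))

// Toy stand-in for the zero-filled, updated data: (userId, productId, rating).
val triples: RDD[(Int, Int, Double)] = sc.parallelize(Seq(
  (1, 1, 3.0), (1, 2, 0.0), (2, 1, 0.0), (2, 2, 4.0)))

// Wrap each triple in MLlib's Rating case class.
val ratings: RDD[Rating] = triples.map { case (u, p, r) => Rating(u, p, r) }

// Hypothetical hyperparameters: rank = 2, iterations = 5, lambda = 0.1.
val model = ALS.train(ratings, 2, 5, 0.1)
```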

In short, I want to produce this RDD:

rdd = [(1,1,3.0),(1,1,0.0),(1,2,0.0),...,(100,100,4.0)]

Do you have any ideas on the fastest way to do this in terms of running time? I have two RDDs with millions of entries to use in the update.

You can simply do:

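// Assumes rdd1 and rdd2 are RDD[(Int, Int, Double)]; key both by (userId, productId).
val res: RDD[((Int, Int), Double)] =
  rdd1.map { case (u, p, r) => ((u, p), r) }
      .leftOuterJoin(rdd2.map { case (u, p, r) => ((u, p), r) })
      .mapValues { case (v, wOpt) => wOpt.getOrElse(v) }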
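A self-contained version of this join on toy data is sketched below. It also partitions both sides with the same `HashPartitioner` before joining, which may reduce shuffling when the zero-filled RDD is reused across several updates; whether it actually helps depends on the data and cluster. The local SparkContext, the partition count, and the names `zeros`, `observed`, and `updated` are assumptions for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: a local Spark context, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("update-by-join").setMaster("local[*]"))

// Hypothetical partition count; tune to the cluster and data size.
val partitioner = new HashPartitioner(8)

// Small stand-ins for the question's rdd1 (all zeros) and rdd2 (observed ratings).
val zeros: RDD[((Int, Int), Double)] = sc
  .parallelize(for (u <- 1 to 3; p <- 1 to 3) yield ((u, p), 0.0))
  .partitionBy(partitioner)
  .cache()
val observed: RDD[((Int, Int), Double)] = sc
  .parallelize(Seq(((1, 1), 3.0), ((3, 3), 4.0)))
  .partitionBy(partitioner)

// Keep every key from zeros; take the observed rating where one exists.
val updated: RDD[((Int, Int), Double)] =
  zeros.leftOuterJoin(observed).mapValues { case (v, wOpt) => wOpt.getOrElse(v) }
```

Calling `updated.collectAsMap()` on this toy input gives nine entries, with 3.0 at (1, 1), 4.0 at (3, 3), and 0.0 everywhere else.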
