How to pass one RDD in another RDD through .map

I have two RDDs, and I want to do some computation on RDD2's items for each item of RDD1. So I am passing RDD2 into a user-defined function as below, but I am getting an error like rdd1 cannot be passed in another rdd . How can I achieve this if I want to perform operations on two RDDs?

For example:

RDD1.map(line => function(line, RDD2))

Nesting RDDs is not supported by Spark, as the error says. Usually you have to work around it by redesigning your algorithm.

How to do that depends on the actual use case: what exactly happens inside function and what its output is.

Sometimes RDD1.cartesian(RDD2) , doing the operation per tuple and then reducing by key, will work. Sometimes, if both RDDs hold (K,V) pairs, a join between them will work, as in the sketch below.
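As a minimal sketch of the join approach (the key and value types here are my own illustrative assumptions, not from the question):

// Illustrative only: two keyed RDDs; join pairs up values per key without nesting
val keyed1 = sc.parallelize(Seq(("a", 1), ("b", 2)))    // RDD[(String, Int)]
val keyed2 = sc.parallelize(Seq(("a", 10), ("b", 20)))  // RDD[(String, Int)]
val joined = keyed1.join(keyed2)                        // RDD[(String, (Int, Int))]
val combined = joined.mapValues { case (v1, v2) => v1 + v2 } // any per-pair computation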

If RDD2 is small you can always collect it in the driver, make it a broadcast variable, and use that variable in function instead of RDD2 .

@Edit:

For example's sake, let's assume your RDDs hold strings and that function counts how many times a given record from RDD1 occurs in RDD2 :

import org.apache.spark.rdd.RDD

def function(line: String, rdd: RDD[String]): (String, Int) = {
  // count returns a Long, so convert it to match the declared (String, Int)
  (line, rdd.filter(_ == line).count.toInt)
}

This would return an RDD[(String, Int)] .

Idea1

You can try a cartesian product via RDD's cartesian method.

val cartesianProduct = RDD1.cartesian(RDD2)                                  // creates RDD[(String, String)]
                           .map { case (r1, r2) => (r1, function2(r1, r2)) } // creates RDD[(String, Int)]
                           .reduceByKey((c1, c2) => c1 + c2)                 // final RDD[(String, Int)]

Here function2 takes r1 and r2 (which are strings) and returns 1 if they are equal and 0 if not. The final map results in an RDD of tuples where the key is a record from RDD1 and the value is the total count.
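A minimal sketch of function2 as just described (the exact equality check is an assumption on my part):

// Returns 1 when the two records match, 0 otherwise
def function2(r1: String, r2: String): Int = if (r1 == r2) 1 else 0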

Problem1: This would NOT work if you have duplicate strings in RDD1 , though. You'd have to think about it; if RDD1 records had some unique ids, that would be perfect (see the sketch below).
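One hedged way around the duplicates (my own workaround, not part of the original answer) is to tag each RDD1 record with a generated id via zipWithUniqueId and count per (id, record) key:

// Hypothetical workaround: make each RDD1 record unique with a generated id
val withIds = RDD1.zipWithUniqueId()   // RDD[(String, Long)]
val counts = withIds.cartesian(RDD2)
                    .map { case ((line, id), r2) => ((id, line), if (line == r2) 1 else 0) }
                    .reduceByKey(_ + _) // one count per original RDD1 record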

Problem2: this does create A LOT of pairs (for 1 million records in each RDD it would create about 10^12 pairs), would be slow, and would most probably cause a lot of shuffling.

Idea2

I didn't understand your comment regarding RDD2's size ("lacs"), so this might or might not work:

val rdd2array = sc.broadcast(RDD2.collect())
// function from above expects an RDD, so look the record up
// in the broadcast array instead:
val result = RDD1.map(line => (line, rdd2array.value.count(_ == line)))

Problem: this might blow up your memory. collect() is called on the driver, and all records from RDD2 will be loaded into memory on the driver node.

Idea3

Depending on the use case there are other ways to overcome this; for instance, the brute-force algorithm for Similarity Search is similar (pun not intended) to your use case. One alternative solution for that is Locality Sensitive Hashing .
