
How to efficiently delete subsets in a Spark RDD

While doing some research, I find it somewhat difficult to delete all the subsets in a Spark RDD.

The data structure is RDD[(key,set)]. For example, it could be:

RDD[ ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ]

Since mike's set (Set(1,3)) is a subset of peter's (Set(1,2,3)), I want to delete "mike", which will end up with:

RDD[ ("peter",Set(1,2,3)), ("jack",Set(5)) ]

It is easy to implement locally in Python with two nested "for" loops. But when I want to scale out to the cloud with Scala and Spark, it is not that easy to find a good solution.
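For reference, this is roughly the nested-loop idea I mean, sketched with plain Scala collections (just to illustrate the logic, not the Spark solution I am after):

// naive O(n^2) check: drop an entry if its set is contained in some other entry's set
val local = List(("peter", Set(1, 2, 3)), ("mike", Set(1, 3)), ("jack", Set(5)))
val kept = local.filter { case (name, s) =>
  !local.exists { case (otherName, otherSet) => otherName != name && s.subsetOf(otherSet) }
}
// kept: List((peter,Set(1, 2, 3)), (jack,Set(5)))
// note: two entries with identical sets would both be dropped by this simple check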

Thanks

This can be achieved with the RDD.fold function.
In this case the required output is a "List" (ItemList) of superset items, so the input should also be converted to "List" (an RDD of ItemList).

import org.apache.spark.rdd.RDD

// type aliases for convenience
type Item = Tuple2[String, Set[Int]]
type ItemList = List[Item]

// Source RDD
val lst:RDD[Item] = sc.parallelize( List( ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ) )


// Wrap each element in a List. This is needed to use fold on the RDD,
// since fold requires the accumulator and the elements
// to have the same data type
val listOflst:RDD[ItemList] = lst.map(x => List(x))

// For each element in the second ItemList:
// - add it to the first ItemList only if it is not a subset of any element already there
// - remove any existing elements that are subsets of the newly added element
def combiner(first:ItemList, second:ItemList) : ItemList = {
    def helper(lst: ItemList, i:Item) : ItemList = {
        val isSubset: Boolean = lst.exists( x=> i._2.subsetOf(x._2))
        if( isSubset) lst else i :: lst.filterNot( x => x._2.subsetOf(i._2))
    }
    second.foldLeft(first)(helper)
}


listOflst.fold(List())(combiner)
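On the sample RDD above, this fold should return, up to element ordering, just the superset entries:

// expected (order may vary): List((peter,Set(1, 2, 3)), (jack,Set(5)))
// mike's Set(1,3) is absorbed because it is a subset of peter's Set(1,2,3)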

I doubt we can escape comparing each element to every other element (the equivalent of a double loop in a non-distributed algorithm). The subset relation between sets is not symmetric, meaning that we need to check both whether "alice" is a subset of "bob" and whether "bob" is a subset of "alice".

To do this using the Spark API, we can resort to multiplying the data with itself using a cartesian product and verifying each entry of the resulting matrix:

val data = Seq(("peter",Set(1,2,3)), ("mike",Set(1,3)), ("anne", Set(7)),("jack",Set(5,4,1)), ("lizza", Set(5,1)), ("bart", Set(5,4)), ("maggie", Set(5)))
// expected result from this dataset = peter, anne, jack
val userSet = sparkContext.parallelize(data)
val prod = userSet.cartesian(userSet)
val subsetMembers = prod.collect{case ((name1, set1), (name2,set2)) if  (name1 != name2) && (set2.subsetOf(set1)) && (set1 -- set2).nonEmpty => (name2, set2) }
val superset = userSet.subtract(subsetMembers)    

// let's see the results:
superset.collect()
// Array[(String, scala.collection.immutable.Set[Int])] = Array((peter,Set(1, 2, 3)), (anne,Set(7)), (jack,Set(5, 4, 1)))
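Note that cartesian produces every ordered pair of records (n² entries for n records), so while this is simple to express, it can become expensive for large RDDs.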

You can use filter after a map.

You can build a map function that returns None for the records you want to delete and the record itself otherwise. First build the function:

def filter_mike(line):
    # keep the record unless its set is exactly {1, 3} (mike's set)
    if line[1] != {1, 3}:
        return line
    else:
        return None

Then you can filter like this:

your_rdd.map(filter_mike).filter(lambda x: x is not None)

This will work.
