
RDD Remove elements by key

I have two RDDs that are pulled in with the following code:

val fileA = sc.textFile("fileA.txt")
val fileB = sc.textFile("fileB.txt")

I then map and reduce each by key:

val countsB = fileB.flatMap(line => line.split("\n"))
  .map(word => (word, 1))
  .reduceByKey(_+_)

val countsA = fileA.flatMap(line => line.split("\n"))
  .map(word => (word, 1))
  .reduceByKey(_+_)
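
For illustration, here is a minimal sketch of the same pipeline run on an in-memory dataset (the sample words and the sc.parallelize stand-in are hypothetical, not part of the original question):

val sampleB = sc.parallelize(Seq("apple", "apple", "kiwi")) // hypothetical stand-in for fileB.txt
val sampleCounts = sampleB
  .map(word => (word, 1))
  .reduceByKey(_ + _)
// sampleCounts.collect() => Array(("apple", 2), ("kiwi", 1))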

I now want to find and remove all keys in countsB if the key exists in countsA.

I have tried something like:

countsB.keys.foreach(b => {
  if(countsB.collect().exists(_ == b)){
    countsB.collect().drop(countsB.collect().indexOf(b))
  }
})

but it doesn't seem to remove them by key.

There are 3 issues with your suggested code:

  1. You are collecting the RDDs, which means they are no longer RDDs: they are copied into the driver application's memory as plain Scala collections, so you lose Spark's parallelism and risk OutOfMemory errors if your dataset is large.

  2. When you call drop on an immutable Scala collection (or an RDD), you don't change the original collection; you get a new collection with those records dropped, so you can't expect the original collection to change (see the sketch just after this list).

  3. You cannot access an RDD within a function passed to any of the RDD higher-order methods (e.g. foreach in this case). Any function passed to these methods is serialized and sent to the workers, and RDDs are (intentionally) not serializable: it would make no sense to fetch them into driver memory, serialize them, and send them back to the workers, since the data is already distributed on the workers!
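
To illustrate the second point, here is a minimal plain-Scala sketch (the values are hypothetical):

val original = Vector(1, 2, 3)
val dropped = original.drop(1) // returns a NEW collection: Vector(2, 3)
// original is unchanged: still Vector(1, 2, 3)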

To solve all of these issues, when you want to use one RDD's data to transform or filter another one, you usually want to use some type of join. In this case you can do:

// left join, and keep only records for which there was NO match in countsA:
countsB.leftOuterJoin(countsA).collect { case (key, (valueB, None)) => (key, valueB) }

NOTE that the collect I'm using here isn't the collect you used: this one takes a PartialFunction as an argument, behaves like a combination of map and filter, and most importantly, it doesn't copy all the data into driver memory.
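
For comparison, the same result could be written with an explicit filter and map; this sketch is equivalent to the partial-function collect above:

countsB.leftOuterJoin(countsA)
  .filter { case (_, (_, matchInA)) => matchInA.isEmpty } // keep only keys with no match in countsA
  .map { case (key, (valueB, _)) => (key, valueB) }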

EDIT: as The Archetypal Paul commented, you have a much shorter and nicer option, subtractByKey:

countsB.subtractByKey(countsA)
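
For example, with hypothetical contents:

// countsA: ("apple", 3), ("pear", 1)
// countsB: ("apple", 2), ("kiwi", 5)
countsB.subtractByKey(countsA) // keeps only ("kiwi", 5): the keys of countsB not present in countsA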
