
Spark RDD take and delete rows

I have an RDD of 1000 elements. I want to take 100 elements from it and then remove those 100 from the initial RDD. I have tried several approaches but have not found one that works.

var part = dataRDD.take(100)

part is an Array[String].

I want to delete those 100 elements from the 1000 in dataRDD.

var dataRDD = dataRDD.filter(row => row != part)

The above doesn't show any error, but it doesn't remove any rows either. dataRDD still has the same 1000 rows.

Can you please advise how to get this to work?

The zipWithIndex method can be used to split the RDD:

val zipped = rdd.zipWithIndex()
val first100 = zipped.filter(_._2 < 100).keys
val remaining = zipped.filter(_._2 >= 100).keys

You can write it like this:

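// Turn the 100 taken elements back into an RDD so they can be subtracted.
val part = sc.parallelize(dataRDD.take(100))
val result = dataRDD.subtract(part)
// Print each remaining element.
result.foreach(x => println(x))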

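The RDD is printed for testing purposes only. Note that subtract removes every element equal to one in part, so any duplicates of the taken values elsewhere in the RDD are dropped as well.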

Most probably your code doesn't work as you expect because row != part is always true. Here row is a single String and part is an Array[String], so the two can never be equal. Even between two arrays, != in Scala is a reference comparison inherited from Java, not a comparison of their contents.
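To illustrate the point, here is a minimal sketch in plain Scala with no Spark dependency. A local Vector stands in for dataRDD, and the names EqualityDemo, data, unchanged, taken and remaining are made up for the example. It shows that comparing a String with an Array[String] never matches, while a membership test against a Set built from the taken elements does remove them. It assumes the elements are distinct.

```scala
object EqualityDemo {
  // Stand-in for dataRDD: a local collection of 1000 distinct strings (assumption).
  val data: Vector[String] = (1 to 1000).map(i => s"row-$i").toVector

  // Equivalent of dataRDD.take(100): an Array[String].
  val part: Array[String] = data.take(100).toArray

  // Same logic as the asker's `row != part`, written with equals
  // (for non-null values, Scala's != delegates to equals).
  // A String never equals an Array[String], so every row is kept.
  val unchanged: Vector[String] = data.filter(row => !row.equals(part))

  // Membership test against a Set of the taken elements.
  val taken: Set[String] = part.toSet
  val remaining: Vector[String] = data.filter(row => !taken.contains(row))

  def main(args: Array[String]): Unit = {
    println(unchanged.size) // 1000: nothing was removed
    println(remaining.size) // 900: the 100 taken rows are gone
  }
}
```

The same predicate, row => !taken.contains(row), should work inside dataRDD.filter, since a Set of 100 strings is small enough to be captured in the closure. Like subtract, it removes every row equal to a taken value, duplicates included.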
