Scala Spark RDD: rdd.take() only accepts an Int

Question

val fileContent = sc.textFile(path)
val x = fileContent.count() / 2
fileContent.take(x) // error: take expects an Int, but x is a Long

x is a Long. I can call `x.toInt`, but what if x is too large to be converted to an Int?

How can I get the second half of the RDD?
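On the `x.toInt` concern: a minimal sketch in plain Scala (no Spark needed) of what happens when a Long does not fit in an Int, and one way to detect it. The object name `LongToIntSketch` and the helper `safeToInt` are hypothetical, introduced only for illustration.

```scala
object LongToIntSketch {
  // Returns Some(n) only when the Long fits in an Int; None otherwise.
  def safeToInt(x: Long): Option[Int] =
    if (x.isValidInt) Some(x.toInt) else None

  def main(args: Array[String]): Unit = {
    val tooBig = Int.MaxValue.toLong + 1L // one past the largest Int

    // toInt does not fail: it silently keeps the low 32 bits.
    println(tooBig.toInt == Int.MinValue) // prints "true"

    println(safeToInt(tooBig)) // prints "None"
    println(safeToInt(1000L))  // prints "Some(1000)"
  }
}
```

So `toInt` would not crash on a huge count; it would quietly pass the wrong number to take, which is why checking the range first is the safer habit.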

Answer 1

If you want all the elements, you can use the collect method on the RDD.

If you specifically want the first half, and that half already holds more than Int.MaxValue elements, you could, as others suggested, filter out the half you don't need: transform the RDD into another RDD with fewer items and call collect on that. Like this:

  val sizeOfRdd = fileContent.count()
  fileContent
    .zipWithIndex() // pair every element with its Long index, starting at 0
    .filter(_._2 < sizeOfRdd / 2) // keep the first half (use >= for the second half)
    .map(_._1) // drop the index
    .collect() // bring every remaining element to the driver

Note that both take and collect move the elements of the dataset to the driver, where you might run into memory issues if the RDD holds a lot of elements (which I assume it does).
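Since the question asks for the second half, and collecting it may not fit in driver memory, one option is to keep the result as an RDD and never convert the count to an Int. This is a minimal sketch, assuming a local SparkSession and a small in-memory RDD standing in for `sc.textFile(path)`; the names `SecondHalfSketch` and `secondHalf` are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SecondHalfSketch {
  // Keeps the elements whose index is at or past the midpoint.
  // The result stays distributed; nothing is moved to the driver here.
  def secondHalf(lines: RDD[String]): RDD[String] = {
    val half = lines.count() / 2 // a Long, so no Int conversion is needed
    lines
      .zipWithIndex()                              // RDD[(String, Long)]
      .filter { case (_, index) => index >= half } // drop the first half
      .map { case (line, _) => line }              // drop the index
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used only to make the sketch runnable.
    val spark = SparkSession.builder()
      .appName("second-half-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for sc.textFile(path).
    val fileContent = sc.parallelize((1 to 10).map(_.toString))

    val rest = secondHalf(fileContent)
    rest.take(3).foreach(println) // a small Int-sized peek is still fine

    spark.stop()
  }
}
```

From there the half can be written out with saveAsTextFile or processed further as an RDD. If the elements really do have to reach the driver, toLocalIterator fetches one partition at a time, so the driver only needs room for the largest partition rather than for the whole half.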

Answer 2

You can use randomSplit() to split the RDD into an array of smaller RDDs, and then call take() on each of those RDDs in a loop.
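A rough sketch of that idea, assuming a local SparkSession and a small in-memory RDD in place of the real file; the names `RandomSplitSketch`, `takeFromEachPiece`, `pieces` and `perPiece` are hypothetical. One caveat worth stating: randomSplit assigns elements to pieces at random, so the pieces are not contiguous ranges and their sizes are only approximately equal. It makes each piece small enough for take, but it does not by itself yield "the second half" in file order.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RandomSplitSketch {
  // Splits `lines` into `pieces` random, roughly equal RDDs and takes up to
  // `perPiece` elements from each one.
  def takeFromEachPiece(lines: RDD[String], pieces: Int, perPiece: Int): Array[Array[String]] = {
    val weights = Array.fill(pieces)(1.0) // equal weights; Spark normalizes them
    lines
      .randomSplit(weights, seed = 42L)   // Array[RDD[String]]
      .map(_.take(perPiece))              // take still needs an Int, per piece
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used only to make the sketch runnable.
    val spark = SparkSession.builder()
      .appName("random-split-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for sc.textFile(path).
    val fileContent = sc.parallelize((1 to 100).map(_.toString))

    takeFromEachPiece(fileContent, pieces = 4, perPiece = 5)
      .foreach(piece => println(piece.mkString(", ")))

    spark.stop()
  }
}
```

Each take call still moves its results to the driver, so the memory note in Answer 1 applies to every piece.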


Answer 1: 2 votes, accepted, 2016-03-18 12:46:54
Answer 2: 1 vote, 2016-03-18 10:35:03
