
How to work on a small portion of a big data file in Spark?

I have a big data file loaded in Spark, but I want to run my analysis on only a small portion of it. Is there a way to do that? I tried repartition, but it causes a lot of shuffling. Is there a good way to process only a small chunk of a big file loaded in Spark?

In short

You can use the sample() or randomSplit() transformations on an RDD.

sample()

/**
  * Return a sampled subset of this RDD.
  *
  * @param withReplacement can elements be sampled multiple times
  * @param fraction expected size of the sample as a fraction of this RDD's size
  *  without replacement: probability that each element is chosen; fraction must be [0, 1]
  *  with replacement: expected number of times each element is chosen; fraction must be 
  *  greater than or equal to 0
  * @param seed seed for the random number generator
  *
  * @note This is NOT guaranteed to provide exactly the fraction of the count
  * of the given [[RDD]].
  */

  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T]

Example:

val sampleWithoutReplacement = rdd.sample(false, 0.2, 2)
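
As a slightly fuller illustration of the call above, here is a minimal sketch that runs against a local session. The object name `SampleSketch`, the helper `smallPortion`, the local master setting, and the generated integer data are all assumptions made for the example; they stand in for your own application and dataset.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SampleSketch {
  // Hypothetical helper: keeps each element with probability `fraction`.
  // sample() works partition by partition, so it does not trigger a shuffle.
  def smallPortion(rdd: RDD[Int], fraction: Double, seed: Long): RDD[Int] =
    rdd.sample(withReplacement = false, fraction = fraction, seed = seed)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("sample-sketch")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for the big dataset.
    val rdd = spark.sparkContext.parallelize(1 to 100000)

    // Cache the sample if the analysis reads it more than once.
    val sampled = smallPortion(rdd, 0.2, 2L).cache()

    // The size is close to 20% of the input, but not guaranteed to be exact.
    println(s"kept ${sampled.count()} of ${rdd.count()} elements")

    spark.stop()
  }
}
```

Passing a fixed seed, as here, keeps the sample repeatable across runs on the same input, which helps when you are iterating on an analysis.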

randomSplit()

/**
  * Randomly splits this RDD with the provided weights.
  *
  * @param weights weights for splits, will be normalized if they don't sum to 1
  * @param seed random seed
  *
  * @return split RDDs in an array
  */

def randomSplit(
   weights: Array[Double],
   seed: Long = Utils.random.nextLong): Array[RDD[T]]

Example:

val rddParts = rdd.randomSplit(Array(0.8, 0.2)) // splits the RDD into roughly an 80-20 ratio
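
Since randomSplit() returns an array, it can be convenient to destructure it and keep working with only the smaller part. The following is a sketch under assumptions: the object name `RandomSplitSketch`, the helper `split8020`, the seed value, and the generated data are illustrative and not part of the original answer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RandomSplitSketch {
  // Hypothetical helper: returns (large, small) parts in roughly an 80-20 ratio.
  def split8020(rdd: RDD[Int], seed: Long): (RDD[Int], RDD[Int]) = {
    // Destructure the two-element array returned by randomSplit().
    val Array(large, small) = rdd.randomSplit(Array(0.8, 0.2), seed)
    (large, small)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("random-split-sketch")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for the big dataset.
    val rdd = spark.sparkContext.parallelize(1 to 100000)

    val (large, small) = split8020(rdd, 42L)

    // Run the analysis on the small part only; cache it if it is reused.
    small.cache()
    println(s"small part: ${small.count()}, large part: ${large.count()}")

    spark.stop()
  }
}
```

Each part is a lazily evaluated RDD, so nothing is computed for the large part unless you run an action on it.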

You can use either of the following RDD APIs:

  1. yourRDD.filter(on some condition)
  2. yourRDD.sample(<with replacement>,<fraction of data>,<random seed>)

Example: yourRDD.sample(false, 0.3, System.currentTimeMillis())

If you want a random fraction of the data, I suggest the second method. If you need the part of the data that satisfies some condition, use the first.
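
For the condition-based option, a sketch might look like the following. It assumes, purely for illustration, that each record is a comma-separated line whose first field is a region code; the object name `FilterSketch`, the helper `onlyRegion`, and the sample lines are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object FilterSketch {
  // Hypothetical helper: keeps only the lines whose first CSV field equals `region`.
  def onlyRegion(lines: RDD[String], region: String): RDD[String] =
    lines.filter(line => line.split(",")(0) == region)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("filter-sketch")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for the big file; real data would come from sc.textFile(...).
    val lines = spark.sparkContext.parallelize(Seq("eu,10", "us,5", "eu,7"))

    // filter() is evaluated partition by partition, without a shuffle.
    val euOnly = onlyRegion(lines, "eu")
    euOnly.collect().foreach(println)

    spark.stop()
  }
}
```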


 