How to work on a small portion of a big data file in Spark?
I have a big data file loaded in Spark, but I want to run my analysis on only a small portion of it. Is there any way to do that? I tried repartition, but it causes a lot of shuffling. Is there a good way to process only a small chunk of a big file loaded in Spark?
In short
You can use the sample() or randomSplit() transformations on an RDD.
/**
* Return a sampled subset of this RDD.
*
* @param withReplacement can elements be sampled multiple times
* @param fraction expected size of the sample as a fraction of this RDD's size
* without replacement: probability that each element is chosen; fraction must be [0, 1]
* with replacement: expected number of times each element is chosen; fraction must be
* greater than or equal to 0
* @param seed seed for the random number generator
*
* @note This is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*/
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T]
Example:
val sampleWithoutReplacement = rdd.sample(false, 0.2, 2)
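As a fuller illustration of the one-liner above, here is a minimal sketch of sampling an RDD in a local Spark context. The object name `SampleSketch`, the helper `sampleCounts`, the `local[2]` master, and the generated numbers standing in for the big file are all assumptions made for the example. As far as I know, sample() works partition by partition, so it should not trigger the shuffle that repartition does.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleSketch {
  /** Returns (size of the full RDD, size of the roughly 20% sample). */
  def sampleCounts(): (Long, Long) = {
    // Local context for illustration only; master and app name are assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("sample-sketch")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the big file; in practice this could come from sc.textFile(...).
      val rdd = sc.parallelize(1 to 100000, numSlices = 8)

      // Keep each element with probability 0.2, without replacement, fixed seed.
      val small = rdd.sample(withReplacement = false, fraction = 0.2, seed = 2L)

      // Cache the sample so later actions on it do not rescan the full data.
      small.cache()

      (rdd.count(), small.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (full, sampled) = sampleCounts()
    println(s"full = $full, sample = $sampled")
  }
}
```

The sample size is only approximately 20% of the input, as the note in the Scaladoc says, so code that needs an exact count should not rely on it.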
/**
* Randomly splits this RDD with the provided weights.
*
* @param weights weights for splits, will be normalized if they don't sum to 1
* @param seed random seed
*
* @return split RDDs in an array
*/
def randomSplit(
    weights: Array[Double],
    seed: Long = Utils.random.nextLong): Array[RDD[T]]
Example:
val rddParts = rdd.randomSplit(Array(0.8, 0.2)) // splits the RDD into roughly an 80/20 ratio
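To show how the returned array can be used, here is a minimal sketch that splits a stand-in RDD and works only on the smaller part. The names `RandomSplitSketch` and `splitCounts`, the `local[2]` master, and the seed value are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RandomSplitSketch {
  /** Splits a stand-in RDD roughly 80/20 and returns (large count, small count). */
  def splitCounts(): (Long, Long) = {
    // Local context for illustration only; master and app name are assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("random-split-sketch")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the big file.
      val rdd = sc.parallelize(1 to 100000, numSlices = 8)

      // Weights are normalized if they do not sum to 1; a fixed seed makes the split repeatable.
      val parts = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)
      val large = parts(0)
      val small = parts(1) // the small portion to run the analysis on

      (large.count(), small.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (largeCount, smallCount) = splitCounts()
    println(s"large = $largeCount, small = $smallCount")
  }
}
```

Unlike sample(), randomSplit() gives you the remainder as well, which helps if the rest of the data is needed later, for example as a held-out set.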
You can use any of the following RDD APIs:
yourRDD.filter(on some condition)
yourRDD.sample(<with replacement>,<fraction of data>,<random seed>)
Ex: yourRDD.sample(false, 0.3, System.currentTimeMillis())
If you want a random fraction of the data, I suggest the second method (sample). If you need the part of the data that satisfies some condition, use the first one (filter).
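The two options can be compared side by side in a minimal sketch. The record layout (an id paired with a status string), the names `FilterOrSampleSketch` and `counts`, and the `local[2]` master are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterOrSampleSketch {
  /** Returns (records matching the condition, records in a roughly 30% random sample). */
  def counts(): (Long, Long) = {
    // Local context for illustration only; master and app name are assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("filter-or-sample-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical records: every tenth id is marked "error", the rest "ok".
      val records = sc.parallelize(
        (0 until 10000).map(i => (i, if (i % 10 == 0) "error" else "ok")),
        numSlices = 4)

      // First method: keep only the records that satisfy a condition.
      val errors = records.filter { case (_, status) => status == "error" }

      // Second method: keep a random fraction, regardless of content.
      val randomPart = records.sample(withReplacement = false, fraction = 0.3, seed = 7L)

      (errors.count(), randomPart.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (errorCount, sampleCount) = counts()
    println(s"matching = $errorCount, sampled = $sampleCount")
  }
}
```

Both filter() and sample() are evaluated within each partition, so neither should cause the reshuffling seen with repartition.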