Stratified sampling with Spark and Java
I'd like to make sure I'm training on a stratified sample of my data.

It seems this is supported by Spark 2.1 and earlier versions via JavaPairRDD.sampleByKey(...) and JavaPairRDD.sampleByKeyExact(...), as explained here.
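For reference, those two PairRDD calls look roughly like this (a minimal sketch, assuming a JavaPairRDD<Double, Row> keyed by the label; the 0.5 fractions and the seed are just placeholders I picked):

Map<Double, Double> fractions = new HashMap<>();
fractions.put(0.0, 0.5); // keep ~50% of class 0.0
fractions.put(1.0, 0.5); // keep ~50% of class 1.0

// Approximate per-class sampling (single pass over the data):
JavaPairRDD<Double, Row> approx = pairRdd.sampleByKey(false, fractions, 42L);

// Exact per-class sampling (requires an additional pass over the data):
JavaPairRDD<Double, Row> exact = pairRdd.sampleByKeyExact(false, fractions, 42L);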
But: My data is stored in a Dataset<Row>, not a JavaPairRDD. The first column is the label, all others are features (imported from a libsvm-formatted file).

What's the easiest way to get a stratified sample of my Dataset instance and end up with a Dataset<Row> again?
In a way this question is related to Dealing with unbalanced datasets in Spark MLlib. That possible duplicate does not mention Dataset<Row> at all, nor is it in Java, so it does not answer my question.
OK, since the answer to the question linked here was not actually intended for Java, I have rewritten it in Java. The reasoning is still the same: we are still using sampleByKeyExact. There is no out-of-the-box miracle feature for now (Spark 2.1.0).

So here you go:
package org.awesomespark.examples;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.*;
import scala.Tuple2;

import java.util.Map;

public class StratifiedDatasets {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Stratified Datasets")
                .getOrCreate();

        // Load the libsvm file: column 0 is the label, column 1 the feature vector.
        Dataset<Row> data = spark.read().format("libsvm").load("sample_libsvm_data.txt");

        // Key each Row by its label so we can sample per class.
        JavaPairRDD<Double, Row> rdd = data.toJavaRDD().keyBy(x -> x.getDouble(0));

        // Build the per-class sampling fractions: here 80% of every class.
        Map<Double, Double> fractions = rdd.map(Tuple2::_1)
                .distinct()
                .mapToPair((PairFunction<Double, Double, Double>) (Double x) -> new Tuple2<>(x, 0.8))
                .collectAsMap();

        // Exact stratified sampling without replacement, seed = 2.
        JavaRDD<Row> sampledRDD = rdd.sampleByKeyExact(false, fractions, 2L).values();

        // Back to a Dataset<Row> with the original schema.
        Dataset<Row> sampledData = spark.createDataFrame(sampledRDD, data.schema());

        sampledData.show();
        sampledData.printSchema();
    }
}
Now let's package and submit:
$ sbt package
[...]
// [success] Total time: 2 s, completed Jan 16, 2017 1:45:51 PM
$ spark-submit --class org.awesomespark.examples.StratifiedDatasets target/scala-2.10/java-stratified-dataset_2.10-1.0.jar
[...]
// +-----+--------------------+
// |label| features|
// +-----+--------------------+
// | 0.0|(692,[127,128,129...|
// | 1.0|(692,[158,159,160...|
// | 1.0|(692,[124,125,126...|
// | 1.0|(692,[152,153,154...|
// | 1.0|(692,[151,152,153...|
// | 0.0|(692,[129,130,131...|
// | 1.0|(692,[99,100,101,...|
// | 0.0|(692,[154,155,156...|
// | 0.0|(692,[127,128,129...|
// | 1.0|(692,[154,155,156...|
// | 0.0|(692,[151,152,153...|
// | 1.0|(692,[129,130,131...|
// | 0.0|(692,[154,155,156...|
// | 1.0|(692,[150,151,152...|
// | 0.0|(692,[124,125,126...|
// | 0.0|(692,[152,153,154...|
// | 1.0|(692,[97,98,99,12...|
// | 1.0|(692,[124,125,126...|
// | 1.0|(692,[156,157,158...|
// | 1.0|(692,[127,128,129...|
// +-----+--------------------+
// only showing top 20 rows
// root
// |-- label: double (nullable = true)
// |-- features: vector (nullable = true)
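Side note: if an approximate per-class fraction is enough for you (the same guarantee as sampleByKey rather than sampleByKeyExact), I believe you can also stay entirely in the Dataset API via DataFrameStatFunctions.sampleBy and skip the RDD round trip. A minimal sketch, assuming the label column from the schema above and the same 0.8 fractions:

Map<Double, Double> fractions = new HashMap<>();
fractions.put(0.0, 0.8);
fractions.put(1.0, 0.8);

// Approximate stratified sampling; the result is already a Dataset<Row>.
Dataset<Row> approxSample = data.stat().sampleBy("label", fractions, 2L);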
For Python users, you can also check my answer Stratified sampling with pyspark.