
How to get/build a JavaRDD[DataSet]?

I am using deeplearning4j and trying to train a model on Spark with

public MultiLayerNetwork fit(JavaRDD<DataSet> trainingData)

fit() needs a JavaRDD<DataSet> parameter, so I tried to build one like this:

    val totalDataset = csv.map(row => {
      val features = Array(
        row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
      )
      val labels = Array(row.getAs[String](21).toDouble)
      val featuresINDA = Nd4j.create(features)
      val labelsINDA = Nd4j.create(labels)
      new DataSet(featuresINDA, labelsINDA)
    })

but IDEA shows the error No implicit arguments of type: Encoder[DataSet].
I don't know how to solve this. I know a Spark RDD can be converted to a JavaRDD, but I don't know how to build a Spark RDD[DataSet] in the first place.
DataSet here is org.nd4j.linalg.dataset.DataSet, and its constructor is

    public DataSet(INDArray first, INDArray second) {
        this(first, second, (INDArray)null, (INDArray)null);
    }

This is my code:

    val spark: SparkSession = SparkSession
      .builder()
      .master("local")
      .appName("Spark LSTM Emotion Analysis")
      .getOrCreate()
    import spark.implicits._
    val JavaSC = JavaSparkContext.fromSparkContext(spark.sparkContext)

    val csv=spark.read.format("csv")
      .option("header","true")
      .option("sep",",")
      .load("/home/hadoop/sparkjobs/LReg/data.csv")

    val totalDataset = csv.map(row => {
      val features = Array(
        row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
      )
      val labels = Array(row.getAs[String](21).toDouble)
      val featuresINDA = Nd4j.create(features)
      val labelsINDA = Nd4j.create(labels)
      new DataSet(featuresINDA, labelsINDA)
    })

    val data = totalDataset.toJavaRDD

The deeplearning4j official guide creates a JavaRDD<DataSet> in Java like this:

String filePath = "hdfs:///your/path/some_csv_file.csv";
JavaSparkContext sc = new JavaSparkContext();
JavaRDD<String> rddString = sc.textFile(filePath);
RecordReader recordReader = new CSVRecordReader(',');
JavaRDD<List<Writable>> rddWritables = rddString.map(new StringToWritablesFunction(recordReader));

int labelIndex = 5;         //Labels: a single integer representing the class index in column number 5
int numLabelClasses = 10;   //10 classes for the label
JavaRDD<DataSet> rddDataSetClassification = rddWritables.map(new DataVecDataSetFunction(labelIndex, numLabelClasses, false));

I tried to do the same in Scala:

    val JavaSC: JavaSparkContext = new JavaSparkContext()
    val rddString: JavaRDD[String] = JavaSC.textFile("/home/hadoop/sparkjobs/LReg/hf-data.csv")
    val recordReader: CSVRecordReader = new CSVRecordReader(',')
    val rddWritables: JavaRDD[List[Writable]] = rddString.map(new StringToWritablesFunction(recordReader))
    val featureColnum = 3
    val labelColnum = 1
    val d = new DataVecDataSetFunction(featureColnum,labelColnum,true,null,null)
//    val rddDataSet: JavaRDD[DataSet] = rddWritables.map(new DataVecDataSetFunction(featureColnum,labelColnum, true,null,null))
// cannot resolve overloaded method 'map'

Debug error information:

(screenshot of the error message)

A DataSet is just a pair of INDArrays (inputs and labels). Our docs cover this in depth: https://deeplearning4j.konduit.ai/distributed-deep-learning/data-howto
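
As a minimal local illustration (the feature and label values below are invented), a DataSet is built straight from two INDArrays:

    import org.nd4j.linalg.dataset.DataSet
    import org.nd4j.linalg.factory.Nd4j

    // one example: 2 input features and 1 regression label
    val features = Nd4j.create(Array(0.5, 1.2))   // inputs
    val labels   = Nd4j.create(Array(3.7))        // labels
    val example  = new DataSet(features, labels)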

For Stack Overflow's sake, I'll summarize what's there, since there is no single way to create a data pipeline; it depends on your problem. It's very similar to how you would create a DataSet locally: generally you take whatever you do locally and put it into a Spark function.
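
For the CSV case in the question, a minimal sketch along those lines (column indices 0, 1 and 21 and the file path are taken from the question; adapt them to your schema): mapping over csv.rdd instead of the typed Dataset sidesteps the Encoder[DataSet] error, because RDD.map takes a plain Scala function, and .toJavaRDD() then gives the JavaRDD[DataSet] that fit() expects.

    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.sql.SparkSession
    import org.nd4j.linalg.dataset.DataSet
    import org.nd4j.linalg.factory.Nd4j

    val spark = SparkSession.builder()
      .master("local")
      .appName("csv-to-javardd-dataset")
      .getOrCreate()

    val csv = spark.read
      .option("header", "true")
      .csv("/home/hadoop/sparkjobs/LReg/data.csv")

    // Work on the underlying RDD[Row]: no Encoder is required here,
    // unlike Dataset.map, which needs an implicit Encoder[DataSet].
    val trainingData: JavaRDD[DataSet] = csv.rdd.map { row =>
      val features = Array(row.getAs[String](0).toDouble, row.getAs[String](1).toDouble)
      val labels   = Array(row.getAs[String](21).toDouble)
      new DataSet(Nd4j.create(features), Nd4j.create(labels))
    }.toJavaRDD()

trainingData can then be passed straight to the fit(JavaRDD[DataSet]) method shown in the question.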

CSVs and images, for example, are handled very differently, but generally you use the DataVec library for this. The docs summarize the approach for each kind of data.
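
For the DataVec route from the guide quoted above, a Scala sketch might look like this (the label column index and the regression flag are assumptions; adapt them to your data). The main trap in the Scala attempt in the question is that StringToWritablesFunction returns a java.util.List[Writable], not a Scala List, so the RDD element type has to reflect that:

    import java.util.{List => JList}

    import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
    import org.datavec.api.records.reader.impl.csv.CSVRecordReader
    import org.datavec.api.writable.Writable
    import org.datavec.spark.transform.misc.StringToWritablesFunction
    import org.deeplearning4j.spark.datavec.DataVecDataSetFunction
    import org.nd4j.linalg.dataset.DataSet

    val sc = new JavaSparkContext()
    val rddString: JavaRDD[String] = sc.textFile("/home/hadoop/sparkjobs/LReg/hf-data.csv")

    // Parse each CSV line into DataVec Writables.
    // Note the element type: java.util.List[Writable], not scala.List.
    val rddWritables: JavaRDD[JList[Writable]] =
      rddString.map(new StringToWritablesFunction(new CSVRecordReader(',')))

    // Regression with the label in column 1 (assumed); the class-count
    // argument is not used for regression, so -1 is passed.
    val rddDataSet: JavaRDD[DataSet] =
      rddWritables.map(new DataVecDataSetFunction(1, -1, true))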
