
Convert a JavaRDD String to JavaRDD Vector

I am trying to load a CSV file as a JavaRDD<String> and then get the data into a JavaRDD<Vector>.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;

import breeze.collection.mutable.SparseArray;
import scala.collection.immutable.Seq;

import java.util.List;

public class Trial {
    public void start() throws InstantiationException, IllegalAccessException,
    ClassNotFoundException {

        run();
    }


    private void run() {
        SparkConf conf = new SparkConf().setAppName("csvparser");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
        JavaRDD<Vector> datamain = data.flatMap(null);
        MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());

        System.out.println(mat.mean());
    }

    private List<Vector> Seq(Vector dv) {
        // TODO Auto-generated method stub
        return null;
    }


    public static void main(String[] args) throws Exception {

        Trial trial = new Trial();
        trial.start();
    }
}

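The program runs without any errors, but when I try to run it on a Spark machine I get no output. Can anyone tell me whether my conversion from a String RDD to a Vector RDD is correct?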

My CSV file contains only a single column of floating-point numbers.

The null passed to the flatMap call may be the problem:

JavaRDD<Vector> datamain = data.flatMap(null);

I solved my problem by changing the code to this:

JavaRDD<Vector> datamain = data.map(new Function<String, Vector>() {
    public Vector call(String s) {
        // textFile yields one line per record, so for a single-column
        // file this split produces exactly one element.
        String[] sarray = s.trim().split("\\r?\\n");
        double[] values = new double[sarray.length];
        for (int i = 0; i < sarray.length; i++) {
            values[i] = Double.parseDouble(sarray[i]);
            System.out.println(values[i]); // debug output
        }
        return Vectors.dense(values);
    }
});
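
If the file ever grows beyond one column, the per-line parsing could be pulled into a small helper that splits on commas. The following is only a sketch: CsvLineParser and parseLine are hypothetical names, not part of Spark or of the code above, and the sketch assumes every line contains only comma-separated numbers (no header row, no blank lines, no quoted fields).

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Spark.
final class CsvLineParser {
    private CsvLineParser() {}

    // Parses one CSV line such as "1.0,2.5" into a double[].
    // Assumes the line holds only comma-separated numbers; a blank line
    // or a header row would throw NumberFormatException.
    static double[] parseLine(String line) {
        String[] fields = line.trim().split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseLine("1.0")));           // [1.0]
        System.out.println(Arrays.toString(parseLine(" 1.0, 2.5 ,3 "))); // [1.0, 2.5, 3.0]
    }
}
```

Inside the Spark job, the helper's result could then be wrapped with Vectors.dense(double[]), for example data.map(s -> Vectors.dense(CsvLineParser.parseLine(s))). Blank lines or a header would need to be filtered out before this step.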

Assuming your trial.csv file looks like this:

1.0
2.0
3.0

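With Java 8, the original code in the question needs only one line changed, the flatMap(null) call. The snippet below also sets the master to "local" so that it runs on a single machine: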

SparkConf conf = new SparkConf().setAppName("csvparser").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
JavaRDD<Vector> datamain = data.map(s -> Vectors.dense(Double.parseDouble(s)));
MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());

System.out.println(mat.mean());

This prints the mean of the single column, 2.0 (displayed as the one-element vector [2.0]).
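
To sanity-check that number without a Spark installation, the column-wise mean can be reproduced in plain Java. This is only a sketch of the arithmetic behind the mean reported for the sample file, not Spark's implementation; ColumnMeanCheck and columnMeans are hypothetical names, and the sketch assumes a non-empty input in which all rows have the same length.

```java
import java.util.Arrays;

// Hypothetical stand-alone check; does not use Spark.
final class ColumnMeanCheck {
    private ColumnMeanCheck() {}

    // Returns the mean of each column.
    // Assumes rows is non-empty and every row has the same length.
    static double[] columnMeans(double[][] rows) {
        int cols = rows[0].length;
        double[] means = new double[cols];
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                means[j] += row[j];
            }
        }
        for (int j = 0; j < cols; j++) {
            means[j] /= rows.length;
        }
        return means;
    }

    public static void main(String[] args) {
        // The three rows of the sample trial.csv shown above.
        double[][] rows = {{1.0}, {2.0}, {3.0}};
        System.out.println(Arrays.toString(columnMeans(rows))); // [2.0]
    }
}
```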
