
Java Read and Write Spark Vectors to HDFS

I wrote Vectors (org.apache.spark.mllib.linalg.Vector) to HDFS as follows:

public void writePointsToFile(Path path, FileSystem fs, Configuration conf,
        List<Vector> points) throws IOException {

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            Writer.file(path), Writer.keyClass(LongWritable.class),
            Writer.valueClass(Vector.class));

    long recNum = 0;

    for (Vector point : points) {
        writer.append(new LongWritable(recNum++), point);
    }
    writer.close();
}

(I'm not sure this is the right way to do it; I can't test it yet.)

Now I need to read this file back as a JavaRDD<Vector>, because I want to use it with Spark's k-means clustering, but I don't know how to do this.

Spark directly supports reading Hadoop SequenceFiles. You would do something like:

JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<LongWritable, Vector> input = 
    sc.sequenceFile(fileName, LongWritable.class, Vector.class);

You then just need to convert the JavaPairRDD<LongWritable, Vector> into a JavaRDD<Vector>:

JavaRDD<Vector> out = input.map(new Function<Tuple2<LongWritable, Vector>, Vector>() {

    @Override
    public Vector call(Tuple2<LongWritable, Vector> tuple) throws Exception {
        return tuple._2();
    }
});
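
Once you have the JavaRDD<Vector>, you can pass it straight to MLlib's k-means. A minimal sketch (the cluster count of 3 and the 20 iterations are just placeholder values; caching is optional but k-means makes several passes over the data):

import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;

// k-means iterates over the input several times, so keep it in memory.
out.cache();

// Train with 3 clusters and at most 20 iterations (placeholder values).
KMeansModel model = KMeans.train(out.rdd(), 3, 20);

// The learned cluster centers, one Vector per cluster.
Vector[] centers = model.clusterCenters();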
