Java Read and Write Spark Vectors to HDFS
I wrote Vectors (org.apache.spark.mllib.linalg.Vector) to HDFS as follows:
public void writePointsToFile(Path path, FileSystem fs, Configuration conf,
        List<Vector> points) throws IOException {
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            Writer.file(path), Writer.keyClass(LongWritable.class),
            Writer.valueClass(Vector.class));
    long recNum = 0;
    for (Vector point : points) {
        writer.append(new LongWritable(recNum++), point);
    }
    writer.close();
}
(I'm not sure this is the right way to do it; I can't test it yet.)
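One caveat worth noting: SequenceFile values are normally expected to be Hadoop Writables, and org.apache.spark.mllib.linalg.Vector is only java.io.Serializable. A minimal sketch of one way the write above could be made to work, assuming Java serialization is enabled in the Configuration before the writer is created:

// Assumption: enable JavaSerialization so the Serializable (but not
// Writable) Vector values can be appended; WritableSerialization is
// kept for the LongWritable keys.
conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization",
        "org.apache.hadoop.io.serializer.JavaSerialization");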
Now I need to read this file back as a JavaRDD<Vector>, because I want to use it with Spark's k-means clustering, but I don't know how to do this.
Spark directly supports reading Hadoop SequenceFiles. You would do something like:
JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<LongWritable, Vector> input =
        sc.sequenceFile(fileName, LongWritable.class, Vector.class);
You then just need to convert the JavaPairRDD<LongWritable, Vector> into a JavaRDD<Vector>:
JavaRDD<Vector> out = input.map(new Function<Tuple2<LongWritable, Vector>, Vector>() {
    @Override
    public Vector call(Tuple2<LongWritable, Vector> tuple) throws Exception {
        // Keep only the value; the LongWritable record number is discarded.
        return tuple._2();
    }
});
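Since the goal is k-means, the resulting JavaRDD<Vector> can be fed straight into MLlib's KMeans. A minimal sketch (k and numIterations are placeholder values, not taken from the question):

import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;

int k = 3;               // assumed number of clusters
int numIterations = 20;  // assumed iteration cap

out.cache();  // k-means makes multiple passes over the data
KMeansModel model = KMeans.train(out.rdd(), k, numIterations);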