Writing a sequence file from a local image to HDFS using Java and Spark
As the title says, that is my objective now.
The reason I'm using Spark is scalability (I have thousands of files to process and will have a cluster of workers available), and because I'm thinking of implementing a Spark Streaming receiver on the image directory, so that the files will be processed automatically. Here is my initial code:
JavaPairRDD<String, String> imageRDD = jsc.wholeTextFiles("file:///home/cloudera/Pictures/");

imageRDD.mapToPair(new PairFunction<Tuple2<String, String>, Text, Text>() {
    @Override
    public Tuple2<Text, Text> call(Tuple2<String, String> arg0) throws Exception {
        return new Tuple2<Text, Text>(new Text(arg0._1), new Text(arg0._2));
    }
}).saveAsNewAPIHadoopFile("hdfs://localhost:8020/user/hdfs/sparkling/try.seq",
        Text.class, Text.class, SequenceFileOutputFormat.class);
Here I load an image as a text file and create a tuple using the Text type from the Hadoop library. This works, but:
I've tried to load the files with sparkContext.binaryFiles(<directory>), but I'm always lost as to how to extract the info and how to save the files.
I can't seem to find the answer on the internet: does anybody know something about this?
Here is how I did it:
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(SOURCE_PATH);
if (!imageByteRDD.isEmpty())
    imageByteRDD.foreachPartition(new VoidFunction<Iterator<Tuple2<String, PortableDataStream>>>() {
        @Override
        public void call(Iterator<Tuple2<String, PortableDataStream>> arg0) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", HDFS_PATH);
            while (arg0.hasNext()) {
                Tuple2<String, PortableDataStream> fileTuple = arg0.next();
                Text key = new Text(fileTuple._1());
                // Last path segment, e.g. "picture.jpg", split into base name and extension.
                // (Note: the extension must be taken from the full file name; splitting the
                // already-dotless base name by DOT_REGEX would just return the base name again.)
                String[] pathParts = key.toString().split(SEP_PATH);
                String fullName = pathParts[pathParts.length - 1];
                String[] nameParts = fullName.split(DOT_REGEX);
                String fileName = nameParts[0];
                String fileExtension = nameParts[nameParts.length - 1];
                BytesWritable value = new BytesWritable(fileTuple._2().toArray());
                // One sequence file per image, RECORD-compressed with BZip2
                SequenceFile.Writer writer = SequenceFile.createWriter(
                        conf,
                        SequenceFile.Writer.file(new Path(DEST_PATH + fileName + SEP_KEY + getCurrentTimeStamp() + DOT + fileExtension)),
                        SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, new BZip2Codec()),
                        SequenceFile.Writer.keyClass(Text.class),
                        SequenceFile.Writer.valueClass(BytesWritable.class));
                // Key: parentDirectory_fileName_extension
                key = new Text(pathParts[pathParts.length - 2] + SEP_KEY + fileName + SEP_KEY + fileExtension);
                writer.append(key, value);
                IOUtils.closeStream(writer);
            }
        }
    });
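The key and file-name construction above hinges on splitting the path by a separator and the file name by a dot. Here is a minimal, self-contained sketch of just that parsing step, runnable without Spark or Hadoop; the constant values and the sample path are assumptions for illustration (in the real code SEP_PATH and DOT_REGEX come from the surrounding class):

```java
// Standalone sketch of the base-name/extension parsing used in the answer.
// SEP_PATH, DOT_REGEX and the example path are hypothetical values.
public class FileNameParts {
    static final String SEP_PATH = "/";
    static final String DOT_REGEX = "\\.";

    // Returns {baseName, extension} for a path like "/home/user/pics/cat.jpg"
    static String[] split(String path) {
        String[] segments = path.split(SEP_PATH);
        String fullName = segments[segments.length - 1];   // "cat.jpg"
        String[] parts = fullName.split(DOT_REGEX);
        String baseName = parts[0];                        // "cat"
        String extension = parts[parts.length - 1];        // "jpg"
        return new String[] { baseName, extension };
    }

    public static void main(String[] args) {
        String[] parts = split("/home/cloudera/Pictures/cat.jpg");
        System.out.println(parts[0] + " " + parts[1]); // prints "cat jpg"
    }
}
```

Note that if the file name contains no dot at all, both elements come back equal to the whole name, so in production you may want an explicit check before trusting the extension.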