
Writing a sequence file from a local image to HDFS using Java and Spark

As the title says, that is my objective now.

  • I need to load a bunch of non-text files from a directory
  • extract the usual file information from them (creation date, author, type... those ones)
  • create a sequence file of the appropriate key/value type
  • put the freshly extracted info in the key of the .seq file
  • store all of them in an HDFS directory.

The reason I'm using Spark is scalability (I have thousands of files to process and a cluster of workers will be available), and because I'm thinking of implementing a Spark Streaming receiver on the image directory, so that the files are processed automatically. Here is my initial code:

JavaPairRDD<String, String> imageRDD = jsc.wholeTextFiles("file:///home/cloudera/Pictures/");

    imageRDD.mapToPair(new PairFunction<Tuple2<String,String>, Text, Text>() {

        @Override
        public Tuple2<Text, Text> call(Tuple2<String, String> arg0)
                throws Exception {
            return new Tuple2<Text, Text>(new Text(arg0._1),new Text(arg0._2));
        }

    }).saveAsNewAPIHadoopFile("hdfs://localhost:8020/user/hdfs/sparkling/try.seq", Text.class, Text.class, SequenceFileOutputFormat.class);

Here I load an image as a text file and create a tuple with the Text type from the Hadoop library. This works, but:

  1. The file isn't saved as a single file, but as a folder containing the partitions.
  2. It isn't an array of bytes, but a text representation of the file. We all know how nagging it can be to convert back from text to an image (or whatever it is).
  3. If I load the files like this, will there be a way to extract the required information?
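On point 3: wholeTextFiles only hands you the path string and the decoded contents, but the usual file attributes (creation date, size, owner) can still be read from the path itself with java.nio.file, as long as the file is reachable on the local filesystem. A minimal sketch in plain Java (the temp file here is just a stand-in for a real image path):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

public class FileInfo {

    /** Reads the usual file attributes (creation time, size, owner) for a local path. */
    public static String describe(Path path) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        FileTime created = attrs.creationTime();
        String owner = Files.getOwner(path).getName();
        return path.getFileName() + ": " + attrs.size() + " bytes, created "
                + created + ", owner " + owner;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an image path; substitute the paths coming out of the RDD
        Path p = Files.createTempFile("example", ".png");
        Files.write(p, new byte[]{(byte) 0x89, 'P', 'N', 'G'});
        System.out.println(describe(p));
        Files.delete(p);
    }
}
```

Note that these attributes live on the filesystem, not in the RDD: once only the path string and the contents are left inside Spark, the extraction has to happen on a machine where that path is still reachable.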

I've tried to load the files with sparkContext.binaryFiles(<directory>), but I'm always lost as to how to extract the info and how to save the files.
I can't seem to find the answer on the internet: does anybody know something about this?

Here is how I did this:

JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(SOURCE_PATH);
if (!imageByteRDD.isEmpty())
    imageByteRDD.foreachPartition(new VoidFunction<Iterator<Tuple2<String, PortableDataStream>>>() {

        @Override
        public void call(Iterator<Tuple2<String, PortableDataStream>> arg0)
                throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", HDFS_PATH);
            while (arg0.hasNext()) {
                Tuple2<String, PortableDataStream> fileTuple = arg0.next();
                Text key = new Text(fileTuple._1());

                // Split the path once, then derive base name and extension
                // from the last segment (the original nested splits computed
                // the extension from the already-extension-less base name)
                String[] pathParts = key.toString().split(SEP_PATH);
                String fullName = pathParts[pathParts.length - 1];
                String[] nameParts = fullName.split(DOT_REGEX);
                String fileName = nameParts[0];
                String fileExtension = nameParts[nameParts.length - 1];

                // The raw file contents become the value of the record
                BytesWritable value = new BytesWritable(fileTuple._2().toArray());

                // One sequence file per input file, written directly to HDFS
                SequenceFile.Writer writer = SequenceFile.createWriter(
                        conf,
                        SequenceFile.Writer.file(new Path(DEST_PATH + fileName + SEP_KEY + getCurrentTimeStamp() + DOT + fileExtension)),
                        SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, new BZip2Codec()),
                        SequenceFile.Writer.keyClass(Text.class),
                        SequenceFile.Writer.valueClass(BytesWritable.class));

                // Key: parent directory + base name + extension
                key = new Text(pathParts[pathParts.length - 2] + SEP_KEY + fileName + SEP_KEY + fileExtension);
                writer.append(key, value);
                IOUtils.closeStream(writer);
            }
        }
    });
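The path-parsing logic is the fiddly part of the loop above, so it helps to pull it into a small helper that can be tested on its own. This is a sketch under the assumption that SEP_PATH is "/" and DOT_REGEX is "\\." (the constants themselves are not shown in the original):

```java
public class PathParts {
    // Assumed values for the constants used in the answer above
    static final String SEP_PATH = "/";
    static final String DOT_REGEX = "\\.";

    /** Base name of the last path segment, without its extension. */
    public static String baseName(String path) {
        String[] segments = path.split(SEP_PATH);
        String fullName = segments[segments.length - 1];
        return fullName.split(DOT_REGEX)[0];
    }

    /** Extension of the last path segment (the name itself if there is none). */
    public static String extension(String path) {
        String[] segments = path.split(SEP_PATH);
        String[] parts = segments[segments.length - 1].split(DOT_REGEX);
        return parts[parts.length - 1];
    }

    public static void main(String[] args) {
        String p = "file:/home/cloudera/Pictures/photo.jpg";
        System.out.println(baseName(p) + " / " + extension(p)); // photo / jpg
    }
}
```

Keeping the split results in local variables, as done here and in the loop above, also avoids re-running the same regex split several times per file.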
