
Spark Reading .7z files

I am trying to read .7z files in Spark using Scala or Java. I can't find any appropriate methods or functionality.

For zip files, I am able to read them because the ZipInputStream class takes an input stream, but the SevenZFile class doesn't take any input stream: https://commons.apache.org/proper/commons-compress/javadocs/api-1.16/org/apache/commons/compress/archivers/sevenz/SevenZFile.html

Zip file code

import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

spark.sparkContext.binaryFiles("fileName").flatMap { case (name: String, content: PortableDataStream) =>
        val zis = new ZipInputStream(content.open)
        Stream.continually(zis.getNextEntry)
              .takeWhile(_ != null)
              .flatMap { _ =>
                  val br = new BufferedReader(new InputStreamReader(zis))
                  Stream.continually(br.readLine()).takeWhile(_ != null)
              }}

I tried similar code for the 7z files, something like:

spark.sparkContext.binaryFiles("fileName").flatMap{case (name: String, content: PortableDataStream) =>
        val zis = new SevenZFile(content.open)
        Stream.continually(zis.getNextEntry)
              .takeWhile(_ != null)
              .flatMap { _ =>
                  val br = new BufferedReader(new InputStreamReader(zis))
                  Stream.continually(br.readLine()).takeWhile(_ != null)
              }}

But SevenZFile doesn't accept an input stream. Looking for ideas.

If the file is on the local filesystem, the following solution works, but my file is in HDFS.

Local fileSystem Code

 public static void decompress(String in, File destination) throws IOException {
        try (SevenZFile sevenZFile = new SevenZFile(new File(in))) {
            SevenZArchiveEntry entry;
            while ((entry = sevenZFile.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                File curfile = new File(destination, entry.getName());
                File parent = curfile.getParentFile();
                if (!parent.exists()) {
                    parent.mkdirs();
                }
                byte[] content = new byte[(int) entry.getSize()];
                // read can return fewer bytes than requested, so fill the buffer in a loop
                int offset = 0;
                while (offset < content.length) {
                    int n = sevenZFile.read(content, offset, content.length - offset);
                    if (n <= 0) break;
                    offset += n;
                }
                try (FileOutputStream out = new FileOutputStream(curfile)) {
                    out.write(content, 0, offset);
                }
            }
        }
    }

After all these years of Spark evolution, there should be an easy way to do this.

Instead of using the java.io.File-based approach, you could try the SeekableByteChannel method, as shown in this alternative constructor.

You can use a SeekableInMemoryByteChannel to read a byte array. So as long as you can pick up the 7zip files from S3 or wherever and hand them off as byte arrays, you should be alright.
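A rough sketch of that approach, reusing the binaryFiles pattern from the question and assuming commons-compress 1.13+ (which introduced SeekableInMemoryByteChannel) is on the executor classpath; the path "fileName" is a placeholder and each archive is assumed small enough to hold fully in executor memory:

```scala
import org.apache.commons.compress.archivers.sevenz.SevenZFile
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel
import org.apache.spark.input.PortableDataStream
import scala.collection.mutable.ArrayBuffer

spark.sparkContext.binaryFiles("fileName").flatMap { case (name: String, content: PortableDataStream) =>
  // toArray() pulls the whole archive into memory, which SeekableInMemoryByteChannel requires
  val sevenZFile = new SevenZFile(new SeekableInMemoryByteChannel(content.toArray()))
  val lines = ArrayBuffer.empty[String]
  try {
    var entry = sevenZFile.getNextEntry
    while (entry != null) {
      if (!entry.isDirectory) {
        val bytes = new Array[Byte](entry.getSize.toInt)
        // read can return fewer bytes than asked for, so drain the entry in a loop
        var off = 0
        var n = 0
        while (off < bytes.length && n >= 0) {
          n = sevenZFile.read(bytes, off, bytes.length - off)
          if (n > 0) off += n
        }
        lines ++= new String(bytes, 0, off, "UTF-8").split("\n")
      }
      entry = sevenZFile.getNextEntry
    }
  } finally {
    sevenZFile.close()
  }
  lines
}
```

Note that unlike the streaming ZipInputStream version, this buffers each entire archive per task, which is exactly why the memory caveat below matters for large files.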

With all of that said, Spark is really not well-suited for processing things like zip and 7zip files. I can tell you from personal experience I've seen it fail badly once the files are too large for Spark's executors to handle.

Something like Apache NiFi will work much better for expanding archives and processing them. FWIW, I'm currently handling a large data dump that has me frequently dealing with 50GB tarballs that have several million files in them, and NiFi handles them very gracefully.
