
Reading a list of Files as a Java 8 Stream

I have a (possibly long) list of binary files that I want to read lazily. There will be too many files to load into memory. I'm currently reading them as a MappedByteBuffer with FileChannel.map(), but that probably isn't required. I want the method readBinaryFiles(...) to return a Java 8 Stream so I can lazy load the list of files as I access them.

public List<FileDataMetaData> readBinaryFiles(
        List<File> files,
        int numDataPoints,
        int dataPacketSize) throws IOException {

    List<FileDataMetaData> fmdList = new ArrayList<FileDataMetaData>();

    IOException lastException = null;
    for (File f: files) {

        try {
            FileDataMetaData fmd = readRawFile(f, numDataPoints, dataPacketSize);
            fmdList.add(fmd);
        } catch (IOException e) {
            logger.error("", e);
            lastException = e;
        }
    }

    if (null != lastException)
        throw lastException;

    return fmdList;
}


//  The List<DataPacket> returned will be in the same order as in the file.
public FileDataMetaData readRawFile(File file, int numDataPoints, int dataPacketSize) throws IOException {

    FileDataMetaData fmd;
    FileChannel fileChannel = null;
    try {
        fileChannel = new RandomAccessFile(file, "r").getChannel();
        long fileSz = fileChannel.size();
        ByteBuffer bbRead = ByteBuffer.allocate((int) fileSz);
        MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileSz);

        buffer.get(bbRead.array());
        List<DataPacket> dataPacketList = new ArrayList<DataPacket>();

        while (bbRead.hasRemaining()) {

            int channelId = bbRead.getInt();
            long timestamp = bbRead.getLong();
            int[] data = new int[numDataPoints];
            for (int i=0; i<numDataPoints; i++) 
                data[i] = bbRead.getInt();

            DataPacket dp = new DataPacket(channelId, timestamp, data);
            dataPacketList.add(dp);
        }

        fmd = new FileDataMetaData(file.getCanonicalPath(), fileSz, dataPacketList);

    } catch (IOException e) {
        logger.error("", e);
        throw e;
    } finally {
        if (null != fileChannel) {
            try {
                fileChannel.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    return fmd;
}

Returning fmdList.stream() from readBinaryFiles(...) won't accomplish this because the file contents will already have been read into memory, which I won't be able to do.

The other approaches to reading the contents of multiple files as a Stream rely on using Files.lines(), but I need to read binary files.

I'm open to doing this in Scala or golang if those languages have better support for this use case than Java.

I'd appreciate any pointers on how to read the contents of multiple binary files lazily.

There is no laziness possible for the reading within a file, as you are reading the entire file to construct an instance of FileDataMetaData. You would need a substantial refactoring of that class to be able to construct an instance of FileDataMetaData without having to read the entire file.

However, there are several things to clean up in that code, some specific to Java 7 rather than Java 8: you don't need a RandomAccessFile detour to open a channel anymore, and there is try-with-resources to ensure proper closing. Note further that your usage of memory mapping makes no sense. When you copy the entire contents into a heap ByteBuffer after mapping the file, there is nothing lazy about it. It's exactly the same as what happens when calling read with a heap ByteBuffer on a channel, except that the JRE can reuse buffers in the read case.

In order to allow the system to manage the pages, you have to read from the mapped byte buffer. Depending on the system, this might still not be better than repeatedly reading small chunks into a heap byte buffer.

public FileDataMetaData readRawFile(
    File file, int numDataPoints, int dataPacketSize) throws IOException {

    try (FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ)) {
        long fileSz = fileChannel.size();
        MappedByteBuffer bbRead = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileSz);
        List<DataPacket> dataPacketList = new ArrayList<>();
        while(bbRead.hasRemaining()) {
            int channelId = bbRead.getInt();
            long timestamp = bbRead.getLong();
            int[] data = new int[numDataPoints];
            for (int i=0; i<numDataPoints; i++) 
                data[i] = bbRead.getInt();
            dataPacketList.add(new DataPacket(channelId, timestamp, data));
        }
        return new FileDataMetaData(file.getCanonicalPath(), fileSz, dataPacketList);
    } catch (IOException e) {
        logger.error("", e);
        throw e;
    }
}

Building a Stream based on this method is straightforward; only the checked exception has to be handled:

public Stream<FileDataMetaData> readBinaryFiles(
    List<File> files, int numDataPoints, int dataPacketSize) throws IOException {
    return files.stream().map(f -> {
        try {
            return readRawFile(f, numDataPoints, dataPacketSize);
        } catch (IOException e) {
            logger.error("", e);
            throw new UncheckedIOException(e);
        }
    });
}
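Because map() is evaluated lazily, a short-circuiting terminal operation stops file processing early. A runnable sketch of that behavior, using a counting stand-in in place of readRawFile (the names here are illustrative, not from the original code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class ShortCircuitDemo {
    public static void main(String[] args) {
        AtomicInteger reads = new AtomicInteger();
        List<String> files = Arrays.asList("a.bin", "b.bin", "c.bin");
        // Stand-in for readRawFile: count how many "files" are actually touched.
        Optional<String> first = files.stream()
            .map(f -> { reads.incrementAndGet(); return "meta:" + f; })
            .findFirst();
        // findFirst() short-circuits, so only the first file is ever "read".
        System.out.println(first.get() + " after " + reads.get() + " read(s)");
    }
}
```

On a sequential stream, the mapping function runs once per element demanded by the terminal operation, so the remaining files are never opened.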

I don't know how performant this is, but you can use java.io.SequenceInputStream wrapped inside of a DataInputStream. This will effectively concatenate your files together. If you create a BufferedInputStream from each file, then the whole thing should be properly buffered.
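A minimal sketch of that suggestion (the helper name openAll and the demo files are assumptions, not from the original post): each file gets its own BufferedInputStream, SequenceInputStream concatenates them, and a single DataInputStream reads across file boundaries:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

public class ConcatReader {
    // Concatenate all files into one lazily-opened stream; each file is buffered individually.
    static DataInputStream openAll(List<File> files) throws FileNotFoundException {
        Vector<InputStream> streams = new Vector<>();
        for (File f : files)
            streams.add(new BufferedInputStream(new FileInputStream(f)));
        return new DataInputStream(new SequenceInputStream(streams.elements()));
    }

    public static void main(String[] args) throws IOException {
        // Demo: two small files, each holding one int, read back as one stream.
        File a = File.createTempFile("seq", ".bin");
        File b = File.createTempFile("seq", ".bin");
        a.deleteOnExit();
        b.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(a))) { out.writeInt(1); }
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(b))) { out.writeInt(2); }
        try (DataInputStream in = openAll(Arrays.asList(a, b))) {
            System.out.println(in.readInt() + "," + in.readInt());
        }
    }
}
```

Note that this sketch opens every FileInputStream up front; a fully lazy variant would supply an Enumeration that opens each file only when SequenceInputStream asks for it.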

Building on VGR's comment, I think his basic solution of:

return files.stream().map(f -> readRawFile(f, numDataPoints, dataPacketSize))

is correct, in that it will lazily process the files (and stop if a short-circuiting terminal action is invoked off the result of the map() operation). I would also suggest a slightly different implementation of readRawFile that leverages try-with-resources and InputStream, which will not load the whole file into memory:

public FileDataMetaData readRawFile(File file, int numDataPoints, int dataPacketSize)
    throws DataPacketReadException { // <- Custom unchecked exception, nested class

  FileDataMetaData results = null;

  try (FileInputStream fileInput = new FileInputStream(file)) {
    String filePath = file.getCanonicalPath();
    long fileSize = fileInput.getChannel().size();

    DataInputStream dataInput = new DataInputStream(new BufferedInputStream(fileInput));

    results = new FileDataMetaData(
      filePath,
      fileSize,
      dataPacketsFrom(dataInput, numDataPoints, dataPacketSize, filePath));
  } catch (IOException e) {
    throw new DataPacketReadException("I/O error on file: " + file, e);
  }

  return results;
}

private List<DataPacket> dataPacketsFrom(
    DataInputStream dataInput, int numDataPoints, int dataPacketSize, String filePath)
    throws DataPacketReadException {

  List<DataPacket> packets = new ArrayList<>();
  try {
    while (dataInput.available() > 0) {
      // Logic to assemble DataPacket
    }
  }
  catch (EOFException e) {
    throw new DataPacketReadException("Unexpected EOF on file: " + filePath, e);
  }
  catch (IOException e) {
    throw new DataPacketReadException("Unexpected I/O exception on file: " + filePath, e);
  }

  return packets;
}

This should reduce the amount of code, and make sure that your files get closed on error.
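For illustration, here is one possible body for the elided "Logic to assemble DataPacket" step, assuming the record layout implied by the question (an int channel id, a long timestamp, then numDataPoints ints, all big-endian) and a minimal stand-in DataPacket class:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class PacketReader {
    // Minimal stand-in for DataPacket, matching the layout read in the question's code.
    static class DataPacket {
        final int channelId;
        final long timestamp;
        final int[] data;
        DataPacket(int channelId, long timestamp, int[] data) {
            this.channelId = channelId;
            this.timestamp = timestamp;
            this.data = data;
        }
    }

    // One packet per call; DataInputStream reads big-endian, matching ByteBuffer's default order.
    static DataPacket readPacket(DataInputStream in, int numDataPoints) throws IOException {
        int channelId = in.readInt();
        long timestamp = in.readLong();
        int[] data = new int[numDataPoints];
        for (int i = 0; i < numDataPoints; i++)
            data[i] = in.readInt();
        return new DataPacket(channelId, timestamp, data);
    }

    public static void main(String[] args) throws IOException {
        // Round-trip one packet through a byte array to check the layout.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(7);      // channel id
        out.writeLong(123L);  // timestamp
        out.writeInt(10);     // data[0]
        out.writeInt(20);     // data[1]
        DataPacket p = readPacket(
            new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), 2);
        System.out.println(p.channelId + " " + p.timestamp + " " + p.data[0] + " " + p.data[1]);
    }
}
```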

This should be sufficient:

return files.stream().map(f -> readRawFile(f, numDataPoints, dataPacketSize));

…if, that is, you are willing to remove throws IOException from the readRawFile method's signature. You could have that method catch IOException internally and wrap it in an UncheckedIOException. (The problem with deferred execution is that the exceptions also need to be deferred.)
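A small self-contained sketch of that deferral, with readOne as a stand-in for readRawFile (an assumption, not the original method): the wrapped exception only surfaces once the stream is actually consumed by a terminal operation:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class LazyStreamDemo {
    // Stand-in for readRawFile: fails for a "bad" file name.
    static String readOne(String name) throws IOException {
        if (name.startsWith("bad")) throw new IOException("cannot read " + name);
        return "meta:" + name;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.bin", "b.bin", "bad.bin");
        Stream<String> lazy = files.stream().map(f -> {
            try {
                return readOne(f);
            } catch (IOException e) {
                throw new UncheckedIOException(e); // deferred until the stream is consumed
            }
        });
        // Nothing has been read yet; the exception appears only at the terminal operation.
        try {
            lazy.forEach(System.out::println);
        } catch (UncheckedIOException e) {
            System.out.println("caught: " + e.getCause().getMessage());
        }
    }
}
```

The caller can unwrap getCause() to recover the original IOException if the distinction matters.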
