Fastest way to write an array of integers to a file in Java?

As the title says, I'm looking for the fastest possible way to write integer arrays to files. The arrays will vary in size, and will realistically contain anywhere between 2,500 and 25,000,000 ints.

Here's the code I'm presently using:

DataOutputStream writer = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(filename)));

for (int d : data)
  writer.writeInt(d);

Given that DataOutputStream has a method for writing arrays of bytes, I've tried converting the int array to a byte array like this:

private static byte[] integersToBytes(int[] values) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    for (int i = 0; i < values.length; ++i) {
        dos.writeInt(values[i]);
    }

    return baos.toByteArray();
}

and like this:

private static byte[] integersToBytes2(int[] src) {
    int srcLength = src.length;
    byte[] dst = new byte[srcLength << 2];

    for (int i = 0; i < srcLength; i++) {
        int x = src[i];
        int j = i << 2;
        dst[j++] = (byte) ((x >>> 0) & 0xff);
        dst[j++] = (byte) ((x >>> 8) & 0xff);
        dst[j++] = (byte) ((x >>> 16) & 0xff);
        dst[j++] = (byte) ((x >>> 24) & 0xff);
    }
    return dst;
}

Both seem to give a minor speed increase, about 5%. I've not tested them rigorously enough to confirm that.
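For comparison, the same conversion can be sketched with ByteBuffer, which makes the byte order explicit. This matters because DataOutputStream.writeInt emits big-endian bytes, whereas integersToBytes2 above emits the low byte first (little-endian), so the two helpers do not produce the same file. The class and method names below are illustrative and not part of the original post.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class IntBytes {
    // Hypothetical helper: converts ints to bytes in the requested byte order.
    static byte[] toBytes(int[] values, ByteOrder order) {
        // multiplyExact fails loudly instead of overflowing for very large arrays.
        ByteBuffer buf = ByteBuffer.allocate(Math.multiplyExact(values.length, 4)).order(order);
        // The int view inherits the byte order that was set on the byte buffer.
        buf.asIntBuffer().put(values);
        // A heap buffer created by allocate() exposes its backing array directly.
        return buf.array();
    }
}
```

Passing ByteOrder.BIG_ENDIAN should give the same bytes as the DataOutputStream-based helper; ByteOrder.LITTLE_ENDIAN should match integersToBytes2.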

Are there any techniques that will speed up this file write operation, or relevant guides to best practice for Java IO write performance?

I had a look at three options:

  1. Using DataOutputStream;
  2. Using ObjectOutputStream (for Serializable objects, which int[] is); and
  3. Using FileChannel.

The results are:

DataOutputStream wrote 1,000,000 ints in 3,159.716 ms
ObjectOutputStream wrote 1,000,000 ints in 295.602 ms
FileChannel wrote 1,000,000 ints in 110.094 ms

So the NIO version is the fastest. It also has the advantage of allowing edits, meaning you can easily change one int, whereas the ObjectOutputStream would require reading the entire array, modifying it and writing it out to file.
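To illustrate the in-place edit, here is a minimal sketch that overwrites a single int in a file of raw 4-byte big-endian ints (the layout that ByteBuffer.putInt produces by default). The class name, method name and parameters are hypothetical, and the file is assumed to exist already.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class IntFilePatcher {
    // Hypothetical helper: overwrites the int at position `index` without touching the rest.
    static void setInt(Path file, long index, int value) throws IOException {
        // WRITE without CREATE or TRUNCATE_EXISTING keeps the existing contents.
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.allocate(4);
            buf.putInt(value); // big-endian by default
            buf.flip();
            long pos = index * 4;
            // Positional write: does not move the channel's own position.
            while (buf.hasRemaining()) {
                pos += fc.write(buf, pos);
            }
        }
    }
}
```

For example, IntFilePatcher.setInt(path, 1, 99) would replace only the second int in the file.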

Code follows:

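import java.io.Closeable;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;

private static final int NUM_INTS = 1000000;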

interface IntWriter {
  void write(int[] ints);
}

public static void main(String[] args) {
  int[] ints = new int[NUM_INTS];
  Random r = new Random();
  for (int i=0; i<NUM_INTS; i++) {
    ints[i] = r.nextInt();
  }
  time("DataOutputStream", new IntWriter() {
    public void write(int[] ints) {
      storeDO(ints);
    }
  }, ints);
  time("ObjectOutputStream", new IntWriter() {
    public void write(int[] ints) {
      storeOO(ints);
    }
  }, ints);
  time("FileChannel", new IntWriter() {
    public void write(int[] ints) {
      storeFC(ints);
    }
  }, ints);
}

private static void time(String name, IntWriter writer, int[] ints) {
  long start = System.nanoTime();
  writer.write(ints);
  long end = System.nanoTime();
  double ms = (end - start) / 1000000d;
  System.out.printf("%s wrote %,d ints in %,.3f ms%n", name, ints.length, ms);
}

private static void storeOO(int[] ints) {
  ObjectOutputStream out = null;
  try {
    out = new ObjectOutputStream(new FileOutputStream("object.out"));
    out.writeObject(ints);
  } catch (IOException e) {
    throw new RuntimeException(e);
  } finally {
    safeClose(out);
  }
}

private static void storeDO(int[] ints) {
  DataOutputStream out = null;
  try {
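    // Note: this stream is deliberately left unbuffered, as in the timing above
    out = new DataOutputStream(new FileOutputStream("data.out"));
    for (int anInt : ints) {
      // writeInt emits all four bytes; write(int) would emit only the low byte
      out.writeInt(anInt);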
    }
  } catch (IOException e) {
    throw new RuntimeException(e);
  } finally {
    safeClose(out);
  }
}

private static void storeFC(int[] ints) {
  RandomAccessFile out = null;
  try {
    // map(READ_WRITE, ...) needs a channel opened for both reading and writing;
    // the channel of a FileOutputStream is write-only and throws NonReadableChannelException
    out = new RandomAccessFile("fc.out", "rw");
    FileChannel file = out.getChannel();
    ByteBuffer buf = file.map(FileChannel.MapMode.READ_WRITE, 0, 4L * ints.length);
    for (int i : ints) {
      buf.putInt(i);
    }
    file.close();
  } catch (IOException e) {
    throw new RuntimeException(e);
  } finally {
    safeClose(out);
  }
}

// Accepts any Closeable so it works for both streams and RandomAccessFile
private static void safeClose(Closeable out) {
  try {
    if (out != null) {
      out.close();
    }
  } catch (IOException e) {
    // do nothing
  }
}

I would use FileChannel from the nio package together with ByteBuffer. On my computer this approach seems to give 2 to 4 times better write performance:

Output from program (times in milliseconds):

normal time: 2555
faster time: 765

This is the program:

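import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;

public class Test {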

    public static void main(String[] args) throws IOException {

        // create a test buffer
        ByteBuffer buffer = createBuffer();

        long start = System.currentTimeMillis();
        {
            // do the first test (the normal way of writing files)
            normalToFile(new File("first"), buffer.asIntBuffer());
        }
        long middle = System.currentTimeMillis(); 
        {
            // use the faster nio stuff
            fasterToFile(new File("second"), buffer);
        }
        long done = System.currentTimeMillis();

        // print the result
        System.out.println("normal time: " + (middle - start));
        System.out.println("faster time: " + (done - middle));
    }

    private static void fasterToFile(File file, ByteBuffer buffer) 
    throws IOException {

        FileChannel fc = null;

        try {

            fc = new FileOutputStream(file).getChannel();
            // write() is not guaranteed to drain the buffer in one call
            while (buffer.hasRemaining())
                fc.write(buffer);

        } finally {

            if (fc != null)
                fc.close();

            buffer.rewind();
        }
    }

    private static void normalToFile(File file, IntBuffer buffer) 
    throws IOException {

        DataOutputStream writer = null;

        try {
            writer = 
                new DataOutputStream(new BufferedOutputStream(
                        new FileOutputStream(file)));

            while (buffer.hasRemaining())
                writer.writeInt(buffer.get());

        } finally {
            if (writer != null)
                writer.close();

            buffer.rewind();
        }
    }

    private static ByteBuffer createBuffer() {
        ByteBuffer buffer = ByteBuffer.allocate(4 * 25000000);
        Random r = new Random(1);

        while (buffer.hasRemaining()) 
            buffer.putInt(r.nextInt());

        buffer.rewind();

        return buffer;
    }
}

I think you should consider using file channels (the java.nio library) instead of plain streams (java.io). A good starting point is this interesting discussion, "Java NIO FileChannel versus FileOutputstream performance / usefulness", and the relevant comments below it.

Cheers!

The main improvements you can make when writing an int[] are to either:

  • Increase the buffer size. The default size is right for most streams, but file access can be faster with a larger buffer. This could yield a 10-20% improvement.

  • Use NIO and a direct buffer. This allows you to write 32-bit values without converting to bytes. This may yield a 5% improvement.
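Both suggestions can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class name, the method names, and the buffer and chunk sizes are assumptions, and any speed-up would need to be measured on the target machine.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class BufferedIntWriters {
    // Option 1: same stream API, but with an explicit (larger) buffer size in bytes.
    static void writeBuffered(Path file, int[] data, int bufferBytes) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file), bufferBytes))) {
            for (int d : data) {
                out.writeInt(d);
            }
        }
    }

    // Option 2: copy ints into a reusable direct buffer chunk by chunk and write each chunk.
    static void writeDirect(Path file, int[] data, int chunkInts) throws IOException {
        ByteBuffer bytes = ByteBuffer.allocateDirect(chunkInts * 4);
        IntBuffer ints = bytes.asIntBuffer(); // int view over the same memory
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int off = 0; off < data.length; off += chunkInts) {
                int n = Math.min(chunkInts, data.length - off);
                ints.clear();
                ints.put(data, off, n);
                // The int view tracks its own position, so set the byte bounds by hand.
                bytes.position(0);
                bytes.limit(n * 4);
                while (bytes.hasRemaining()) {
                    fc.write(bytes);
                }
            }
        }
    }
}
```

Both methods write big-endian ints, so the resulting files should be byte-for-byte identical and readable with DataInputStream.readInt.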

BTW: You should be able to write at least 10 million int values per second. With disk caching you can increase this to 200 million per second.

Benchmarks should be repeated every once in a while, shouldn't they? :) After fixing some bugs and adding my own writing variant, here are the results I get when running the benchmark on an ASUS ZenBook UX305 running Windows 10 (times given in seconds):

Running tests... 0 1 2
Buffered DataOutputStream           8,14      8,46      8,30
FileChannel alt2                    1,55      1,18      1,12
ObjectOutputStream                  9,60     10,41     11,68
FileChannel                         1,49      1,20      1,21
FileChannel alt                     5,49      4,58      4,66

And here are the results running on the same computer but with Arch Linux and the order of the write methods switched:

Running tests... 0 1 2
Buffered DataOutputStream          31,16      6,29      7,26
FileChannel                         1,07      0,83      0,82
FileChannel alt2                    1,25      1,71      1,42
ObjectOutputStream                  3,47      5,39      4,40
FileChannel alt                     2,70      3,27      3,46

Each test wrote an 800 MB file. The unbuffered DataOutputStream took way too long, so I excluded it from the benchmark.

As seen, writing using a file channel still beats the crap out of all other methods, but it matters a lot whether the byte buffer is memory-mapped or not. Without memory-mapping, the file channel write took 3-5 seconds:

var bb = ByteBuffer.allocate(4 * ints.length);
for (int i : ints)
    bb.putInt(i);
bb.flip();
try (var fc = new FileOutputStream("fcalt.out").getChannel()) {
    fc.write(bb);
}

With memory-mapping, the time was reduced to between 0.8 and 1.5 seconds:

try (var fc = new RandomAccessFile("fcalt2.out", "rw").getChannel()) {
    var bb = fc.map(READ_WRITE, 0, 4 * ints.length);
    bb.asIntBuffer().put(ints);
}

But note that the results are order-dependent, especially on Linux. It appears that the memory-mapped methods don't write the data in full but rather offload the job to the OS and return before it is completed. Whether that behaviour is desirable or not depends on the situation.
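If the data has to be on the storage device before the method returns, MappedByteBuffer.force() can be called after filling the buffer. The sketch below is a variant of the memory-mapped write with that call added; the class and method names are hypothetical, and timings measured this way would include the flush, so they are not comparable with the numbers above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class MappedIntWriter {
    // Hypothetical helper: memory-mapped write that waits for the flush to complete.
    static void writeAndForce(String fileName, int[] ints) throws IOException {
        long byteCount = 4L * ints.length;
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "rw");
             FileChannel fc = raf.getChannel()) {
            // Mapping never shrinks a file, so trim any longer pre-existing content first.
            raf.setLength(byteCount);
            MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_WRITE, 0, byteCount);
            bb.asIntBuffer().put(ints);
            // Ask the OS to write the mapped pages to the storage device now.
            bb.force();
        }
    }
}
```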

Memory-mapping can also lead to OutOfMemory problems, so it is not always the right tool to use. See the question "Prevent OutOfMemory when using java.nio.MappedByteBuffer".
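One way to limit the size of any single mapping is to map the file in fixed-size windows rather than all at once. This is a sketch under that assumption and is not taken from the linked question; the class name, method name and window size are hypothetical. Mapped regions are only released when their buffers are garbage collected, so this bounds each mapping but does not guarantee that the problem goes away.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class WindowedMappedWriter {
    // Hypothetical helper: writes ints through a series of small mappings.
    static void write(Path file, int[] ints, int windowInts) throws IOException {
        try (FileChannel fc = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int off = 0; off < ints.length; off += windowInts) {
                int n = Math.min(windowInts, ints.length - off);
                // A READ_WRITE mapping past the current end of file extends the file.
                MappedByteBuffer window =
                        fc.map(FileChannel.MapMode.READ_WRITE, 4L * off, 4L * n);
                window.asIntBuffer().put(ints, off, n);
            }
        }
    }
}
```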

Here is my version of the benchmark code: https://gist.github.com/bjourne/53b7eabc6edea27ffb042e7816b7830b

An array is Serializable, so can't you just use writer.writeObject(data);? That's definitely going to be faster than individual writeInt calls.

If you have requirements on the output data format other than retrieval into an int[], that's a different question.
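A minimal round-trip sketch of that idea, assuming the only goal is to get the same int[] back in Java. The class and method names are illustrative. The resulting file uses Java's serialization format, so it is not a plain sequence of 4-byte ints.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class IntArraySerialization {
    // Writes the whole array with a single writeObject call.
    static void save(Path file, int[] data) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(data);
        }
    }

    // Reads the array back; the cast fails if the file holds a different type.
    static int[] load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (int[]) in.readObject();
        }
    }
}
```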
