
Best way to write huge number of files

I am writing a lot of files, like below.

public void call(Iterator<Tuple2<Text, BytesWritable>> arg0)
        throws Exception {
    // TODO Auto-generated method stub

    while (arg0.hasNext()) {
        Tuple2<Text, BytesWritable> tuple2 = arg0.next();
        System.out.println(tuple2._1().toString());
        PrintWriter writer = new PrintWriter("/home/suv/junk/sparkOutPut/"+tuple2._1().toString(), "UTF-8");
        writer.println(new String(tuple2._2().getBytes()));
        writer.close();
    }
}

Is there any better way to write the files, without creating and closing a PrintWriter every time?

There is no significantly better way to write lots of files. What you are doing is inherently I/O intensive.

UPDATE - @Michael Anderson is right, I think. Using multiple threads to write the files will (probably) speed things up considerably. However, the I/O is still going to be the ultimate bottleneck, in a couple of respects:

  • Creating, opening and closing files involves file & directory metadata access and update. This entails non-trivial CPU.

  • The file data and metadata changes need to be written to disc. That is possibly multiple disc writes.

  • There are at least 3 syscalls for each file written.

  • Then there are thread switching overheads.

Unless the quantity of data written to each file is significant (multiple kilobytes per file), I doubt that techniques like NIO, direct buffers, JNI and so on will be worthwhile. The real bottlenecks will be in the kernel: file system operations and low-level disk I/O.
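One mitigation for the per-write syscall count (though not for the open/close metadata cost) is plain user-space buffering: a BufferedOutputStream coalesces many small println calls into a few large writes. A minimal sketch to illustrate the idea; the class name, temp file and line count are made up for this example:

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedWriteDemo {
    // Writes n short lines through an 8 KB user-space buffer, so the
    // kernel sees a few large write() calls instead of one per line.
    // The per-file open/close metadata cost is unaffected.
    static int writeLines(int n) throws IOException {
        Path path = Files.createTempFile("demo", ".txt");
        try (PrintWriter writer = new PrintWriter(
                new BufferedOutputStream(Files.newOutputStream(path)))) {
            for (int i = 0; i < n; i++) {
                writer.println("line " + i); // buffered, usually no syscall here
            }
        } // close() flushes the remaining buffer and closes the descriptor
        int count = Files.readAllLines(path).size();
        Files.delete(path);
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeLines(1000));
    }
}
```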


... without closing or creating printwriter every time.

No. You need to create a new PrintWriter (or Writer or OutputStream) for each file.
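Since one writer per file is unavoidable, the try-with-resources form at least guarantees each writer gets closed even when a write throws. A small sketch; the directory name and helper method are illustrative, not from the question:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;

public class PerFileWriter {
    // One PrintWriter per file; try-with-resources guarantees close()
    // runs even if println throws, so no file descriptors leak.
    static void writeFile(Path dir, String name, String content) throws IOException {
        try (PrintWriter writer = new PrintWriter(
                Files.newBufferedWriter(dir.resolve(name)))) {
            writer.println(content);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sparkOutPut");
        writeFile(dir, "part-0001", "hello");
        System.out.println(Files.readAllLines(dir.resolve("part-0001")).get(0));
    }
}
```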

However, this ...

  writer.println(new String(tuple2._2().getBytes()));

... looks rather peculiar. You appear to be:

  • calling getBytes() on a String (?),
  • converting the byte array to a String
  • calling the println() method on the String, which will copy it and then convert it back into bytes before finally outputting them.

What gives? What is the point of the String -> bytes -> String conversion?

I'd just do this:

  writer.println(tuple2._2());

This should be faster, though I wouldn't expect the percentage speed-up to be that large.
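If the payload really is binary, the cleanest fix is to skip the String round trip entirely and write the bytes straight to an OutputStream. With Hadoop's BytesWritable one would pass getBytes() together with getLength(), because getBytes() returns a padded backing array; the sketch below uses a plain byte[] as a stand-in:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RawByteWrite {
    // Writes payload bytes directly: no bytes -> String -> bytes round
    // trip through a charset decoder and encoder.
    static void writeRaw(Path target, byte[] payload, int length) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            out.write(payload, 0, length); // only the valid prefix
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("raw", ".bin");
        byte[] data = "payload".getBytes("UTF-8");
        writeRaw(path, data, data.length);
        System.out.println(Files.size(path)); // 7
    }
}
```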

I'm assuming you're after the fastest way. Because everyone knows fastest is best ;)

One simple way is to use a bunch of threads to do your writing for you. However, you're not going to get much benefit by doing this unless your filesystem scales well. (I use this technique on Lustre-based cluster systems, and in cases where "lots of files" could mean 10k - in this case many of the writes will be going to different servers / disks.)

The code would look something like this. (Note: I think this version is not quite right, as for small numbers of files it fills the work queue - but see the next version for the better approach anyway...)

public void call(Iterator<Tuple2<Text, BytesWritable>> arg0) throws Exception {
    int nThreads=5;
    ExecutorService threadPool = Executors.newFixedThreadPool(nThreads);
    ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(threadPool);

    int nJobs = 0;

    while (arg0.hasNext()) {
        ++nJobs;
        final Tuple2<Text, BytesWritable> tuple2 = arg0.next();
        ecs.submit(new Callable<Void>() {
          @Override public Void call() throws Exception {
             System.out.println(tuple2._1().toString());
             String path = "/home/suv/junk/sparkOutPut/"+tuple2._1().toString();
             try(PrintWriter writer = new PrintWriter(path, "UTF-8") ) {
               writer.println(new String(tuple2._2().getBytes()));
             }
             return null;
          }
       });
    }
    for(int i=0; i<nJobs; ++i) {
       ecs.take().get();
    }
    threadPool.shutdown();
}

Better yet is to start writing your files as soon as you have data for the first one, not when you've got data for all of them - and for this writing to not block the calculation thread(s).

To do this you split your application into several pieces communicating over a (thread safe) queue.

Code then ends up looking more like this:

public void main() {
  SomeMultithreadedQueue<Data> queue = ...;

  int nGeneratorThreads=1;
  int nWriterThreads=5;
  int nThreads = nGeneratorThreads + nWriterThreads;

  ExecutorService threadPool = Executors.newFixedThreadPool(nThreads);
  ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(threadPool);

  AtomicInteger completedGenerators = new AtomicInteger(0);

  // Start some generator threads.
  for(int i=0; i<nGeneratorThreads; ++i) {
    ecs.submit( () -> { 
      while(...) { 
        Data d = ... ;
        queue.push(d);
      }
      if(completedGenerators.incrementAndGet()==nGeneratorThreads) {
        for(int j=0; j<nWriterThreads; ++j) {
          queue.push(null);  // one end-of-data marker per writer thread
        }
      }
      return null;
   });
  }

  // Start some writer threads
  for(int i=0; i<nWriterThreads; ++i) {
    ecs.submit( () -> { 
      Data d
      while((d = queue.take())!=null) {
        String path = data.path();
        try(PrintWriter writer = new PrintWriter(path, "UTF-8") ) {
           writer.println(new String(data.getBytes()));
        }
        return null;
      }
    });
  }

  for(int i=0; i<nThreads; ++i) {
    ecs.take().get();
  }
  threadPool.shutdown();
}

Note I've not provided an implementation of the queue class; you can easily wrap the standard Java thread-safe ones to get what you need.
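For reference, one way to get the push/take-with-null protocol used above is a thin wrapper over java.util.concurrent.LinkedBlockingQueue, which itself rejects nulls, so a private sentinel object stands in for the end marker. A sketch, not a tuned implementation:

```java
import java.util.concurrent.LinkedBlockingQueue;

public class SomeMultithreadedQueue<T> {
    // LinkedBlockingQueue rejects nulls, so a private sentinel stands
    // in for the "no more data" marker used by the writer threads.
    private static final Object END = new Object();
    private final LinkedBlockingQueue<Object> queue = new LinkedBlockingQueue<>();

    public void push(T item) throws InterruptedException {
        queue.put(item == null ? END : item);
    }

    @SuppressWarnings("unchecked")
    public T take() throws InterruptedException {
        Object item = queue.take();
        return item == END ? null : (T) item;
    }
}
```

Remember that each consumer thread needs to see its own end marker: either push one null per writer thread, or have take() re-post the sentinel before returning null.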

There's still lots more that can be done to reduce latency, etc. Here are some of the further things I've used to get the times down ...

  1. don't even wait for all the data to be generated for a given file. Pass another queue containing packets of bytes to write.

  2. Watch out for allocations - you can reuse some of your buffers.

  3. There's some latency in the NIO stuff - you can get some performance improvements by using C writes via JNI and direct buffers.

  4. Thread switching can hurt, and the latency in the queues can hurt, so you might want to batch up your data slightly. Balancing this with 1 can be tricky.
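Point 4 can be as simple as accumulating records into fixed-size lists before they cross the queue, so each queue operation and thread wake-up carries many records. A sketch; the Batcher name, batch size and Consumer sink are stand-ins for whatever queue you actually use:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class Batcher<T> {
    // Accumulates items and hands them to the sink (e.g. queue::push)
    // in fixed-size batches, trading a little latency for fewer queue
    // operations and thread wake-ups.
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private List<T> current = new ArrayList<>();

    public Batcher(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    public void add(T item) {
        current.add(item);
        if (current.size() >= batchSize) flush();
    }

    public void flush() { // call once more when input is exhausted
        if (!current.isEmpty()) {
            sink.accept(current);
            current = new ArrayList<>();
        }
    }
}
```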
