
Processing and splitting large files with Java 8

I'm new to Java 8 and I have just started using the NIO package for file handling. I need help processing large files, varying from 100,000 to 1,000,000 lines per file, by transforming each line into a specific format and writing the formatted lines to new files. The new files generated must each contain a maximum of 100,000 lines. So:

  • if I have a 500,000-line file for processing, I must transform those lines and distribute and print them on 5 new files.
  • if I have a 745,000-line file for processing, I must transform those lines and print them on 8 new files.
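In other words, the number of output files is the line count divided by 100,000, rounded up (for example, 745,000 / 100,000 rounds up to 8).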

I'm having a hard time figuring out an approach that will efficiently utilize the new features of Java 8. I've started by determining the number of new files to be generated based on the line count of the large file, and then creating those new empty files:

Path largeFile = Paths.get("path/to/file");
long recordCount = Files.lines(largeFile).count();
int maxRecordOfNewFiles = 100000;
int numberOfNewFiles = 1;
if (recordCount > maxRecordOfNewFiles) {
    numberOfNewFiles = Math.toIntExact(recordCount / maxRecordOfNewFiles);
    if (Math.toIntExact(recordCount % maxRecordOfNewFiles) > 0) {
        numberOfNewFiles++;
    }
}
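As an aside, the whole calculation above can be collapsed into a single ceiling division; the following sketch is equivalent for any non-empty file:

int numberOfNewFiles = Math.toIntExact(
        (recordCount + maxRecordOfNewFiles - 1) / maxRecordOfNewFiles); // rounds up on any remainder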

IntStream.rangeClosed(1, numberOfNewFiles).forEach(i -> {
    try {
        Path newFile = Paths.get("path/to/newFiles/newFile" + i + ".txt");
        Files.createFile(newFile);
    } catch (IOException iOex) {
        // handle or log the exception
    }
});

But as I go through the lines of largeFile via Files.lines(largeFile).forEach(...), I get lost on how to proceed: how to format the first 100,000 lines and print them to the first of the new files, then the second batch of 100,000 to the second new file, and so on.

Any help will be appreciated. :)

When you start conceiving batch processes, I think you should consider using a framework specialized in that area. You may want to handle restarts, scheduling... Spring Batch is very good for that and already provides what you want: MultiResourceItemWriter writes to multiple files with a maximum number of lines per file, and FlatFileItemReader reads data from a file.
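For reference, a minimal sketch of those reader/writer beans might look like the following, assuming Spring Batch is on the classpath; the resource paths and the 100,000 limit are illustrative, and the per-line transformation would typically sit in an ItemProcessor between reader and writer:

// Reads each line of the input file as a String.
FlatFileItemReader<String> reader = new FlatFileItemReader<>();
reader.setResource(new FileSystemResource("path/to/file"));
reader.setLineMapper(new PassThroughLineMapper());

// Writes each line as-is to the currently open output file.
FlatFileItemWriter<String> delegate = new FlatFileItemWriter<>();
delegate.setLineAggregator(new PassThroughLineAggregator<>());

// Rolls over to a new output file every 100,000 items.
MultiResourceItemWriter<String> writer = new MultiResourceItemWriter<>();
writer.setDelegate(delegate);
writer.setResource(new FileSystemResource("path/to/newFiles/newFile"));
writer.setItemCountLimitPerResource(100000);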


In this case, what you want is to loop over each line of an input file and write a transformation of each line to multiple output files.

One way to do that would be to create a Stream over the lines of the input file, map each line and send it to a custom writer. This custom writer would implement the logic of switching writers when it has reached the maximum number of lines per file.

In the following code, MyWriter opens a BufferedWriter to a file. When maxLines is reached (or a multiple of it), this writer is closed and another one is opened, incrementing currentFile. This way, it is transparent to the reading side that we're writing to multiple files.

public static void main(String[] args) throws IOException {
    try (
        MyWriter writer = new MyWriter(10);
        Stream<String> lines = Files.lines(Paths.get("path/to/file"));
    ) {
        lines.map(l -> /* do transformation here */ l).forEach(writer::write);
    }
}

private static class MyWriter implements AutoCloseable {

    private long count = 0, currentFile = 1, maxLines = 0;
    private BufferedWriter bw = null;

    public MyWriter(long maxLines) {
        this.maxLines = maxLines;
    }

    public void write(String line) {
        try {
            if (count % maxLines == 0) {
                close();
                bw = Files.newBufferedWriter(Paths.get("path/to/newFiles/newFile" + currentFile++ + ".txt"));
            }
            bw.write(line);
            bw.newLine();
            count++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        if (bw != null) bw.close();
    }
}
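Note that main above uses a limit of 10 lines per file just for demonstration; with the question's requirement it would be new MyWriter(100000), producing path/to/newFiles/newFile1.txt, newFile2.txt, and so on.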

From what I understand of the question, a simple way can be:

BufferedReader buff = new BufferedReader(new FileReader(new File("H:\\Docs\\log.txt")));
Pair<Integer, BufferedWriter> ans = buff.lines().reduce(new Pair<Integer, BufferedWriter>(0, null), (count, line) -> {
    try {
        BufferedWriter w;
        if (count.getKey() % 1000 == 0) {
            if (count.getValue() != null) count.getValue().close();
            w = new BufferedWriter(new FileWriter(new File("f" + count.getKey() + ".txt")));
        } else w = count.getValue();
        w.write(line + "\n"); //do something
        return new Pair<>(count.getKey() + 1, w);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}, (x, y) -> {
    throw new RuntimeException("Not supported");
});
ans.getValue().close();
buff.close();
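Note that Pair here is not a java.util class; the snippet presumably relies on javafx.util.Pair or a similar key/value holder. Also, since the combiner simply throws, this reduce only works on the sequential stream returned by BufferedReader.lines() and must not be run in parallel.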
