
Java: Read 10,000 Excel files and write to 1 master file using Apache POI

I searched on Google but couldn't find a proper answer to the problem described below. Pardon me if this is a duplicate question.

So, coming to the question: I have to read multiple Excel files in Java and generate one final Excel report file from them.

There are 2 folders:

  1. Source folder: It contains multiple Excel files (probably 10,000 files).

  2. Destination folder: This folder will contain one final master Excel file after all the files from the Source folder have been read.

For each Excel file read from the Source folder, the master file in the Destination folder will contain one row.

I am planning to use Apache POI to read and write the Excel files in Java.

I know it's easy to read and write files in Java using POI, but my question is: in this scenario, where there are almost 10,000 files to read and combine into a single master file, what is the best approach, considering the time taken and the CPU used by the program? Reading one file at a time will be too time-consuming.

So, I am planning to use threads to process the files in batches of, say, 100 files at a time. Can anybody please point me to some resources or suggest how to proceed with this requirement?

Edited:

I have already written the program to read and write the files using POI. The code is shown below:

    // Loop through the directory, fetching each file.
    File sourceDir = new File("SourceFolder");
    System.out.println("The current directory is = " + sourceDir);

    if (sourceDir.exists()) {
        if (sourceDir.isDirectory()) {
            String[] filesInsideThisDir = sourceDir.list();
            numberOfFiles = filesInsideThisDir.length;
            for (String filename : filesInsideThisDir) {
                System.out.println("(processFiles) The file name to read is = " + filename);

                // Read each file
                readExcelFile(filename);

                // Write the data
                writeMasterReport();
            }
        } else {
            System.out.println("(processFiles) Source directory specified is not a directory.");
        }
    } else {
        System.out.println("(processFiles) Source directory does not exist.");
    }

Here, the SourceFolder contains all the Excel files to read. I am looping through this folder, fetching one file at a time, reading its contents, and then writing to one master Excel file.

The readExcelFile() method reads each Excel file and creates a List containing the data for the row to be written to the master Excel file.

The writeMasterReport() method writes the data read from each Excel file.
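A rough sketch of how these two methods could fit together is below. The POI-specific calls are replaced by plain placeholders here so the sketch is self-contained; in the real code, readExcelFile() would open the workbook (e.g. with POI's WorkbookFactory.create()), iterate the rows of a sheet, and extract the needed cells, while writeMasterReport() would append a Row to the master sheet. The class and method shapes are assumptions, not the asker's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class MasterReport {
    // Rows accumulated from all source files; each file contributes one row.
    private final List<List<String>> masterRows = new ArrayList<>();

    // Stand-in for the POI-based reader: the real version would open the
    // workbook and pull out whatever cells the report needs.
    List<String> readExcelFile(String filename) {
        List<String> row = new ArrayList<>();
        row.add(filename);        // hypothetical: first cell = source file name
        row.add("summary-value"); // hypothetical: extracted data
        return row;
    }

    // Stand-in for the POI-based writer: the real version would create a Row
    // in the master sheet and fill its cells from the list.
    void writeMasterReport(List<String> row) {
        masterRows.add(row);
    }

    int rowCount() {
        return masterRows.size();
    }
}
```

Passing the row List from the reader straight into the writer (instead of sharing it through a field) keeps the data flow explicit, which matters later if the reads are moved onto threads.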

The program runs fine. My question is: is there any way I can optimize this code by using threads for reading the files? I know that there is only one file to write to, and that part cannot be done in parallel. If the SourceFolder contains 10,000 files, reading and writing this way will take a very long time.

The size of each input file will be around a few hundred KB.

So, my question is: can we use threads to read the files in batches of, say, 100 or 500 files per thread, and then write the data for each thread? I know the write part will need to be synchronized. That way, at least the read and write time would be minimized. Please let me know your thoughts on this.

With 10k files of ~100 KB each, we're talking about reading roughly 1 GB of data. If the processing is not overly complex (and it seems it isn't), then your bottleneck will be I/O.

So it most probably does not make sense to parallelize reading and processing the files, as I/O throughput has an upper limit.
Parallelizing would have made sense if the processing were complex enough to be the bottleneck; that does not seem to be the case here.
