
Java: Read 10,000 excel files and write to 1 master file in Java using Apache POI

I searched on Google but couldn't find a proper answer to the problem described below. Pardon me if this is a duplicate question, but I couldn't find a suitable answer anywhere.

So, coming to the question: I have to read multiple Excel files in Java and generate one final Excel report file from them.

There are 2 folders:

  1. Source folder: It contains multiple Excel files (probably 10,000 files).

  2. Destination folder: This folder will hold one final master Excel file after all the files from the Source folder have been read.

For each Excel file read from the Source folder, the master file in the Destination folder will contain one row.

I am planning to use Apache POI to read and write the Excel files in Java.

I know it's easy to read and write files in Java using POI, but my question is: given this scenario, where almost 10,000 files must be read and written into one single master file, what would be the best approach, considering the time taken and the CPU used by the program? Reading one file at a time will be too time-consuming.

So, I am planning to use threads to process the files in batches of, say, 100 files at a time. Can anybody please point me to some resources or suggest how to proceed with this requirement?

Edited:

I have already written a program that reads and writes the files using POI. The code is shown below:

    // Loop through the source directory, fetching each file.
    File sourceDir = new File("SourceFolder");
    System.out.println("The current directory is = " + sourceDir);

    if (sourceDir.exists()) {
        if (sourceDir.isDirectory()) {
            String[] filesInsideThisDir = sourceDir.list();
            numberOfFiles = filesInsideThisDir.length;
            for (String filename : filesInsideThisDir) {
                System.out.println("(processFiles) The file name to read is = " + filename);

                // Read each file
                readExcelFile(filename);

                // Write the data
                writeMasterReport();
            }
        } else {
            System.out.println("(processFiles) Source directory specified is not a directory.");
        }
    } else {
        System.out.println("(processFiles) Source directory does not exist.");
    }

Here, the SourceFolder contains all the Excel files to read. I loop through this folder, fetching one file at a time, reading its contents, and then writing to one master Excel file.

The readExcelFile() method reads each Excel file and creates a List containing the data for the row to be written to the master Excel file.

The writeMasterReport() method writes the data read from each Excel file to the master file.
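For reference, here is a minimal sketch of what those two methods might look like with POI. The class and field names (MasterReportBuilder, masterWorkbook, currentRowData) and the assumption that the relevant data sits in the first row of each file's first sheet are illustrative; the original post does not show these details:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.poi.ss.usermodel.*;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;

    public class MasterReportBuilder {

        // Hypothetical fields; the original post does not show them.
        private final Workbook masterWorkbook = new XSSFWorkbook();
        private final Sheet masterSheet = masterWorkbook.createSheet("Report");
        private final DataFormatter formatter = new DataFormatter();
        private List<String> currentRowData;
        private int nextRow = 0;

        // Reads the first row of the first sheet of one source file.
        void readExcelFile(String filename) throws IOException {
            try (Workbook wb = WorkbookFactory.create(new File("SourceFolder", filename))) {
                Row firstRow = wb.getSheetAt(0).getRow(0); // assumes the row exists
                currentRowData = new ArrayList<>();
                for (Cell cell : firstRow) {
                    currentRowData.add(formatter.formatCellValue(cell));
                }
            }
        }

        // Appends the data just read as one new row in the master sheet.
        void writeMasterReport() {
            Row row = masterSheet.createRow(nextRow++);
            for (int i = 0; i < currentRowData.size(); i++) {
                row.createCell(i).setCellValue(currentRowData.get(i));
            }
        }

        // Called once after all source files have been processed.
        void saveMasterReport() throws IOException {
            try (FileOutputStream out = new FileOutputStream("DestinationFolder/MasterReport.xlsx")) {
                masterWorkbook.write(out);
            }
        }
    }

The DataFormatter keeps numeric and date cells readable as strings; adjust the cell handling if the master file needs typed values.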

The program is running fine. My question is: is there any way I can optimize this code by using threads to read the files? I know that there is only one file to write to, and that part cannot be done in parallel. If the SourceFolder contains 10,000 files, reading and writing this way will take a lot of time to execute.

The size of each input file will be around a few hundred KB.

So, my question is: can we use threads to read the files in batches, say 100 or 500 files per thread, and then write the data for each thread? I know the write part will need to be synchronized. This way, at least the read time will be minimized. Please let me know your thoughts on this.
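To make that idea concrete, here is a hedged sketch of one common pattern: a fixed thread pool performs the reads in parallel, and the master workbook is only ever touched from the main thread, which collects the results in submission order. This sidesteps explicit synchronization entirely, because POI workbook objects are not thread-safe. The readRow helper and the first-row-of-first-sheet assumption are the same illustrative ones as above:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.poi.ss.usermodel.*;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;

    public class ParallelMasterReport {

        // Reads the first row of one source file. Safe to call from many
        // threads, because each call opens its own workbook.
        static List<String> readRow(File file) throws IOException {
            DataFormatter formatter = new DataFormatter();
            try (Workbook wb = WorkbookFactory.create(file)) {
                List<String> values = new ArrayList<>();
                for (Cell cell : wb.getSheetAt(0).getRow(0)) {
                    values.add(formatter.formatCellValue(cell));
                }
                return values;
            }
        }

        public static void main(String[] args) throws Exception {
            File[] files = new File("SourceFolder").listFiles();
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());

            // Submit all reads; each Future holds one master row.
            List<Future<List<String>>> rows = new ArrayList<>();
            for (File f : files) {
                rows.add(pool.submit(() -> readRow(f)));
            }

            // Only the main thread touches the master workbook, so no explicit
            // synchronization is needed (POI workbooks are not thread-safe).
            Workbook master = new XSSFWorkbook();
            Sheet sheet = master.createSheet("Report");
            int rowNum = 0;
            for (Future<List<String>> future : rows) {
                Row row = sheet.createRow(rowNum++);
                List<String> values = future.get();
                for (int i = 0; i < values.size(); i++) {
                    row.createCell(i).setCellValue(values.get(i));
                }
            }
            pool.shutdown();

            try (FileOutputStream out = new FileOutputStream("DestinationFolder/MasterReport.xlsx")) {
                master.write(out);
            }
        }
    }

Writing from a single thread in Future order also keeps the master rows in the same order as the file listing, which a set of synchronized writers would not guarantee.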

With 10k files of ~100 KB each, we're talking about reading roughly 1 GB of data. If the processing is not overly complex (and it seems it is not), then your bottleneck will be I/O.

So it most probably does not make sense to parallelize reading and processing the files, as I/O has an upper limit. Parallelizing would have made sense if the processing were complex and the bottleneck; that does not seem to be the case here.
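Before adding threads, it may be worth measuring where the time actually goes. A rough sketch of such a check, reusing the hypothetical readRow helper from the sketch above:

    import java.io.File;
    import java.util.List;

    public class BottleneckCheck {
        public static void main(String[] args) throws Exception {
            File[] files = new File("SourceFolder").listFiles();
            long readNanos = 0;
            long started = System.nanoTime();
            for (File f : files) {
                long t0 = System.nanoTime();
                List<String> row = ParallelMasterReport.readRow(f);
                readNanos += System.nanoTime() - t0;
                // ... append 'row' to the master workbook as before ...
            }
            long totalNanos = System.nanoTime() - started;
            System.out.printf("read: %.1f s of %.1f s total%n",
                    readNanos / 1e9, totalNanos / 1e9);
        }
    }

If reading dominates and the disk is already saturated, more threads will not help much; if CPU-side parsing dominates instead, the pool approach sketched earlier can.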
