
How to process a large number of XML files and write to text files faster in Java

I receive millions of XML files a day. The sizes of the XML files range from 10 KB to 50 MB.

I have written a SAX parser to parse the XML files and write the output to text files. From all those millions of XML files I produce 35 distinct text files. I have to parse the XML files on a first-come, first-served basis so that the order of the records is maintained.

I have to process the files very quickly.

The total size of the XML files is approximately 1 TB. I have not implemented multithreading to process the XML files because I have to process them on a first-come, first-served basis.

How can I process all the XML files quickly?

Before moving my code into production I just wanted to check whether I need to rethink my implementation.

This is how I read the XML files and process them:

public static void main(String[] args) {
        File folder = new File("c://temp//SDIFILES");

        File[] files = folder.listFiles();

        Arrays.sort(files, new Comparator<Object>() {
            public int compare(Object o1, Object o2) {

                if (((File) o1).lastModified() > ((File) o2).lastModified()) {
                    return -1;
                } else if (((File) o1).lastModified() < ((File) o2).lastModified()) {
                    return +1;
                } else {
                    return 0;
                }
            }

        });

        for (File file : files) {
            System.out.println("Started Processing file :" + Arrays.asList(file));
            new MySaxParser(file);
        }

    }

I am not sure my approach will work for millions of XML files.
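For reference, a minimal sketch of SAX parsing with the JDK's built-in `javax.xml.parsers` API, of the kind `MySaxParser` presumably wraps (`MinimalSax` and `textContent` are illustrative names, not the actual class):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MinimalSax {
    // Collects all character data seen by the parser.
    static class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    public static String textContent(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // No validation, as suggested in the answers below; it costs time.
        factory.setValidating(false);
        SAXParser parser = factory.newSAXParser();
        CollectingHandler handler = new CollectingHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textContent("<root><a>hi</a><b>there</b></root>"));
    }
}
```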

As you said, you have to process the files on a first-come, first-served basis. You could treat each XML file as a separate task and then use multithreading to process the files. I think you can save a lot of time this way.
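One sketch of that idea: submit the files to a thread pool in arrival order and read the `Future`s back in the same order, so the output order is preserved even though the parsing itself runs in parallel. `parseXml` here is a stand-in for the real SAX parsing, and the pool size is an arbitrary assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedParsing {
    // Stand-in for the real SAX parsing of one file; returns the text
    // that would be appended to the output files.
    static String parseXml(String fileName) {
        return "parsed:" + fileName;
    }

    // Parse concurrently, but collect results in submission order so the
    // record order in the output matches the arrival order of the files.
    public static List<String> parseAll(List<String> filesSortedByArrival)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> futures = new ArrayList<>();
        for (String f : filesSortedByArrival) {
            futures.add(pool.submit(() -> parseXml(f)));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> future : futures) {
            results.add(future.get()); // blocks until that file is done
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        // → [parsed:a.xml, parsed:b.xml, parsed:c.xml]
        System.out.println(parseAll(List.of("a.xml", "b.xml", "c.xml")));
    }
}
```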

Immediately:

return Long.compare(((File) o1).lastModified(), ((File) o2).lastModified());
  • read and write buffered
  • be careful with String operations
  • no validation
  • for DTDs use XML catalogs
  • use a profiler! (it saved me once when generating Excel files)
  • if possible, use a database instead of 35 output files
  • consider a RAM disk or the like
  • and of course plenty of memory (-Xmx)
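On the buffered-I/O point: one sketch is to open each of the 35 output files once with a buffered, append-mode writer and reuse it for every input XML, rather than reopening a file per record. `OutputWriters` and `writerFor` are illustrative names, not an existing API:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class OutputWriters {
    private final Map<String, Writer> writers = new HashMap<>();
    private final Path dir;

    public OutputWriters(Path dir) {
        this.dir = dir;
    }

    // One buffered, append-mode writer per output file, opened lazily
    // on first use and then reused for all subsequent records.
    public Writer writerFor(String name) throws IOException {
        Writer w = writers.get(name);
        if (w == null) {
            w = Files.newBufferedWriter(dir.resolve(name + ".txt"),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            writers.put(name, w);
        }
        return w;
    }

    // Close (and flush) all writers when the whole run is finished.
    public void closeAll() throws IOException {
        for (Writer w : writers.values()) {
            w.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sdi-out");
        OutputWriters out = new OutputWriters(dir);
        out.writerFor("trades").write("record 1\n");
        out.writerFor("trades").write("record 2\n");
        out.closeAll();
        System.out.print(Files.readString(dir.resolve("trades.txt")));
    }
}
```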

The last resort, an XML pull parser (StAX) instead of Xalan/Xerces, or plain-text parsing, is what you are trying to avoid; so no comment on that.

Arrays.sort(files, new Comparator<File>() {
        @Override
        public int compare(File o1, File o2) {
            return Long.compare(o1.lastModified(), o2.lastModified());
        }
    });
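On Java 8+ the same comparator can be written with `Comparator.comparingLong`; `sortOldestFirst` below is just an illustrative wrapper:

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

public class SortByTimestamp {
    // Oldest first, i.e. first-come, first-served order.
    public static void sortOldestFirst(File[] files) {
        Arrays.sort(files, Comparator.comparingLong(File::lastModified));
    }

    public static void main(String[] args) throws Exception {
        long now = System.currentTimeMillis();
        File older = File.createTempFile("first", ".xml");
        File newer = File.createTempFile("second", ".xml");
        older.setLastModified(now - 100_000);
        newer.setLastModified(now);

        File[] files = { newer, older };
        sortOldestFirst(files);
        System.out.println(files[0].getName()); // the older file comes first
    }
}
```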

There are a number of things to consider...

  1. Is this a batch process, where all the files are already in the c://temp//SDIFILES folder, or is it a kind of event listener waiting for the next file to appear there?

  2. Do you have XSD schemas for all those XMLs? If so, you might consider using a JAXB unmarshaller up front instead of a custom SAX parser.

IMHO, at first glance...

  1. If it is a batch process, separate the parsing from combining results into text files. Then you can parse files with multiple threads, putting results into some temp/stage files or objects before writing them to the text files, i.e.:

    • run as many parsing threads as your resources (memory/CPU) allow
    • put each parser's result aside in a temp file / DB / in-memory map etc., together with its sequence number or timestamp
    • combine the ready results into the text files as the last step of the whole process. That way you do not have to wait for one XML file to finish parsing before starting the next.
  2. If it is a listener, it can also parse with multiple threads, but a little more is needed. For example, run the step that combines results into text files periodically (say every 10 seconds), picking up the temp result files marked as ready.

In both cases it will be a "portioned" process. Say you can run 5 parsing threads for the next 5 files from the list sorted by timestamp, then wait until all 5 parsing threads have completed (the result does not have to be a temp file; it can stay in memory if possible), then combine into the text files... then pick the next 5 files, and so on...
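The batch-of-5 idea above can be sketched with `ExecutorService.invokeAll`, which waits for a whole batch and returns its futures in submission order. `BatchedParsing` and the `parseXml` stand-in are illustrative, not the asker's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchedParsing {
    // Stand-in for the real SAX parsing of one file.
    static String parseXml(String name) {
        return "parsed:" + name;
    }

    // Process files in batches: parse one batch in parallel, wait for the
    // whole batch, append its results in order, then take the next batch.
    public static List<String> processInBatches(List<String> sorted, int batchSize)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(batchSize);
        List<String> combined = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += batchSize) {
            List<Callable<String>> batch = new ArrayList<>();
            for (String f : sorted.subList(i, Math.min(i + batchSize, sorted.size()))) {
                batch.add(() -> parseXml(f));
            }
            // invokeAll blocks until every task in the batch is done and
            // returns the futures in the order the tasks were given.
            for (Future<String> result : pool.invokeAll(batch)) {
                combined.add(result.get());
            }
        }
        pool.shutdown();
        return combined;
    }

    public static void main(String[] args) throws Exception {
        // → [parsed:1.xml, parsed:2.xml, parsed:3.xml]
        System.out.println(processInBatches(List.of("1.xml", "2.xml", "3.xml"), 2));
    }
}
```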

... something like that...

Definitely, a sequential process over that large a number of files will take time, and most of it will go into parsing the XML.
