
Java | XML Split by Size | HashMap Performance Issue | OOM Heap Space Error

The requirement is to split XML documents larger than 5 MB into smaller chunks so that the target system can accept and process them. Because XSLT v2 does not seem to support splitting an XML document by size, we ended up writing a Java program. The program works well when the document is small, i.e. less than 10 MB, but it fails outright when a 32 MB file is fed in. The program works as an agent and is plugged into a JVM whose maximum memory is set to 25 GB. Despite this, we persistently see an OOM heap space error. Generating a heap dump file reveals the following as problem suspect 1:

"sun.misc.Launcher$AppClassLoader @ 0x1bb7ae098" occupies 156,512,240 (64.62%) bytes. The memory is accumulated in one instance of

Based on this, I began inspecting the program and narrowed down a spot that could potentially be inducing the memory issue, which is the following [you may disregard a few of the sysouts; they were added for my debugging session]:

public static HashMap<Integer, String> splitPromotionItem(List promotionsItems, int promotionItemMaxSizeUoMNumericValue, int promotionItemMaxSize, String routingLocation, String docNum, XDNode messageHeader, XDNode promotionsData) {
    HashMap<Integer, String> promotionItemMap = new HashMap<Integer, String>();
    int totalSubMessage = 1;
    String promotionsItemsData = "";
    int promotionsItemsSize = 0;
    String promotionsItemsDataTemp = "";
    int i = 0;
    int q = 1; // debug counter only
    do {
        // Accumulate the serialized form and byte size of the current item.
        promotionsItemsSize = promotionsItemsSize + ((XDNode) promotionsItems.get(i)).flatten().getBytes().length;
        promotionsItemsData = promotionsItemsData + ((XDNode) promotionsItems.get(i)).flatten();

        if (promotionsItemsSize > (promotionItemMaxSize * 1024 * 1024)) {
            // Limit exceeded: roll back to the previous accumulation and
            // store it as one finished chunk.
            System.out.println("Inside First If: " + promotionsItems.size() + ": " + q++);
            promotionsItemsSize = promotionsItemsSize - ((XDNode) promotionsItems.get(i)).flatten().getBytes().length;
            promotionsItemsData = promotionsItemsDataTemp;
            promotionItemMap.put(totalSubMessage++, promotionsItemsData);
            if (i != (promotionsItems.size() - 1)) {
                // Not the last item: step back so it is reprocessed into a fresh chunk.
                System.out.println("Inside Second If: " + promotionsItems.size());
                i--;
                promotionsItemsSize = 0;
                promotionsItemsData = "";
            } else {
                // Last item: start a fresh chunk containing only this item.
                System.out.println("Inside Second Else: " + promotionsItems.size());
                promotionsItemsSize = ((XDNode) promotionsItems.get(i)).flatten().getBytes().length;
                promotionsItemsData = ((XDNode) promotionsItems.get(i)).flatten();
            }
        }
        if (promotionsItemsSize < (promotionItemMaxSize * 1024 * 1024) && i == (promotionsItems.size() - 1)) {
            // Last item reached while still under the limit: store the remainder.
            promotionItemMap.put(totalSubMessage++, promotionsItemsData);
        }
        i++;
        promotionsItemsDataTemp = promotionsItemsData;
    } while (i < promotionsItems.size());

    return promotionItemMap;
}

The program appears to first split the large XML document into smaller chunks that are stored in a HashMap, which is later fed to a function that iterates through each entry in the map and writes it to a file. The file name, and one of the elements inside each chunk, bears the index of the file within the split batch and the total split count, for easy recognition.

My initial thought was to revise the code as follows: instead of collecting the smaller XML chunks into a HashMap, write them to files directly. This also means that after all of the smaller chunks are saved to disk, I must reopen each one to update its content, so that the file index and total count are reflected, and rename the file itself.
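A minimal sketch of what I have in mind, assuming each chunk is first written with placeholder tokens (@@INDEX@@ and @@TOTAL@@, purely illustrative) that a second pass fills in once the total count is known; the file-naming scheme is also an assumption:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SplitFilePatcher {
    // Second pass: fill in the index/total placeholders of each temp chunk
    // and rename it to its final name. Names and tokens are illustrative.
    static void patchSplitFiles(Path dir, int totalParts) throws IOException {
        for (int part = 1; part <= totalParts; part++) {
            Path tmp = dir.resolve("chunk-" + part + ".tmp");
            String xml = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8)
                    .replace("@@INDEX@@", String.valueOf(part))
                    .replace("@@TOTAL@@", String.valueOf(totalParts));
            Files.write(dir.resolve("chunk-" + part + "-of-" + totalParts + ".xml"),
                    xml.getBytes(StandardCharsets.UTF_8));
            Files.delete(tmp);
        }
    }
}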

Is there any better way of handling this? Please help.

Note: the JVM handles a high volume of data every day and runs with the following start-up options; we use Saxon as the XSLT processor:

-Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl -Xmx15360M -Xrs -XX:GCTimeRatio=5 -XX:+PrintGCDetails -Xloggc:<location> -XX:MinHeapFreeRatio=25 -XX:MaxHeapFreeRatio=60
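As an aside, when chasing an OOM it can help to let the JVM capture the dump automatically at the moment of failure; these are standard HotSpot flags and are not part of the start-up options above:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<location>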

Update 29-11-2017

The class XDNode and its function flatten come from extending the program with an API offered by iWay, so that the agent can be plugged into its JVM for seamless execution of process flows. Here is the official definition of XDNode:

An XDNode is a single element of an XML tree. A complete document is a tree of XDNodes. The XDNode class and tree are designed for fast parsing and searching, and for easy manipulation in an application. Methods are available to convert between XDNode trees and standard JDOM trees. All server operations are performed on trees of XDNodes.

The function flatten() returns the entire XML document as a String.

Here is an example of what the XML document looks like:

[sample XML document image]

The split operation is performed at the element /SalonApps/Promotion/PromotionData/PromotionItem. We iterate through each occurrence of PromotionItem and store the accumulated chunk in a temp variable, as seen in the code above. At the beginning of each iteration we also check whether the size exceeds the limit of 5 MB [defined at the beginning of the class] to decide whether a packaging and file-write operation is needed. While the size is below the limit, the iteration progresses further, collecting and storing. The header section [/SalonApps/Promotion/MessageHeader] of the document is added to each split document, with the value of MessageID modified so that its 2nd and 3rd hyphen-delimited fields carry the index of the split message within the batch and the total batch count.
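For illustration, the MessageID rewrite described above could look like this; the helper and the exact layout of the ID are assumptions:

// Hypothetical helper: place the split index and total count into the
// 2nd and 3rd hyphen-delimited fields of the MessageID.
static String stampMessageId(String messageId, int index, int total) {
    String[] parts = messageId.split("-");
    parts[1] = String.valueOf(index); // 2nd field: index within the batch
    parts[2] = String.valueOf(total); // 3rd field: total batch count
    return String.join("-", parts);
}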

We support XSLT v1 and v2 only. If XSLT v1 or v2 can be used to split XML documents by size, that would be great.

I find it very hard to understand exactly what you are trying to do; it is certainly very difficult to gain any insight by reverse-engineering your sample code. But you have expressed interest in an XSLT solution, so here is a suggestion.

If your document is essentially a flat structure of the form:

<table>
  <record>...</record>
  <record>...</record>
  ...
</table>

and if the number of records is a reasonable proxy for document size, then you can easily split it into fragments, each with a maximum size of N records, using:

<xsl:template match="table">
   <xsl:for-each-group select="record" group-adjacent="(position()-1) idiv $N">
     <xsl:result-document href="part{position()}">
       <table>
         <xsl:copy-of select="current-group()"/>
       </table>
     </xsl:result-document>
  </xsl:for-each-group>
</xsl:template>
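To make this runnable, $N must be declared as a stylesheet parameter, for example in a wrapper like this (file names and the default value are illustrative):

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:param name="N" as="xs:integer" select="1000"/>
  <!-- ...the table template from above goes here... -->
</xsl:stylesheet>

With Saxon it could then be invoked along the lines of:

java -cp saxon9he.jar net.sf.saxon.Transform -s:input.xml -xsl:split.xsl N=500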

Note also that this solution is streamable if you use XSLT 3.0 (though streaming shouldn't be necessary until you start handling 200 MB or more).
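(In XSLT 3.0 the streaming variant would, roughly, just add a streamable mode declaration; this single line is a sketch rather than tested code, and streaming requires Saxon-EE:)

<xsl:mode streaming="yes"/>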

If that's NOT what you are trying to do, then you need to explain your requirements more clearly.

The basic cause of your problem is probably this:

promotionsItemsData = 
   promotionsItemsData + ((XDNode) promotionsItems.get(i)).flatten();

where you are building large strings in a loop by incremental string concatenation. That's very bad news in Java; you should build the string with a StringBuilder.
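A minimal sketch of the accumulation loop rewritten that way, calling flatten() once per item and reusing the result (the chunking logic around it is elided):

// Accumulate with StringBuilder; flatten() is called once per item.
StringBuilder promotionsItemsData = new StringBuilder();
int promotionsItemsSize = 0;
for (int i = 0; i < promotionsItems.size(); i++) {
    String flat = ((XDNode) promotionsItems.get(i)).flatten();
    promotionsItemsSize += flat.getBytes().length;
    promotionsItemsData.append(flat);
    // ...size check and chunk hand-off as before, calling
    // promotionsItemsData.toString() only when a chunk is emitted...
}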

That should probably be enough to fix the problem, though I would personally tackle it in a completely different way: I would decide where to split the file based on some metric applied to the tree view of the document and, having selected which nodes to put in each output part, serialize them in the regular way, rather than serializing nodes and measuring the size of the serialized parts.
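A rough sketch of that approach, still using flatten() as the size metric for lack of a cheaper one (with a true tree-level estimate this loop would not need to serialize at all):

import java.util.ArrayList;
import java.util.List;

// Group items into batches under a byte budget; each finished group is
// then serialized exactly once, rather than re-flattened per iteration.
static List<List<XDNode>> groupBySize(List<XDNode> items, long maxBytes) {
    List<List<XDNode>> groups = new ArrayList<>();
    List<XDNode> current = new ArrayList<>();
    long budget = 0;
    for (XDNode item : items) {
        long size = item.flatten().getBytes().length; // or a cheaper estimate
        if (budget + size > maxBytes && !current.isEmpty()) {
            groups.add(current);
            current = new ArrayList<>();
            budget = 0;
        }
        current.add(item);
        budget += size;
    }
    if (!current.isEmpty()) groups.add(current);
    return groups;
}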
