How to improve splitting xml file performance

I've seen quite a lot of posts/blogs/articles about splitting an XML file into smaller chunks, and decided to write my own splitter because I have some custom requirements. Here is what I mean; consider the following XML:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?> 
<company>
 <staff id="1">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <nickname>mkyong</nickname>
    <salary>100000</salary>
   </staff>
 <staff id="2">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <nickname>mkyong</nickname>
    <salary>100000</salary>
   </staff>
 <staff id="3">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <nickname>mkyong</nickname>
    <salary>100000</salary>
   </staff>
 <staff id="4">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <nickname>mkyong</nickname>
    <salary>100000</salary>
   </staff>
 <staff id="5">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <salary>100000</salary>
   </staff>
</company>

I want to split this XML into n parts, one file per part, but a staff element must contain nickname; if it doesn't, I don't want it. So this should produce 4 XML splits, containing the staff elements with ids 1 through 4.

Here is my code:

public int split() throws Exception{
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputFilePath)));

        String line;
        List<String> tempList = null;

        while((line=br.readLine())!=null){
            if(line.contains("<?xml version=\"1.0\"") || line.contains("<" + rootElement + ">") || line.contains("</" + rootElement + ">")){
                continue;
            }

            if(line.contains("<"+ element +">")){
                tempList = new ArrayList<String>();
            }
            tempList.add(line);

            if(line.contains("</"+ element +">")){
                if(hasConditions(tempList)){
                    writeToSplitFile(tempList);
                    writtenObjectCounter++;
                    totalCounter++;
                }
            }

            if(writtenObjectCounter == itemsPerFile){
                writtenObjectCounter = 0;
                fileCounter++;          
                tempList.clear();
            }
        }

        if(tempList.size() != 0){
        writeClosingRootElement();
        }

        return totalCounter;
    }

    private void writeToSplitFile(List<String> itemList) throws Exception{
        BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
        if(writtenObjectCounter == 0){
        wr.write("<" + rootElement + ">");
        wr.write("\n");
        }

        for (String string : itemList) {
            wr.write(string);
            wr.write("\n");
        }

        if(writtenObjectCounter == itemsPerFile-1)
        wr.write("</" + rootElement + ">");
        wr.close();
    }

    private void writeClosingRootElement() throws Exception{
        BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
        wr.write("</" + rootElement + ">");
        wr.close();
    }

    private boolean hasConditions(List<String> list){
        int matchList = 0;

        for (String condition : conditionList) {
            for (String string : list) {
                if(string.contains(condition)){
                    matchList++;
                }
            }
        }

        if(matchList >= conditionList.size()){
            return true;
        }

        return false;
    }

I know that opening and closing a stream for each written staff element does impact performance. The alternative would be to write once per file (which may contain n staff elements). Naturally, the root and split elements are configurable.

Any ideas how I can improve the performance/logic? I'd prefer some code, but good advice can sometimes be better.

Edit:

This XML example is actually a dummy example. The real XML I'm trying to split has about 300-500 different elements under the split element, all appearing in random order, and their number varies. StAX may not be the best solution after all?

Bounty update:

I'm looking for a solution (code) that will:

  • Be able to split an XML file into n parts with x split elements (in the dummy XML example, staff is the split element).

  • Wrap the content of the split files in the root element from the original file (company in the dummy example).

  • Let me specify a condition that must hold in the split element, i.e. I want only staff elements which have a nickname, and I want to discard those without nicknames. It should also be possible to split without any condition.

  • The code doesn't necessarily have to improve on my solution (which lacks good logic and performance, but works).

And I'm not happy with "but it works". I also can't find enough examples of StAX for this kind of operation, and the user community is not great either. It doesn't have to be a StAX solution.

I'm probably asking too much, but I'm here to learn stuff, and I think I'm giving a good bounty for the solution.

First piece of advice: don't try to write your own XML handling code. Use an XML parser; it's going to be much more reliable and quite possibly faster.

If you use an XML pull parser (e.g. StAX) you should be able to read one element at a time and write it out to disk, never reading the whole document in one go.
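As a rough illustration of that advice, here is a minimal sketch using the StAX event API from the JDK. The class name `StaxSplitSketch`, the `split` signature and the `split_N.xml` file names are assumptions made for this example, not part of the question. It buffers only one split element at a time, keeps a single writer open per output file instead of reopening it for every element, and treats a `null` required child as "no condition". Error handling is kept minimal, so an output stream may be left open if parsing fails midway.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class StaxSplitSketch {

    /** Writes at most itemsPerFile matching elements per output file; returns the files written. */
    static List<Path> split(Path input, Path outDir, String splitElement,
                            String requiredChild, int itemsPerFile) throws Exception {
        XMLOutputFactory xof = XMLOutputFactory.newInstance();
        XMLEventFactory xef = XMLEventFactory.newInstance();
        List<Path> files = new ArrayList<>();
        StartElement root = null;
        XMLEventWriter writer = null;
        OutputStream out = null;
        int inCurrentFile = 0;

        try (InputStream in = new BufferedInputStream(Files.newInputStream(input))) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue;
                }
                StartElement start = event.asStartElement();
                if (root == null) {
                    root = start; // the first start element is the root
                    continue;
                }
                if (!start.getName().getLocalPart().equals(splitElement)) {
                    continue;
                }
                // Buffer the events of one split element while checking the condition.
                List<XMLEvent> buffer = new ArrayList<>();
                buffer.add(start);
                boolean keep = (requiredChild == null);
                for (int depth = 1; depth > 0; ) {
                    XMLEvent e = reader.nextEvent();
                    buffer.add(e);
                    if (e.isStartElement()) {
                        depth++;
                        keep |= e.asStartElement().getName().getLocalPart().equals(requiredChild);
                    } else if (e.isEndElement()) {
                        depth--;
                    }
                }
                if (!keep) {
                    continue; // condition not met: drop the buffered element
                }
                if (writer == null) { // lazily open the next output file
                    Path file = outDir.resolve("split_" + files.size() + ".xml");
                    files.add(file);
                    out = new BufferedOutputStream(Files.newOutputStream(file));
                    writer = xof.createXMLEventWriter(out, "UTF-8");
                    writer.add(xef.createStartDocument("UTF-8", "1.0"));
                    writer.add(root);
                }
                for (XMLEvent e : buffer) {
                    writer.add(e);
                }
                if (++inCurrentFile == itemsPerFile) {
                    finish(writer, out, root, xef);
                    writer = null;
                    inCurrentFile = 0;
                }
            }
            if (writer != null) {
                finish(writer, out, root, xef);
            }
        }
        return files;
    }

    /** Closes the root element and the document, then releases the writer and the stream. */
    private static void finish(XMLEventWriter writer, OutputStream out,
                               StartElement root, XMLEventFactory xef) throws Exception {
        writer.add(xef.createEndElement(root.getName(), null));
        writer.add(xef.createEndDocument());
        writer.close();
        out.close();
    }
}
```

With the dummy XML from the question, a call such as `StaxSplitSketch.split(input, outDir, "staff", "nickname", 2)` would be expected to produce two files holding two staff elements each.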

Here's my suggestion. It requires a streaming XSLT 3.0 processor, which in practice means it needs Saxon-EE 9.3.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">

<xsl:mode streamable="yes"/>

<xsl:template match="/">
  <xsl:apply-templates select="company/staff"/>
</xsl:template>

<xsl:template match="staff">
  <xsl:variable name="v" as="element(staff)">
    <xsl:copy-of select="."/>
  </xsl:variable>
  <xsl:if test="$v/nickname">
    <xsl:result-document href="{@id}.xml">
      <xsl:copy-of select="$v"/>
    </xsl:result-document>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>

In practice, though, unless you have hundreds of megabytes of data, I suspect a non-streaming solution will be quite fast enough, and probably faster than your hand-written Java code, given that your Java code is nothing to get excited about. At any rate, give an XSLT solution a try before you write reams of low-level Java. It's a routine problem, after all.
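For comparison, a non-streaming split can also be sketched in plain Java with the DOM and XPath APIs that ship with the JDK. Note that this swaps XSLT for DOM plus XPath, so it is not the stylesheet above, only an illustration of the same non-streaming idea. The class name `DomSplitSketch` and its `split` method are hypothetical, and the whole document is held in memory, so it only suits inputs that fit comfortably in the heap.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomSplitSketch {

    /** Returns one serialized document per node selected by the XPath expression. */
    static List<String> split(String xml, String xpath) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = builder.parse(new InputSource(new StringReader(xml)));

        // The condition lives in the XPath, e.g. /company/staff[nickname]
        NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, source, XPathConstants.NODESET);

        // An identity transformer is used purely as a serializer.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> parts = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            Document part = builder.newDocument();
            // Shallow copy of the original root: keeps its name and attributes, not its children.
            Element root = (Element) part.importNode(source.getDocumentElement(), false);
            part.appendChild(root);
            root.appendChild(part.importNode(matches.item(i), true));

            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(part), new StreamResult(out));
            parts.add(out.toString());
        }
        return parts;
    }
}
```

Passing `/company/staff` instead of `/company/staff[nickname]` would split without a condition.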

You could do the following with StAX:

Algorithm

  1. Read and hold onto the root element event.
  2. Read the first chunk of XML:
    1. Queue events until the condition has been met.
    2. If the condition has been met:
      1. Write the start document event.
      2. Write out the root start element event.
      3. Write out the split start element event.
      4. Write out the queued events.
      5. Write out the remaining events for this section.
    3. If the condition was not met, do nothing.
  3. Repeat step 2 with the next chunk of XML.

Code for Your Use Case

The following code uses the StAX APIs to break up the document as outlined in your question:

package forum7408938;

import java.io.*;
import java.util.*;

import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.*;

public class Demo {

    public static void main(String[] args) throws Exception  {
        Demo demo = new Demo();
        demo.split("src/forum7408938/input.xml", "nickname");
        //demo.split("src/forum7408938/input.xml", null);
    }

    private void split(String xmlResource, String condition) throws Exception {
        XMLEventFactory xef = XMLEventFactory.newFactory();
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLEventReader xer = xif.createXMLEventReader(new FileReader(xmlResource));
        StartElement rootStartElement = xer.nextTag().asStartElement(); // Advance to the root element
        StartDocument startDocument = xef.createStartDocument();
        EndDocument endDocument = xef.createEndDocument();

        XMLOutputFactory xof = XMLOutputFactory.newFactory();
        while(xer.hasNext() && !xer.peek().isEndDocument()) {
            boolean metCondition;
            XMLEvent xmlEvent = xer.nextTag();
            if(!xmlEvent.isStartElement()) {
                break;
            }
            // BOUNTY CRITERIA
            // Be able to split XML file into n parts with x split elements(from
            // the dummy XML example staff is the split element).
            StartElement breakStartElement = xmlEvent.asStartElement();
            List<XMLEvent> cachedXMLEvents = new ArrayList<XMLEvent>();

            // BOUNTY CRITERIA
            // I'd like to be able to specify condition that must be in the 
            // split element i.e. I want only staff which have nickname, I want 
            // to discard those without nicknames. But be able to also split 
            // without conditions while running split without conditions.
            if(null == condition) {
                cachedXMLEvents.add(breakStartElement);
                metCondition = true;
            } else {
                cachedXMLEvents.add(breakStartElement);
                xmlEvent = xer.nextEvent();
                metCondition = false;
                while(!(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
                    cachedXMLEvents.add(xmlEvent);
                    if(xmlEvent.isStartElement() && xmlEvent.asStartElement().getName().getLocalPart().equals(condition)) {
                        metCondition = true;
                        break;
                    }
                    xmlEvent = xer.nextEvent();
                }
            }

            if(metCondition) {
                // Create a file for the fragment, the name is derived from the value of the id attribute
                FileWriter fileWriter = null;
                fileWriter = new FileWriter("src/forum7408938/" + breakStartElement.getAttributeByName(new QName("id")).getValue() + ".xml");

                // A StAX XMLEventWriter will be used to write the XML fragment
                XMLEventWriter xew = xof.createXMLEventWriter(fileWriter);
                xew.add(startDocument);

                // BOUNTY CRITERIA
                // The content of the split files should be wrapped in the 
                // root element from the original file(like in the dummy example
                // company)
                xew.add(rootStartElement);

                // Write the XMLEvents that were cached while we were
                // checking the fragment to see if it matched our criteria.
                for(XMLEvent cachedEvent : cachedXMLEvents) {
                    xew.add(cachedEvent);
                }

                // Write the XMLEvents that we still need to parse from this
                // fragment
                xmlEvent = xer.nextEvent();
                while(xer.hasNext() && !(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
                    xew.add(xmlEvent);
                    xmlEvent = xer.nextEvent();
                }
                xew.add(xmlEvent);

                // Close everything we opened
                xew.add(xef.createEndElement(rootStartElement.getName(), null));
                xew.add(endDocument);
                xew.close(); // flush anything the event writer may still be buffering
                fileWriter.close();
            }
        }
    }

}

@Jon Skeet is spot on as usual in his advice. @Blaise Doughan gave you a very basic picture of using StAX (which would be my preferred choice, although you can do basically the same thing with SAX). You seem to be looking for something more explicit, so here's some pseudo code to get you started (based on StAX):

  1. find the first "staff" StartElement
  2. set a flag indicating you are in a "staff" element and start tracking the depth (StartElement is +1, EndElement is -1)
  3. now, process the "staff" sub-elements, grab any of the data you care about and put it in a file (or wherever)
  4. keep processing until your depth reaches 0 (when you find the matching "staff" EndElement)
  5. unset the flag indicating you are in a "staff" element
  6. search for the next "staff" StartElement
  7. if found, go to 2. and repeat
  8. if not found, the document is complete
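The steps above could look roughly like the following sketch, which uses the cursor-style `XMLStreamReader` from the JDK. The class `DepthTrackingSketch` and its `read` method are made up for this illustration; a non-null `current` map plays the role of the "in a staff element" flag, and the sketch simply collects the text of each direct child instead of writing files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class DepthTrackingSketch {

    /** Returns, for each split element, a map of direct child element name to its text. */
    static List<Map<String, String>> read(String xml, String splitElement) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<Map<String, String>> result = new ArrayList<>();
        Map<String, String> current = null; // non-null while inside a split element (the flag)
        int depth = 0;                      // depth relative to the split element
        String childName = null;
        StringBuilder text = new StringBuilder();

        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (current == null) {
                    if (reader.getLocalName().equals(splitElement)) {
                        current = new LinkedHashMap<>(); // steps 1 and 2: set flag, start depth
                        depth = 1;
                    }
                } else {
                    depth++;
                    if (depth == 2) { // a direct child of the split element
                        childName = reader.getLocalName();
                        text.setLength(0);
                    }
                }
            } else if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                if (current != null && depth >= 2) {
                    text.append(reader.getText()); // step 3: grab the data you care about
                }
            } else if (event == XMLStreamConstants.END_ELEMENT && current != null) {
                if (depth == 2) {
                    current.put(childName, text.toString().trim());
                }
                depth--;
                if (depth == 0) { // steps 4 and 5: matching end element found, unset flag
                    result.add(current);
                    current = null;
                }
            }
        }
        reader.close();
        return result;
    }
}
```

A caller could then keep only the maps that contain the key `nickname` and write those out.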

EDIT:

Wow, I have to say I'm amazed at the number of people willing to do someone else's work for them. I didn't realize SO was basically a free version of rent-a-coder.

@Gandalf StormCrow: Let me divide your problem into three separate issues:

i) Reading the XML and simultaneously splitting it in the best possible way

ii) Checking the condition in the split file

iii) If the condition is met, processing that split file.

For i), there are of course multiple solutions: SAX, StAX and other parsers, or, as simple as what you mentioned, just reading with plain Java IO operations and searching for tags.

I believe any of SAX, StAX or plain Java IO will do. I have taken your example as the base for my solution.

ii) Checking the condition in the split file: you have used the contains() method to check for the existence of nickname. This does not seem the best way: what if your conditions are more complex, e.g. nickname must be present but with length > 5, or salary must be numeric?

I would use the Java XML validation framework (javax.xml.validation) for this, which makes use of XML Schema. Please note that we can cache the Schema object in memory and reuse it again and again. This validation framework is pretty fast.

iii) If the condition is met, process that split file. You may want to use the Java concurrency APIs to submit async tasks (ExecutorService) and achieve parallel execution for faster performance.

So, considering the above points, one possible solution can be:

You can create a company.xsd file like:

<?xml version="1.0" encoding="UTF-8"?>
<!-- No targetNamespace: the sample documents do not use a namespace. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="unqualified">
    <xs:element name="company">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="staff" type="stafftype"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

    <xs:complexType name="stafftype">
        <xs:sequence>
            <xs:element name="firstname" type="xs:string" minOccurs="0" />
            <xs:element name="lastname" type="xs:string" minOccurs="0" />
            <xs:element name="nickname" type="xs:string" minOccurs="1" />
            <xs:element name="salary" type="xs:int" minOccurs="0" />
        </xs:sequence>
        <!-- the staff elements in the sample carry an id attribute -->
        <xs:attribute name="id" type="xs:int"/>
    </xs:complexType>

</xs:schema>

Then your Java code would look like:

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

import org.xml.sax.SAXException;

public class testXML {
    //  Lookup a factory for the W3C XML Schema language
    static SchemaFactory factory = SchemaFactory
            .newInstance("http://www.w3.org/2001/XMLSchema");

    //  Compile the schema. 
    static File schemaLocation = new File("company.xsd");
    static Schema schema = null;
    static {
        try {
            schema = factory.newSchema(schemaLocation);
        } catch (SAXException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    private final ExecutorService pool = Executors.newFixedThreadPool(20);

    boolean validate(StringBuffer splitBuffer) {
        boolean isValid = false;
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new ByteArrayInputStream(
                    splitBuffer.toString().getBytes())));
            isValid = true;
        } catch (SAXException ex) {
            System.out.println(ex.getMessage());
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return isValid;

    }

    void split(BufferedReader br, String rootElementName,
            String splitElementName) {
        StringBuffer splitBuffer = null;
        String line = null;
        String startRootElement = "<" + rootElementName + ">";
        String endRootElement = "</" + rootElementName + ">";

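        // No closing '>' here, so that start tags with attributes such as
        // <staff id="1"> are matched too (note: this also matches any other
        // element whose name merely starts with splitElementName).
        String startSplitElement = "<" + splitElementName;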
        String endSplitElement = "</" + splitElementName + ">";
        String xmlDeclaration = "<?xml version=\"1.0\"";
        boolean startFlag = false, endflag = false;
        try {
            while ((line = br.readLine()) != null) {
                if (line.contains(xmlDeclaration)
                        || line.contains(startRootElement)
                        || line.contains(endRootElement)) {
                    continue;
                }

                if (line.contains(startSplitElement)) {
                    startFlag = true;
                    endflag = false;
                    splitBuffer = new StringBuffer(startRootElement);
                    splitBuffer.append(line);

                } else if (line.contains(endSplitElement)) {
                    endflag = true;
                    startFlag = false;
                    splitBuffer.append(line);
                    splitBuffer.append(endRootElement);

                } else if (startFlag) {
                    splitBuffer.append(line);
                }

                if (endflag) {
                    // reset the flag so the same fragment is not processed
                    // again on a following line
                    endflag = false;
                    //process splitBuffer
                    boolean result = validate(splitBuffer);
                    if (result) {
                        //send it to a thread for processing further
                        //it is async so that main thread can continue for next

                        pool.submit(new ProcessingHandler(splitBuffer));

                    }
                }

            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }
}

class ProcessingHandler implements Runnable {
    String splitXML = null;

    ProcessingHandler(StringBuffer splitXMLBuffer) {
        this.splitXML = splitXMLBuffer.toString();
    }

    @Override
    public void run() {
        // do like writing to a file etc.

    }

}

Have a look at this. It is a slightly reworked sample from xmlpull.org:

http://www.xmlpull.org/v1/download/unpacked/doc/quick_intro.html

The following should do all you need, unless you have nested splitting tags like:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<company>
    <staff id="1">
        <firstname>yong</firstname>
        <lastname>mook kim</lastname>
        <nickname>mkyong</nickname>
        <salary>100000</salary>
        <other>
            <staff>
            ...
            </staff>
        </other>
    </staff>
</company>

To run it in pass-through mode, simply pass null as the required tag.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

public class XppSample {

private String rootTag;
private String splitTag;
private String requiredTag;
private int flushThreshold;
private String fileName;

private String rootTagEnd;

private boolean hasRequiredTag = false;
private int flushCount = 0;
private int fileNo = 0;
private String header;
private XmlPullParser xpp;
private StringBuilder nodeBuf = new StringBuilder();
private StringBuilder fileBuf = new StringBuilder();


public XppSample(String fileName, String rootTag, String splitTag, String requiredTag, int flushThreshold) throws XmlPullParserException, FileNotFoundException {

    this.rootTag = rootTag;
    rootTagEnd = "</" + rootTag + ">";
    this.splitTag = splitTag;
    this.requiredTag = requiredTag;
    this.flushThreshold = flushThreshold;
    this.fileName = fileName; 

    XmlPullParserFactory factory = XmlPullParserFactory.newInstance(System.getProperty(XmlPullParserFactory.PROPERTY_NAME), null);
    factory.setNamespaceAware(true);
    xpp = factory.newPullParser();
    xpp.setInput(new FileReader(fileName));
}


public void processDocument() throws XmlPullParserException, IOException {
    int eventType = xpp.getEventType();
    do {
        if(eventType == XmlPullParser.START_TAG) {
            processStartElement(xpp);
        } else if(eventType == XmlPullParser.END_TAG) {
            processEndElement(xpp);
        } else if(eventType == XmlPullParser.TEXT) {
            processText(xpp);
        }
        eventType = xpp.next();
    } while (eventType != XmlPullParser.END_DOCUMENT);

    saveFile();
}


public void processStartElement(XmlPullParser xpp) {

    int holderForStartAndLength[] = new int[2];
    String name = xpp.getName();
    char ch[] = xpp.getTextCharacters(holderForStartAndLength);
    int start = holderForStartAndLength[0];
    int length = holderForStartAndLength[1];

    if(name.equals(rootTag)) {
        int pos = start + length;
        header = new String(ch, 0, pos);
    } else {
        if(requiredTag==null || name.equals(requiredTag)) {
            hasRequiredTag = true;
        }
        nodeBuf.append(xpp.getText());
    }
}


public void flushBuffer() throws IOException {
    if(hasRequiredTag) {
        fileBuf.append(nodeBuf);
        if(((++flushCount)%flushThreshold)==0) {
            saveFile();
        }           
    }
    nodeBuf = new StringBuilder();
    hasRequiredTag = false;
}


public void saveFile() throws IOException {
    if(fileBuf.length()>0) {
        String splitFile = header + fileBuf.toString() + rootTagEnd;
        FileUtils.writeStringToFile(new File((fileNo++) + "_" + fileName), splitFile);
        fileBuf = new StringBuilder();
    }
}


public void processEndElement (XmlPullParser xpp) throws IOException {

    String name = xpp.getName();

    if(name.equals(rootTag)) {
        flushBuffer();
    } else {
        nodeBuf.append(xpp.getText());
        if(name.equals(splitTag)) {
            flushBuffer();
        }
    }
}


public void processText (XmlPullParser xpp) throws XmlPullParserException {

    int holderForStartAndLength[] = new int[2];
    char ch[] = xpp.getTextCharacters(holderForStartAndLength);
    int start = holderForStartAndLength[0];
    int length = holderForStartAndLength[1];
    String content = new String(ch, start, length);

    nodeBuf.append(content);
}


public static void main (String args[]) throws XmlPullParserException, IOException {

    //XppSample app = new XppSample("input.xml", "company", "staff", "nickname", 3);
    XppSample app = new XppSample("input.xml", "company", "staff", null, 3);
    app.processDocument();
}

}

Normally I would suggest using StAX, but it is unclear to me how 'stateful' your real XML is. If it is simple, use SAX for ultimate performance; if it is not so simple, use StAX. So you need to

  1. read bytes from disk
  2. convert them to characters
  3. parse the XML
  4. determine whether to keep the XML or throw it away (skip out of the subtree)
  5. write XML
  6. convert characters to bytes
  7. write to disk

Now, it might seem like steps 3-5 are the most resource-intensive, but I would rate them as

Most: 1 + 7
Middle: 2 + 6
Least: 3 + 4 + 5

As operations 1 and 7 are kind of separate from the rest, you should do them in an async way; at the very least, creating multiple small files is best done in n other threads, if you are familiar with multi-threading. For increased performance, you might also look into the new IO stuff in Java.
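One way to move step 7 off the parsing thread is sketched below with an `ExecutorService`. The class `AsyncWriteSketch`, its `writeAll` method and the `split_N.xml` naming are assumptions made for this example. For brevity it takes chunks that are already serialized and waits for all of them at the end; a real splitter would submit each chunk as soon as it has been produced.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class AsyncWriteSketch {

    /** Writes each chunk to its own file on a small thread pool and waits for all writes. */
    static List<Path> writeAll(Path outDir, List<String> chunks, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < chunks.size(); i++) {
                Path file = outDir.resolve("split_" + i + ".xml");
                String content = chunks.get(i);
                // The disk write runs on a pool thread, so the caller can keep parsing.
                Callable<Path> task =
                        () -> Files.write(file, content.getBytes(StandardCharsets.UTF_8));
                pending.add(pool.submit(task));
            }
            List<Path> written = new ArrayList<>();
            for (Future<Path> future : pending) {
                written.add(future.get()); // rethrows any write failure
            }
            return written;
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this actually helps depends on the disk and the file sizes, so it is worth measuring before and after.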

Now for steps 2 + 3 and 5 + 6 you can go a long way with FasterXML; it really does a lot of the stuff you are looking for, like triggering JVM hot-spot attention in the right places. It might even support async reading/writing, judging from a quick look through the code.

So then we are left with step 5, and depending on your logic, you should either

a. make an object binding, then decide what to do
b. write the XML anyway, hoping for the best, and then throw it away if no 'staff' element is present.

Whatever you do, object reuse is sensible. Note that both alternatives (obviously) require the same amount of parsing (skip out of the subtree ASAP), and for alternative b, a little extra XML is actually not so bad performance-wise; ideally, make sure your char buffers are larger than one unit.

Alternative b is the easiest to implement: simply copy the 'xml event' from your reader to your writer. An example for StAX:

private static void copyEvent(int event, XMLStreamReader  reader, XMLStreamWriter writer) throws XMLStreamException {
    if (event == XMLStreamConstants.START_ELEMENT) {
        String localName = reader.getLocalName();
        String namespace = reader.getNamespaceURI();
        // TODO check this stuff again before setting in production
        if (namespace != null) {
            if (writer.getPrefix(namespace) != null) {
                writer.writeStartElement(namespace, localName);
            } else {
                writer.writeStartElement(reader.getPrefix(), localName, namespace);
            }
        } else {
            writer.writeStartElement(localName);
        }
        // first: namespace definition attributes
        if(reader.getNamespaceCount() > 0) {
            int namespaces = reader.getNamespaceCount();

            for(int i = 0; i < namespaces; i++) {
                String namespaceURI = reader.getNamespaceURI(i);

                if(writer.getPrefix(namespaceURI) == null) {
                    String namespacePrefix = reader.getNamespacePrefix(i);

                    if(namespacePrefix == null) {
                        writer.writeDefaultNamespace(namespaceURI);
                    } else {
                        writer.writeNamespace(namespacePrefix, namespaceURI);
                    }
                }
            }
        }
        int attributes = reader.getAttributeCount();

        // then write the rest of the attributes
        for (int i = 0; i < attributes; i++) {
            String attributeNamespace = reader.getAttributeNamespace(i);
            if (attributeNamespace != null && attributeNamespace.length() != 0) {
                writer.writeAttribute(attributeNamespace, reader.getAttributeLocalName(i), reader.getAttributeValue(i));
            } else {
                writer.writeAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
            }
        }
    } else if (event == XMLStreamConstants.END_ELEMENT) {
        writer.writeEndElement();
    } else if (event == XMLStreamConstants.CDATA) {
        String array = reader.getText();
        writer.writeCData(array);
    } else if (event == XMLStreamConstants.COMMENT) {
        String array = reader.getText();
        writer.writeComment(array);
    } else if (event == XMLStreamConstants.CHARACTERS) {
        String array = reader.getText();
        if (array.length() > 0 && !reader.isWhiteSpace()) {
            writer.writeCharacters(array);
        }
    } else if (event == XMLStreamConstants.START_DOCUMENT) {
        writer.writeStartDocument();
    } else if (event == XMLStreamConstants.END_DOCUMENT) {
        writer.writeEndDocument();
    }
}

And for a subtree,

private static void copySubTree(XMLStreamReader reader, XMLStreamWriter writer) throws XMLStreamException {
    reader.require(XMLStreamConstants.START_ELEMENT, null, null);

    copyEvent(XMLStreamConstants.START_ELEMENT, reader, writer);

    int level = 1;
    do {
        int event = reader.next();
        if(event == XMLStreamConstants.START_ELEMENT) {
            level++;
        } else if(event == XMLStreamConstants.END_ELEMENT) {
            level--;
        }

        copyEvent(event, reader, writer);
    } while(level > 0);

}

From which you can probably deduce how to skip out to a certain level. In general, for stateful StAX parsing, use the pattern

private static void parseSubTree(XMLStreamReader reader) throws XMLStreamException {

    int level = 1;
    do {
        int event = reader.next();
        if(event == XMLStreamConstants.START_ELEMENT) {
            level++;
            // do stateful stuff here

            // for child logic:
            if(reader.getLocalName().equals("Whatever")) {
                parseSubTreeForWhatever(reader);
                level --; // read from level 1 to 0 in submethod.
            }

            // alternatively, faster
            if(level == 4) {
                parseSubTreeForWhateverAtRelativeLevel4(reader);
                level --; // read from level 1 to 0 in submethod.
            }


        } else if(event == XMLStreamConstants.END_ELEMENT) {
            level--;
            // do stateful stuff here, too
        }

    } while(level > 0);

}

where, at the start of the document, you read until the first start element and break (adding the writer and the copy calls for your use, of course, as above).
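The same level-counting idea can be used to skip a subtree that is not wanted. The following is a minimal sketch; the class `SkipSubTreeSketch` and the helper `elementNamesOutside` are hypothetical names used only to demonstrate the skip.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class SkipSubTreeSketch {

    /** Expects the reader on a START_ELEMENT; returns with it on the matching END_ELEMENT. */
    static void skipSubTree(XMLStreamReader reader) throws XMLStreamException {
        reader.require(XMLStreamConstants.START_ELEMENT, null, null);
        int level = 1;
        while (level > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                level++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                level--;
            }
        }
    }

    /** Collects the local names of all elements that are not inside (or equal to) a skipped element. */
    static List<String> elementNamesOutside(String xml, String skip) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                if (reader.getLocalName().equals(skip)) {
                    skipSubTree(reader); // jump straight to the end of the unwanted subtree
                } else {
                    names.add(reader.getLocalName());
                }
            }
        }
        reader.close();
        return names;
    }
}
```

In a splitter, the decision to call `skipSubTree` would come from the condition check rather than from a fixed element name.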

Note that if you do object binding, these methods should be placed in the bound object, and likewise for the serialization methods.

I am pretty sure you will get tens of MB/s on a modern system, and that should be sufficient. An issue to investigate further is how to use multiple cores for the actual input: if you know for a fact that the encoding is a well-behaved subset, like plain UTF-8 or ISO-8859, then random access into the file might be possible, so that different regions can be sent to different cores.

Have fun, and tell us how it went ;)

Edit: Almost forgot: if you for some reason are the one creating the file in the first place, or you will be reading the parts after splitting, you will see HUGE performance gains using XML binarization; there exist XML Schema generators whose output can in turn be fed into code generators. (And some XSLT transform libs use code generation too.) Also run the JVM with the -server option.

My suggestion is that SAX, StAX, and DOM are not the ideal XML parsers for your problem. The perfect solution is called vtd-xml; there is an article on this subject explaining what DOM, SAX, and StAX all get wrong: http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html. The code below is the shortest you have to write, yet it performs 10x faster than DOM or SAX.

Here is a more recent paper, entitled Processing XML with Java – A Performance Benchmark: http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf

import com.ximpleware.*;
import java.io.*;
public class gandalf {
    public  static void main(String a[]) throws VTDException, Exception{
        VTDGen vg = new VTDGen();
        if (vg.parseFile("c:\\xml\\gandalf.txt", false)){
            VTDNav vn=vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/company/staff[nickname]");
            int i=-1;
            int count=0;
            while((i=ap.evalXPath())!=-1){
                vn.dumpFragment("c:\\xml\\staff"+count+".xml");
                count++;
            }
        }
    }

}

How to make it faster:

  1. Use asynchronous writes, possibly in parallel; this might boost performance if your disks are in some RAID-X setup.
  2. Write to an SSD instead of an HDD.
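As a rough illustration of the first point, the sketch below hands each already-serialized part to a small thread pool, so that file writes overlap instead of running one after another. The class `ParallelPartWriter`, its `writeAll` method, and the `prefix + index + ".xml"` naming are assumptions made for this example. Whether it actually helps depends on the storage, so measure before keeping it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of any library.
class ParallelPartWriter {

    /** Writes parts.get(i) to outputDir/prefix + i + ".xml" using a fixed-size thread pool. */
    static List<Path> writeAll(List<String> parts, Path outputDir, String prefix, int threads)
            throws IOException, InterruptedException, ExecutionException {
        Files.createDirectories(outputDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < parts.size(); i++) {
                final String content = parts.get(i);
                final Path target = outputDir.resolve(prefix + i + ".xml");
                // Each task writes one file; Files.write returns the path it wrote.
                Callable<Path> task =
                        () -> Files.write(target, content.getBytes(StandardCharsets.UTF_8));
                pending.add(pool.submit(task));
            }
            List<Path> written = new ArrayList<>();
            for (Future<Path> future : pending) {
                written.add(future.get()); // rethrows a failed write as ExecutionException
            }
            return written;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a streaming splitter you would submit each part as soon as it is produced rather than collecting all parts first, and bound the queue so that the parser cannot run far ahead of the disk.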

Here is a DOM-based solution. I have tested it with the XML you provided. It needs to be checked against the actual XML files that you have.

Since this is based on a DOM parser, please remember that it will require a lot of memory, depending on your XML file size. But it's much faster, as it's DOM based.

Algorithm:

  1. Parse the document.
  2. Extract the root element name.
  3. Get the list of nodes based on the split criteria (using XPath).
  4. For each node, create an empty document whose root element has the name extracted in step #2.
  5. Insert the node into this new document.
  6. Check whether nodes are to be filtered or not.
  7. If nodes are to be filtered, check whether the specified element is present in the newly created document.
  8. If the element is not present, don't write the document to a file.
  9. If the nodes are NOT to be filtered at all, skip the check in #7 and write the document to the file.

This can be run from the command prompt as follows:

java    XMLSplitter xmlFileLocation  splitElement filter filterElement

For the XML you mentioned it will be:

java    XMLSplitter input.xml  staff  true nickname

In case you don't want to filter:

java    XMLSplitter input.xml  staff 

Here is the complete Java code:

package com.xml.xpath;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.DOMException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XMLSplitter {

    DocumentBuilder builder = null;
    XPath xpath = null; 
    Transformer transformer = null;
    String filterElement;
    String splitElement;
    String xmlFileLocation;
    boolean filter = true;


    public static void main(String[] arg) throws Exception{

        XMLSplitter xMLSplitter = null;
        if(arg.length < 4){

            if(arg.length < 2){
                System.out.println("Insufficient arguments !!!");
                System.out.println("Usage: XMLSplitter xmlFileLocation  splitElement filter filterElement ");
                return;
            }else{
                System.out.println("Filter is off...");
                xMLSplitter = new XMLSplitter();
                xMLSplitter.init(arg[0],arg[1],false,null);
            }

        }else{
            xMLSplitter = new XMLSplitter();
            xMLSplitter.init(arg[0],arg[1],Boolean.parseBoolean(arg[2]),arg[3]);
        }



        xMLSplitter.start();    

    }

    public void init(String xmlFileLocation, String splitElement, boolean filter, String filterElement ) 
                throws ParserConfigurationException, TransformerConfigurationException{

        //Initialize the Document builder
        System.out.println("Initializing..");
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);   
        builder = domFactory.newDocumentBuilder();

        //Initialize the transformer
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING,"UTF-8");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");

        //Initialize the xpath
        XPathFactory factory = XPathFactory.newInstance();
        xpath = factory.newXPath();

        this.filterElement = filterElement;
        this.splitElement = splitElement;
        this.xmlFileLocation = xmlFileLocation;
        this.filter = filter;


    }   


    public void start() throws Exception{

            //Parse the file
            System.out.println("Parsing file.");
            Document doc = builder.parse(xmlFileLocation);

            //Get the root node name
            System.out.println("Getting root element.");
            XPathExpression rootElementexpr = xpath.compile("/");
            Object rootExprResult = rootElementexpr.evaluate(doc, XPathConstants.NODESET);
            NodeList rootNode = (NodeList) rootExprResult;          
            String rootNodeName = rootNode.item(0).getFirstChild().getNodeName();

            //Get the list of split elements
            XPathExpression expr = xpath.compile("//"+splitElement);
            Object result = expr.evaluate(doc, XPathConstants.NODESET);
            NodeList nodes = (NodeList) result;
            System.out.println("Total number of split nodes "+nodes.getLength());
            for (int i = 0; i < nodes.getLength(); i++) {
                //Wrap each node inside root of the parent xml doc
                Node sigleNode = wrappInRootElement(rootNodeName,nodes.item(i));
                //Get the XML string of the fragment
                String xmlFragment = serializeDocument(sigleNode);
                //System.out.println(xmlFragment);
                //Write the xml fragment in file.
                storeInFile(xmlFragment,i);         
            }

    }

    private  Node wrappInRootElement(String rootNodeName, Node fragmentDoc) 
                throws XPathExpressionException, ParserConfigurationException, DOMException, 
                        SAXException, IOException, TransformerException{

        //Create empty doc with just root node
        DOMImplementation domImplementation = builder.getDOMImplementation();
        Document doc = domImplementation.createDocument(null,null,null);
        Element theDoc = doc.createElement(rootNodeName);
        doc.appendChild(theDoc);

        //Insert the fragment inside the root node 
        InputSource inStream = new InputSource();     
        String xmlString = serializeDocument(fragmentDoc);
        inStream.setCharacterStream(new StringReader(xmlString));       
        Document fr = builder.parse(inStream);
        theDoc.appendChild(doc.importNode(fr.getFirstChild(),true));
        return doc;
    }

    private String serializeDocument(Node doc) throws TransformerException, XPathExpressionException{

        if(!serializeThisNode(doc)){
            return null;
        }

        DOMSource domSource = new DOMSource(doc);                
        StringWriter stringWriter = new StringWriter();
        StreamResult streamResult = new StreamResult(stringWriter);
        transformer.transform(domSource, streamResult);
        String xml = stringWriter.toString();
        return xml;

    }

    //Check whether node is to be stored in file or rejected based on input
    private boolean serializeThisNode(Node doc) throws XPathExpressionException{

         if(!filter){
             return true;
         }

         XPathExpression filterElementexpr = xpath.compile("//"+filterElement);
         Object result = filterElementexpr.evaluate(doc, XPathConstants.NODESET);
         NodeList nodes = (NodeList) result;

         //Keep the node only if at least one filter element was found
         return nodes.item(0) != null;
    }

    private void storeInFile(String content, int fileIndex) throws IOException{

        if(content == null || content.length() == 0){
            return;
        }

        String fileName = splitElement+fileIndex+".xml";

        File file = new File(fileName);
        if(file.exists()){
            System.out.println(" The file "+fileName+" already exists !! cannot create the file with the same name ");
            return;
        }
        //try-with-resources closes the writer even if the write fails
        try (FileWriter fileWriter = new FileWriter(file)) {
            fileWriter.write(content);
        }
        System.out.println("Generated file "+fileName);


    }

}
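The class above serializes every fragment to a string and parses it again before wrapping it in a root element. A more compact variant of the same algorithm is sketched below. The class name `DomSplitSketch` and its `split` method are made up for illustration. It pushes the filter into the XPath expression (`//staff[nickname]`) and copies each node with `importNode`. It assumes no namespaces and returns the parts as strings instead of writing files.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class DomSplitSketch {

    /** filterElement may be null, in which case every splitElement is kept. */
    static List<String> split(String xml, String splitElement, String filterElement)
            throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = builder.parse(new InputSource(new StringReader(xml)));
        String rootName = source.getDocumentElement().getNodeName();

        // The predicate [filterElement] keeps only nodes that have that child element.
        String expression = "//" + splitElement
                + (filterElement == null ? "" : "[" + filterElement + "]");
        NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, source, XPathConstants.NODESET);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            // New document with the same root name, holding a deep copy of one match.
            Document part = builder.newDocument();
            Element root = part.createElement(rootName);
            part.appendChild(root);
            root.appendChild(part.importNode(matches.item(i), true));

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(part), new StreamResult(out));
            parts.add(out.toString());
        }
        return parts;
    }
}
```

The memory caveat is unchanged: the whole input document is still held in memory while the parts are produced.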

Let me know if this works for you, or if you need any other help regarding this code.
