
How to parse a large, complex XML file

I need to parse a large, complex XML file and write it to a flat file. Can you give me some advice?

File size: 500 MB. Record count: 100K. XML structure:

<Msg>

    <MsgHeader>
        <!--Some of the fields in the MsgHeader need to be mapped to a Java object-->
    </MsgHeader>

    <GroupA> 
        <GroupAHeader/>
        <!--Some of the fields in the GroupAHeader need to be mapped to a Java object-->
        <GroupAMsg/>
        <!--50K records--> 
        <GroupAMsg/> 
        <GroupAMsg/> 
        <GroupAMsg/> 
    </GroupA>

    <GroupB> 
        <GroupBHeader/> 
        <GroupBMsg/>
        <!--50K records--> 
        <GroupBMsg/> 
        <GroupBMsg/> 
        <GroupBMsg/> 
    </GroupB>

</Msg>

Within Spring Batch, I've written my own StAX event item reader implementation that operates a bit more specifically than previously mentioned. Basically, I just stuff elements into a map and then pass them into the ItemProcessor. From there, you're free to transform the "GatheredElement" into a single object (see CompositeItemProcessor). Apologies for the bit of copy/paste from StaxEventItemReader, but I don't think it's avoidable.

From here, you're free to use whatever OXM marshaller you'd like; I happen to use JAXB as well.

public class ElementGatheringStaxEventItemReader<T> extends StaxEventItemReader<T> {
    private Map<String, String> gatheredElements;
    private Set<String> elementsToGather;
    ...
    @Override
    protected boolean moveCursorToNextFragment(XMLEventReader reader) throws NonTransientResourceException {
        try { 
            while (true) {
                while (reader.peek() != null && !reader.peek().isStartElement()) {
                    reader.nextEvent();
                }
                if (reader.peek() == null) {
                    return false;
                }
                QName startElementName = ((StartElement) reader.peek()).getName();
                if(elementsToGather.contains(startElementName.getLocalPart())) {
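                    reader.nextEvent(); // move past the actual start element
                    XMLEvent dataEvent = reader.peek();
                    // only gather text; an empty element has no character data to cast
                    if (dataEvent != null && dataEvent.isCharacters()) {
                        gatheredElements.put(startElementName.getLocalPart(), reader.nextEvent().asCharacters().getData());
                    }
                    continue;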
                }
                if (startElementName.getLocalPart().equals(fragmentRootElementName)) {
                    if (fragmentRootElementNameSpace == null || startElementName.getNamespaceURI().equals(fragmentRootElementNameSpace)) {
                        return true;
                    }
                }
                reader.nextEvent();

            }
        } catch (XMLStreamException e) {
            throw new NonTransientResourceException("Error while reading from event reader", e);
        }
    }

    @SuppressWarnings("unchecked")
    @Override
    protected T doRead() throws Exception {
        T item = super.doRead();
        if(null == item)
            return null;
        T result = (T) new GatheredElementItem<T>(item, new HashedMap(gatheredElements));
        if(log.isDebugEnabled())
            log.debug("Read GatheredElementItem: " + result);
        return result; 
    }
}

The gathered element class is pretty basic:

public class GatheredElementItem<T> {
    private final T item;
    private final Map<String, String> gatheredElements;
    ...
}

I haven't dealt with files this large, but since you want to parse the XML and write it to a flat file, I'd guess that a combination of XML pull parsing and careful code for writing the flat file (this might help) is the way to go, because we don't want to exhaust the Java heap. A quick Google search will turn up tutorials and sample code on XML pull parsing.
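As a rough illustration of the pull-parsing idea, here is a minimal sketch using the JDK's StAX cursor API (XMLStreamReader). It assumes that each record element (GroupAMsg in the question's structure) contains only text and that one line per record is an acceptable flat-file layout. The class name PullToFlatFile and the convert method are made up for this example.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PullToFlatFile {

    /** Streams the XML and writes one line per record element; returns the record count. */
    public static int convert(Reader xml, Writer out, String recordElement)
            throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: the input needs no DTD or external entities, so both are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && recordElement.equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag;
                    // it fails if the element has child elements (see assumption above).
                    out.write(reader.getElementText().trim());
                    out.write('\n');
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Msg><GroupA><GroupAHeader/><GroupAMsg>a1</GroupAMsg>"
                + "<GroupAMsg>a2</GroupAMsg></GroupA></Msg>";
        StringWriter out = new StringWriter();
        int n = convert(new StringReader(xml), out, "GroupAMsg");
        System.out.print(out);
        System.out.println(n + " records");
    }
}

For the real 500 MB file you would pass a buffered file reader and a BufferedWriter instead of the in-memory strings. Only the current record is held in memory at any time, which is the point of the approach.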

In the end, I implemented a customized StaxEventItemReader.

  1. Configure fragmentRootElementName.

  2. Configure my own manualHandleElement property:

     <property name="manualHandleElement">
         <list>
             <map>
                 <entry>
                     <key><value>startElementName</value></key>
                     <value>GroupA</value>
                 </entry>
                 <entry>
                     <key><value>endElementName</value></key>
                     <value>GroupAHeader</value>
                 </entry>
                 <entry>
                     <key><value>elementNameList</value></key>
                     <list>
                         <value>/GroupAHeader/Info1</value>
                         <value>/GroupAHeader/Info2</value>
                     </list>
                 </entry>
             </map>
         </list>
     </property>

  3. Add the following fragment to MyStaxEventItemReader.doRead():

    while (true) {
        if (reader.peek() != null && reader.peek().isStartElement()) {
            pathList.add("/" + ((StartElement) reader.peek()).getName().getLocalPart());
            reader.nextEvent();
            continue;
        }
        if (reader.peek() != null && reader.peek().isEndElement()) {
            pathList.remove("/" + ((EndElement) reader.peek()).getName().getLocalPart());
            if (isManualHandleEndElement(((EndElement) reader.peek()).getName().getLocalPart())) {
                pathList.clear();
                reader.nextEvent();
                break;
            }
            reader.nextEvent();
            continue;
        }
        if (reader.peek() != null && reader.peek().isCharacters()) {
            CharacterEvent charEvent = (CharacterEvent) reader.nextEvent();
            String currentPath = getCurrentPath(pathList);
            String startElementName = (String) currentManualHandleStartElement.get(MANUAL_HANDLE_START_ELEMENT_NAME);
            for (Object s : (List) currentManualHandleStartElement.get(MANUAL_HANDLE_ELEMENT_NAME_LIST)) {
                if (("/" + startElementName + s).equals(currentPath)) {
                    map.put(getCurrentPath(pathList), charEvent.getData());
                    break;
                }
            }
            continue;
        }
        reader.nextEvent();
    }
    }
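The fragment above relies on a path list and a getCurrentPath helper that are not shown. As a guess at what such bookkeeping could look like, here is a small stand-alone sketch. The ElementPath class and its method names are hypothetical, and it uses a stack rather than a list: if pathList is a java.util.List, remove(Object) drops the first matching entry, which may not be the innermost element when the same name is nested.

import java.util.ArrayDeque;
import java.util.Deque;

public class ElementPath {
    // innermost element is at the tail of the deque
    private final Deque<String> stack = new ArrayDeque<>();

    /** Call on every start element. */
    public void enter(String localName) {
        stack.addLast(localName);
    }

    /** Call on every end element; always pops the innermost element. */
    public void leave() {
        stack.removeLast();
    }

    /** Returns the current location, for example "/GroupA/GroupAHeader/Info1". */
    public String current() {
        StringBuilder sb = new StringBuilder();
        for (String name : stack) { // iterates from the root to the innermost element
            sb.append('/').append(name);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ElementPath path = new ElementPath();
        path.enter("GroupA");
        path.enter("GroupAHeader");
        path.enter("Info1");
        System.out.println(path.current());
        path.leave();
        System.out.println(path.current());
    }
}

The resulting string can then be compared against the configured entries, in the same way the fragment compares "/" + startElementName + s with the current path.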

Give an ETL tool a try, for example Pentaho Data Integration (AKA Kettle).

If you'd accept a solution outside JAXB/Spring Batch, you may want to have a look at the SAX parser.

This is a more event-oriented way of parsing XML files and may be a good approach when you want to write directly into the target file while parsing. The SAX parser does not read the whole XML content into memory; it triggers callback methods as it encounters elements in the input stream. In my experience, this is a very memory-efficient way of processing.

In comparison to your StAX solution, SAX 'pushes' the data into your application. This means that you have to maintain the state yourself (such as which tag you are currently in), so you have to keep track of your current location. I'm not sure if that is something you really require.

The following example reads an XML file with your structure and prints out all text within GroupBMsg tags:

import java.io.FileReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class SaxExample implements ContentHandler
{
    // text collected for the GroupBMsg element currently being read
    private final StringBuilder currentValue = new StringBuilder();
    private boolean insideGroupBMsg;

    public static void main(final String[] args) throws Exception
    {
        // note: XMLReaderFactory is deprecated since Java 9 but still available
        final XMLReader xmlReader = XMLReaderFactory.createXMLReader();

        try (FileReader reader = new FileReader("datasource.xml"))
        {
            final InputSource inputSource = new InputSource(reader);

            xmlReader.setContentHandler(new SaxExample());
            xmlReader.parse(inputSource);
        }
    }

    @Override
    public void characters(final char[] ch, final int start, final int length) throws SAXException
    {
        // the parser may deliver the text of one element in several chunks, so append
        if (insideGroupBMsg)
        {
            currentValue.append(ch, start, length);
        }
    }

    @Override
    public void startElement(final String uri, final String localName, final String qName, final Attributes atts) throws SAXException
    {
        // react on the beginning of tag "GroupBMsg" <GroupBMsg>
        if (localName.equals("GroupBMsg"))
        {
            insideGroupBMsg = true;
            currentValue.setLength(0);
        }
    }

    @Override
    public void endElement(final String uri, final String localName, final String qName) throws SAXException
    {
        // react on the ending of tag "GroupBMsg" </GroupBMsg>
        if (localName.equals("GroupBMsg"))
        {
            insideGroupBMsg = false;
            // TODO: write into file
            System.out.println(currentValue);
        }
    }


    // the rest is boilerplate code for sax

    @Override
    public void endDocument() throws SAXException {}
    @Override
    public void endPrefixMapping(final String prefix) throws SAXException {}
    @Override
    public void ignorableWhitespace(final char[] ch, final int start, final int length)
        throws SAXException {}
    @Override
    public void processingInstruction(final String target, final String data)
        throws SAXException {}
    @Override
    public void setDocumentLocator(final Locator locator) {  }
    @Override
    public void skippedEntity(final String name) throws SAXException {}
    @Override
    public void startDocument() throws SAXException {}
    @Override
    public void startPrefixMapping(final String prefix, final String uri)
      throws SAXException {}
}
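The state tracking mentioned above can be sketched a little further. The following is only an illustration under assumptions, not a drop-in solution. It assumes that GroupAHeader contains an Info1 child element (as in the configuration shown in another answer), that GroupAMsg holds plain text, and that a "header|message" line is the wanted flat-file format. The class name HeaderAwareHandler and the toFlat method are invented for the example. It extends DefaultHandler to avoid the boilerplate callbacks and uses qName because the default SAXParserFactory is not namespace-aware.

import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HeaderAwareHandler extends DefaultHandler {
    private final Writer out;
    private final StringBuilder text = new StringBuilder();
    private String headerInfo = ""; // state: last Info1 value seen (assumed header field)

    public HeaderAwareHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("Info1".equals(qName)) {
            headerInfo = text.toString().trim(); // remember the header for later records
        } else if ("GroupAMsg".equals(qName)) {
            try {
                // write each record immediately instead of keeping it in memory
                out.write(headerInfo + "|" + text.toString().trim() + "\n");
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
        text.setLength(0);
    }

    /** Convenience method for small inputs: parses a string and returns the flat output. */
    public static String toFlat(String xml) throws Exception {
        StringWriter out = new StringWriter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new HeaderAwareHandler(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(toFlat("<Msg><GroupA><GroupAHeader><Info1>H1</Info1></GroupAHeader>"
                + "<GroupAMsg>m1</GroupAMsg><GroupAMsg>m2</GroupAMsg></GroupA></Msg>"));
    }
}

For the large file, the handler would be given a BufferedWriter on the target file and the parser an InputSource on the input file, so only the header value and the current record stay in memory.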

You can use the Declarative Stream Mapping (DSM) stream parsing library. It can process both JSON and XML. It doesn't load the whole XML file into memory. DSM only processes the data that you define in a YAML or JSON config.

You can call a method while reading the XML, which allows you to process the XML partially. You can deserialize this partially read XML data into Java objects.

You can even use it to read in multiple threads.

You can find good examples in these answers:

Unmarshalling XML to three lists of different objects using STAX Parser

JAVA - Best approach to parse huge (extra large) JSON file (same for XML)
