
SAX: XML document structures must start and end within the same entity

I'm trying to parse fairly large XML files using javax.xml.stream.XMLStreamReader. The files are well-formed (validated with xmllint), but I still get the following exception:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[12418,95]
Message: XML document structures must start and end within the same entity.
at     com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:592)

This is a simplification of my code:

while (parser.hasNext()) {
    // next() returns the event type, so there is no need to call getEventType() again
    int event = parser.next();
    if (event == XMLStreamReader.START_ELEMENT) {
        // compare strings with equals(); == only tests reference identity
        if ("s".equals(parser.getLocalName())) {
            // do stuff
        }
    } else if (event == XMLStreamReader.END_ELEMENT) {
        if ("s".equals(parser.getLocalName())) {
            // do more stuff
        }
    } else if (event == XMLStreamReader.CHARACTERS) {
        if (inSentenceElement) {
            // process text
            parser.getText()...
        }
    }
}

I've checked the row/col given in the error message in the XML, and nothing there strikes me as unusual. I suspect the size of the files might be the problem: perhaps the input gets truncated, so that an EOF is read before the root element is closed. Is that plausible, and if so, how can I avoid it?
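One way to test the truncation hypothesis is to feed the parser a deliberately cut-off document and compare the resulting error with the one from the real files. The following is only a minimal sketch: the class name TruncationCheck, the helper firstError and the sample XML are made up for illustration, and it reads from an in-memory string rather than a bz2 stream.

```java
import java.io.StringReader;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the original code.
class TruncationCheck {

    /**
     * Reads every event from the given XML text. Returns null if the parser
     * reaches the end without complaint, otherwise a one-line description of
     * where it failed.
     */
    static String firstError(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader parser = factory.createXMLStreamReader(new StringReader(xml));
            while (parser.hasNext()) {
                parser.next();
            }
            parser.close();
            return null;
        } catch (XMLStreamException e) {
            Location loc = e.getLocation();
            if (loc == null) {
                return "no location: " + e.getMessage();
            }
            return "row " + loc.getLineNumber()
                    + ", col " + loc.getColumnNumber()
                    + ", offset " + loc.getCharacterOffset()
                    + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String complete = "<root><s>first</s><s>second</s></root>";
        // Cut the document off in the middle of the second <s> element.
        String truncated = complete.substring(0, complete.length() - 12);
        System.out.println("complete:  " + firstError(complete));
        System.out.println("truncated: " + firstError(truncated));
    }
}
```

If the truncated input produces the same message as the real files, and the reported offset sits close to the end of what the decompression layer actually delivered, that would point at the input stream rather than at the XML itself.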

Edit: the bz2-compressed files are up to 1.5G in size with up to 7M lines, but relatively small files of 4M also crash after around 10K lines (although the number of lines after which the problem occurs varies by some 3K lines).

Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,4207737]
Message: Attribute name "i" associated with an element type "someElement" must be followed by the ' = ' character.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StAXStreamConnector.bridge(StAXStreamConnector.java:181)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:355)
    ... 49 more

The attribute in the actual XML is index="1", so the document is valid, but the parser appears to see the attribute name truncated. The same code and XML worked with Java 1.7.0u51, but fail with the above exception on 1.7.0u71. The error is always reported at the same column (CharacterOffset = 4207736) for that file. I'm using JAXB, which calls the StAX parser during unmarshalling, and nothing has changed other than the Java version.

I would recommend checking the XML processing limits that were recently added to the JDK to mitigate denial-of-service attacks; adjusting them fixed the problem in my case. See https://docs.oracle.com/javase/tutorial/jaxp/limits/using.html

Specifically, adding the following system properties to the java command line disables all of the limits. I would STRONGLY recommend finding better values (or identifying the specific limit that causes your problem) instead of turning them all off with 0.

java -Djdk.xml.entityExpansionLimit=0 -Djdk.xml.elementAttributeLimit=0 -Djdk.xml.maxOccurLimit=0 -Djdk.xml.totalEntitySizeLimit=0 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=0 -Djdk.xml.maxElementDepth=0 -jar myJarfile.jar
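To narrow down which limit matters, a single property can also be set from code instead of passing all seven on the command line. This is a rough sketch under several assumptions: the class XmlLimitSketch and the method countElements are invented for illustration, the choice of jdk.xml.totalEntitySizeLimit and the value 200000000 are arbitrary, and it assumes the property is set before the XMLInputFactory is created so that the parser picks it up.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class for experimenting with one limit at a time.
class XmlLimitSketch {

    /**
     * Overrides one JAXP limit for this JVM, then counts the START_ELEMENT
     * events in the given XML text.
     */
    static int countElements(String xml, String limitName, String limitValue)
            throws XMLStreamException {
        // Intended to mirror -D<limitName>=<limitValue> on the command line;
        // set it before the factory is created (assumption, see lead-in).
        System.setProperty(limitName, limitValue);

        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader parser = factory.createXMLStreamReader(new StringReader(xml));
        int count = 0;
        try {
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamReader.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            parser.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><s>a</s><s>b</s></root>";
        // Arbitrary example: raise one limit to a finite value instead of 0.
        int n = countElements(xml, "jdk.xml.totalEntitySizeLimit", "200000000");
        System.out.println("elements: " + n);
    }
}
```

Trying the properties one at a time against the failing file, with finite values rather than 0, should show which limit is actually responsible.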
