
Invalid UTF8 encoding on processing xml file

I have Java code that processes an XML file to read some values. It fails with the error "Invalid UTF8 encoding". When I copy the file contents into a new file in Notepad++, processing works fine, but if I only save the file under another name, I get the same error. Sorry, I cannot post the XML file here because it is too big; I will post only the header and trailer. Any help resolving this error is appreciated. My Java code to process the XML file:

XPathFactory f = XPathFactory.newInstance();
XPath x = f.newXPath();

// Each evaluation gets its own InputSource because a stream can only be read once.
InputSource source = new InputSource(new FileInputStream("C:\\Users\\cc\\eclipse-workspace\\data\\file.xml"));
InputSource source2 = new InputSource(new FileInputStream("C:\\Users\\cc\\eclipse-workspace\\data\\file.xml"));

XPathExpression trlr = x.compile("pers/trailer/text()");
XPathExpression hdr = x.compile("pers/header/CD/text()");

String s = trlr.evaluate(source);   // trailer value
String s2 = hdr.evaluate(source2);  // header value
System.out.println("header: " + s2 + " trailer: " + s);

pers is the root element of the XML file, which looks like this:

<?xml version = '1.0' encoding = 'UTF-8'?>
<pers>
 <header>555</header>
 .
 .
 .
 .
 <trailer>666</trailer>

</pers>

Stack trace:

java.io.UTFDataFormatException: Invalid UTF8 encoding.
    at oracle.xml.parser.v2.XMLUTF8Reader.checkUTF8Byte(XMLUTF8Reader.java:229)
    at oracle.xml.parser.v2.XMLUTF8Reader.readUTF8Char(XMLUTF8Reader.java:274)
    at oracle.xml.parser.v2.XMLUTF8Reader.fillBuffer(XMLUTF8Reader.java:189)
    at oracle.xml.parser.v2.XMLByteReader.saveBuffer(XMLByteReader.java:452)
    at oracle.xml.parser.v2.XMLReader.fillBuffer(XMLReader.java:2776)
    at oracle.xml.parser.v2.XMLReader.scanNameChars(XMLReader.java:1352)
    at oracle.xml.parser.v2.XMLReader.readQName(XMLReader.java:2149)
    at oracle.xml.parser.v2.NonValidatingParser.parseElement(NonValidatingParser.java:1579)
    at oracle.xml.parser.v2.NonValidatingParser.parseRootElement(NonValidatingParser.java:448)
    at oracle.xml.parser.v2.NonValidatingParser.parseDocument(NonValidatingParser.java:394)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:236)
    at oracle.xml.jaxp.JXDocumentBuilder.parse(JXDocumentBuilder.java:175)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:302)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:332)
    at tasklets.HeaderFooter.execute(HeaderFooter.java:39)
    at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:406)
    at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:330)
    at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:133)
    at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:272)
    at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:81)
    at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:374)
    at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
    at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:144)
    at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:257)
    at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:200)
    at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:148)
    at org.springframework.batch.core.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:64)
    at org.springframework.batch.core.job.flow.support.state.StepState.handle(StepState.java:67)
    at org.springframework.batch.core.job.flow.support.SimpleFlow.resume(SimpleFlow.java:169)
    at org.springframework.batch.core.job.flow.support.SimpleFlow.start(SimpleFlow.java:144)
    at org.springframework.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:134)
    at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:306)
    at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:135)
    at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
    at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:128)
    at main.IncomeResponseFile.main(IncomeResponseFile.java:39)
--------------- linked to ------------------
javax.xml.xpath.XPathExpressionException: java.io.UTFDataFormatException: Invalid UTF8 encoding.
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:305)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:332)
    at tasklets.HeaderFooter.execute(HeaderFooter.java:39)
    at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:406)
    at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:330)
    at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:133)
    at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:272)
    at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:81)
    at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:374)
    at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
    at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:144)
    at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:257)
    at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:200)
    at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:148)
    at org.springframework.batch.core.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:64)
    at org.springframework.batch.core.job.flow.support.state.StepState.handle(StepState.java:67)
    at org.springframework.batch.core.job.flow.support.SimpleFlow.resume(SimpleFlow.java:169)
    at org.springframework.batch.core.job.flow.support.SimpleFlow.start(SimpleFlow.java:144)
    at org.springframework.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:134)
    at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:306)
    at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:135)
    at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
    at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:128)
    at main.IncomeResponseFile.main(IncomeResponseFile.java:39)
Caused by: java.io.UTFDataFormatException: Invalid UTF8 encoding.
    at oracle.xml.parser.v2.XMLUTF8Reader.checkUTF8Byte(XMLUTF8Reader.java:229)
    at oracle.xml.parser.v2.XMLUTF8Reader.readUTF8Char(XMLUTF8Reader.java:274)
    at oracle.xml.parser.v2.XMLUTF8Reader.fillBuffer(XMLUTF8Reader.java:189)
    at oracle.xml.parser.v2.XMLByteReader.saveBuffer(XMLByteReader.java:452)
    at oracle.xml.parser.v2.XMLReader.fillBuffer(XMLReader.java:2776)
    at oracle.xml.parser.v2.XMLReader.scanNameChars(XMLReader.java:1352)
    at oracle.xml.parser.v2.XMLReader.readQName(XMLReader.java:2149)
    at oracle.xml.parser.v2.NonValidatingParser.parseElement(NonValidatingParser.java:1579)
    at oracle.xml.parser.v2.NonValidatingParser.parseRootElement(NonValidatingParser.java:448)
    at oracle.xml.parser.v2.NonValidatingParser.parseDocument(NonValidatingParser.java:394)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:236)
    at oracle.xml.jaxp.JXDocumentBuilder.parse(JXDocumentBuilder.java:175)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:302)
    ... 23 more

Use Java as a scripting tool to detect the problematic line.

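AtomicInteger lineno = new AtomicInteger();
Path path = Paths.get("... .xml");
// new String(b, UTF_8) never throws; it silently replaces malformed bytes with U+FFFD.
// A CharsetDecoder reports malformed input by default, so use that for the check.
CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder();
// ISO-8859-1 maps every byte to one char, so reading cannot fail and the bytes can be recovered.
try (Stream<String> lines = Files.lines(path, StandardCharsets.ISO_8859_1)) {
    lines.forEach(line -> {
        int no = lineno.incrementAndGet();
        byte[] b = line.getBytes(StandardCharsets.ISO_8859_1);
        try {
            utf8.decode(ByteBuffer.wrap(b));
        } catch (CharacterCodingException e) {
            System.out.printf("[%d] %s%n%s%n", no, line, e.getMessage());
            //throw new IllegalStateException(e);
        }
    });
}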

One may assume it is a data error.

In general it could also be faulty buffered reading: when a multi-byte sequence is split across a buffer boundary, two invalid half sequences can arise. This is unlikely in standard library code.
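To illustrate the buffer-boundary case, here is a minimal sketch. The class name BoundaryDemo and both method names are made up for this example, and the chunk size stands in for a reader's buffer size. Decoding each chunk on its own corrupts a sequence that straddles two chunks, while a single CharsetDecoder fed with endOfInput = false carries the incomplete tail over to the next chunk.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class BoundaryDemo {
    /** Faulty: each chunk is decoded independently, so a split sequence is damaged. */
    static String decodeChunksNaively(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Correct: one decoder keeps the undecoded tail of a chunk for the next round. */
    static String decodeChunksStreaming(byte[] data, int chunkSize) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder(); // reports errors by default
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 4);           // room for a leftover tail
        CharBuffer out = CharBuffer.allocate(data.length + 1);        // UTF-8 yields at most 1 char per byte
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            in.put(data, off, len);
            in.flip();
            CoderResult r = decoder.decode(in, out, false);           // more input may follow
            if (r.isError()) throw new IllegalStateException("malformed input in chunk at offset " + off);
            in.compact();                                             // keep the incomplete sequence, if any
        }
        in.flip();
        CoderResult last = decoder.decode(in, out, true);             // a dangling tail is now an error
        if (last.isError()) throw new IllegalStateException("truncated sequence at end of input");
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = "n\u00e9".getBytes(StandardCharsets.UTF_8);     // 6E C3 A9
        System.out.println(decodeChunksNaively(data, 2));             // sequence C3 A9 is split
        System.out.println(decodeChunksStreaming(data, 2));
    }
}
```

If the same file decodes cleanly with the streaming variant, the data itself is valid UTF-8 and the fault lies in how it was read.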


To make sure the decoding result is actually used and cannot be discarded by the JVM, maybe:

CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder(); // reports malformed input by default
int sowhat = Files.lines(path, StandardCharsets.ISO_8859_1)
    .mapToInt(line -> {
        int no = lineno.incrementAndGet();
        byte[] b = line.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // Unlike new String(b, UTF_8), the decoder throws instead of substituting U+FFFD.
            return strict.decode(ByteBuffer.wrap(b)).length();
        } catch (CharacterCodingException e) {
            System.out.printf("[%d] %s%n%s%n", no, line, e.getMessage());
            throw new IllegalStateException(e); // Must throw or return int
        }
    }).sum();
System.out.println("Ignore this: " + sowhat);

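Or illegal XML characters? In XML 1.0 the control characters [#x0-#x8] | [#xB-#xC] | [#xE-#x1F] are not allowed. The ranges [#x7F-#x84] | [#x86-#x9F] are only discouraged there (XML 1.1 restricts them), but they are still worth flagging: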

int sowhat = Files.lines(path, StandardCharsets.ISO_8859_1)
    .mapToInt(line -> {
        int no = lineno.incrementAndGet();
        byte[] b = line.getBytes(StandardCharsets.ISO_8859_1);
        if (!legal(b)) {
            // There is no exception here, so print only the line number and content.
            System.out.printf("[%d] %s%n", no, line);
            throw new IllegalStateException("Illegal XML character in line " + no);
        }
        return b.length; // Must throw or return int
    }).sum();

static boolean legal(byte[] bytes) {
    String s = new String(bytes, StandardCharsets.UTF_8);
    for (char ch : s.toCharArray()) {
        int x = ch;
        if ((0 <= x && x <= 8)               // ASCII control chars
                || (0xB <= x && x <= 0xC)
                || (0xE <= x && x <= 0x1F)
                || (0x7f <= x && x <= 0x84)  // DEL + Unicode control chars
                || (0x86 <= x && x <= 0x9F)) {
            return false;
        }
    }
    return true;
}

Should this not work, I have kept you long enough. Split the file and validate the parts.
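As an alternative to splitting by hand, the following sketch reports the byte offset of the first malformed UTF-8 sequence, which can then be inspected in a hex editor. The class and method names (Utf8Offset, firstInvalidOffset) are made up for this example, and it assumes the file fits in memory as a byte array (for instance via Files.readAllBytes).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class Utf8Offset {
    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if all bytes decode. */
    static long firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder(); // reports errors by default
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096);                   // scratch space, content is discarded
        while (true) {
            CoderResult r = decoder.decode(in, out, true);
            if (r.isError()) {
                return in.position();                                 // malformed bytes begin at this position
            }
            if (r.isUnderflow()) {
                return -1;                                            // all input consumed without error
            }
            out.clear();                                              // overflow: make room and continue
        }
    }

    public static void main(String[] args) {
        // 0xE9 is "é" in ISO-8859-1 but an incomplete lead byte in UTF-8.
        byte[] bad = "ab\u00e9cdef".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstInvalidOffset(bad));
    }
}
```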

I used this code to convert the file from ISO-8859-1 to UTF-8:

File source = new File("C:\\Users\\cc\\eclipse-workspace\\data\\file.xml");
String srcEncoding = "ISO-8859-1";
File target = new File("C:\\Users\\cc\\eclipse-workspace\\data\\file2.xml");
String tgtEncoding = "UTF-8";
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(source), srcEncoding));
     BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(target), tgtEncoding))) {
    // Copy in character blocks: decode as ISO-8859-1, encode as UTF-8.
    char[] buffer = new char[16384];
    int read;
    while ((read = br.read(buffer)) != -1) {
        bw.write(buffer, 0, read);
    }
}

After that I used file2 and it worked. Thanks to: java: how to convert a file to utf8
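A possible alternative that avoids writing a second file, sketched under the assumption that the file really is ISO-8859-1 despite its UTF-8 declaration: hand the parser a Reader with the actual charset. According to the InputSource documentation, a parser given a character stream reads it directly and disregards the encoding declaration. The class and method names here (XPathWithCharset, evaluate) are made up for the example, and the behavior was not checked against the Oracle parser from the stack trace.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathWithCharset {
    /** Evaluates an XPath expression on a file, decoding it with the given charset. */
    static String evaluate(Path file, Charset actualCharset, String expression)
            throws IOException, XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The Reader does the decoding, so the parser never sees the raw bytes.
        try (Reader reader = Files.newBufferedReader(file, actualCharset)) {
            return xpath.evaluate(expression, new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample file: declared as UTF-8 but written as ISO-8859-1.
        Path tmp = Files.createTempFile("sample", ".xml");
        String xml = "<?xml version='1.0' encoding='UTF-8'?>"
                + "<pers><header>caf\u00e9</header><trailer>666</trailer></pers>";
        Files.write(tmp, xml.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(evaluate(tmp, StandardCharsets.ISO_8859_1, "pers/trailer/text()"));
        Files.delete(tmp);
    }
}
```

This fixes the symptom on the reading side only; correcting the encoding where the file is produced remains the cleaner fix.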
