Invalid UTF8 encoding on processing xml file

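I have Java code that processes an XML file to read some values. I get an error: Invalid UTF8 encoding. I tried copying the file contents into another file in Notepad++, and processing that copy works fine, but if I only save the file under another name I get the same error. Sorry, I can't post the XML file here because it is too large; I will only show the header and trailer. Thanks for any help in resolving this error. My Java code that processes the XML file: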

XPathFactory f = XPathFactory.newInstance();
XPath x = f.newXPath();

// Each evaluate() call consumes its stream, so the file is opened twice.
InputSource source = new InputSource(new FileInputStream("C:\\Users\\cc\\eclipse-workspace\\data\\file.xml"));
InputSource source2 = new InputSource(new FileInputStream("C:\\Users\\cc\\eclipse-workspace\\data\\file.xml"));

XPathExpression trlr = x.compile("pers/trailer/text()");
XPathExpression hdr = x.compile("pers/header/CD/text()");

String s = trlr.evaluate(source);   // trailer value
String s2 = hdr.evaluate(source2);  // header value
System.out.println("header: " + s2 + " trailer: " + s);

pers is the root tag of the XML file.

The XML file looks like this:

<?xml version = '1.0' encoding = 'UTF-8'?>
<pers>
 <header>555</header>
 .
 .
 .
 .
 <trailer>666</trailer>

</pers>

Stack trace:

java.io.UTFDataFormatException: Invalid UTF8 encoding.
    at oracle.xml.parser.v2.XMLUTF8Reader.checkUTF8Byte(XMLUTF8Reader.java:229)
    at oracle.xml.parser.v2.XMLUTF8Reader.readUTF8Char(XMLUTF8Reader.java:274)
    at oracle.xml.parser.v2.XMLUTF8Reader.fillBuffer(XMLUTF8Reader.java:189)
    at oracle.xml.parser.v2.XMLByteReader.saveBuffer(XMLByteReader.java:452)
    at oracle.xml.parser.v2.XMLReader.fillBuffer(XMLReader.java:2776)
    at oracle.xml.parser.v2.XMLReader.scanNameChars(XMLReader.java:1352)
    at oracle.xml.parser.v2.XMLReader.readQName(XMLReader.java:2149)
    at oracle.xml.parser.v2.NonValidatingParser.parseElement(NonValidatingParser.java:1579)
    at oracle.xml.parser.v2.NonValidatingParser.parseRootElement(NonValidatingParser.java:448)
    at oracle.xml.parser.v2.NonValidatingParser.parseDocument(NonValidatingParser.java:394)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:236)
    at oracle.xml.jaxp.JXDocumentBuilder.parse(JXDocumentBuilder.java:175)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:302)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:332)
    at tasklets.HeaderFooter.execute(HeaderFooter.java:39)
    at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:406)
    at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:330)
    at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:133)
    at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:272)
    at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:81)
    at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:374)
    at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
    at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:144)
    at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:257)
    at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:200)
    at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:148)
    at org.springframework.batch.core.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:64)
    at org.springframework.batch.core.job.flow.support.state.StepState.handle(StepState.java:67)
    at org.springframework.batch.core.job.flow.support.SimpleFlow.resume(SimpleFlow.java:169)
    at org.springframework.batch.core.job.flow.support.SimpleFlow.start(SimpleFlow.java:144)
    at org.springframework.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:134)
    at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:306)
    at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:135)
    at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
    at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:128)
    at main.IncomeResponseFile.main(IncomeResponseFile.java:39)
--------------- linked to ------------------
javax.xml.xpath.XPathExpressionException: java.io.UTFDataFormatException: Invalid UTF8 encoding.
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:305)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:332)
    at tasklets.HeaderFooter.execute(HeaderFooter.java:39)
    at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:406)
    at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:330)
    at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:133)
    at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:272)
    at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:81)
    at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:374)
    at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
    at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:144)
    at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:257)
    at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:200)
    at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:148)
    at org.springframework.batch.core.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:64)
    at org.springframework.batch.core.job.flow.support.state.StepState.handle(StepState.java:67)
    at org.springframework.batch.core.job.flow.support.SimpleFlow.resume(SimpleFlow.java:169)
    at org.springframework.batch.core.job.flow.support.SimpleFlow.start(SimpleFlow.java:144)
    at org.springframework.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:134)
    at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:306)
    at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:135)
    at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
    at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:128)
    at main.IncomeResponseFile.main(IncomeResponseFile.java:39)
Caused by: java.io.UTFDataFormatException: Invalid UTF8 encoding.
    at oracle.xml.parser.v2.XMLUTF8Reader.checkUTF8Byte(XMLUTF8Reader.java:229)
    at oracle.xml.parser.v2.XMLUTF8Reader.readUTF8Char(XMLUTF8Reader.java:274)
    at oracle.xml.parser.v2.XMLUTF8Reader.fillBuffer(XMLUTF8Reader.java:189)
    at oracle.xml.parser.v2.XMLByteReader.saveBuffer(XMLByteReader.java:452)
    at oracle.xml.parser.v2.XMLReader.fillBuffer(XMLReader.java:2776)
    at oracle.xml.parser.v2.XMLReader.scanNameChars(XMLReader.java:1352)
    at oracle.xml.parser.v2.XMLReader.readQName(XMLReader.java:2149)
    at oracle.xml.parser.v2.NonValidatingParser.parseElement(NonValidatingParser.java:1579)
    at oracle.xml.parser.v2.NonValidatingParser.parseRootElement(NonValidatingParser.java:448)
    at oracle.xml.parser.v2.NonValidatingParser.parseDocument(NonValidatingParser.java:394)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:236)
    at oracle.xml.jaxp.JXDocumentBuilder.parse(JXDocumentBuilder.java:175)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:302)
    ... 23 more

Write a small Java program to detect the offending lines.

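import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicInteger;

AtomicInteger lineno = new AtomicInteger();
Path path = Paths.get("... .xml");
// ISO-8859-1 maps every byte to exactly one char, so reading never fails
// and getBytes(ISO_8859_1) restores the raw bytes of each line.
Files.lines(path, StandardCharsets.ISO_8859_1)
    .forEach(line -> {
        int no = lineno.incrementAndGet();
        byte[] b = line.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // new String(b, UTF_8) never throws (it substitutes U+FFFD),
            // so a strict decoder is used to report malformed bytes.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(b));
        } catch (CharacterCodingException e) {
            System.out.printf("[%d] %s%n%s%n", no, line, e.getMessage());
            //throw new IllegalStateException(e);
        }
    });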

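One would probably conclude that this is a data error.

In general it could also be a faulty buffered read: when a multi-byte sequence is broken across a buffer boundary, two invalid half-sequences can result. That is unlikely in standard library code.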
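The buffer-boundary effect can be reproduced in isolation. The following is only a minimal sketch: the class name `SplitSequenceDemo` and the helper `isValidUtf8` are made up for illustration, and the "buffer boundary" is simulated by cutting a two-byte character in half by hand.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SplitSequenceDemo {

    /** Returns true if the bytes decode as strict UTF-8 without any substitution. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input instead of replacing it.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+00E9 is encoded in UTF-8 as the two bytes 0xC3 0xA9.
        byte[] whole = "\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] firstHalf = Arrays.copyOfRange(whole, 0, 1);   // lead byte only
        byte[] secondHalf = Arrays.copyOfRange(whole, 1, 2);  // continuation byte only

        System.out.println("whole valid:       " + isValidUtf8(whole));
        System.out.println("first half valid:  " + isValidUtf8(firstHalf));
        System.out.println("second half valid: " + isValidUtf8(secondHalf));
    }
}
```

Decoded together the two bytes are fine; decoded separately each half is rejected, which is what a reader that refills its buffer in the middle of a character without carrying over the pending bytes would see.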


To make sure the decoding call is not optimized away by the JVM, it could be written as:

int sowhat = Files.lines(path, StandardCharsets.ISO_8859_1)
    .mapToInt(line -> {
        int no = lineno.incrementAndGet();
        byte[] b = line.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A strict decoder throws on malformed bytes;
            // new String(b, UTF_8) would silently substitute U+FFFD.
            return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(b)).length();
        } catch (CharacterCodingException e) {
            System.out.printf("[%d] %s%n%s%n", no, line, e.getMessage());
            throw new IllegalStateException(e); // Must throw or return int
        }
    }).sum();
System.out.println("Ignore this: " + sowhat);

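Illegal XML characters (in version 1.0)? [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]. Of these, XML 1.0 actually forbids the control characters below #x20 other than tab, LF and CR; #x7F-#x84 and #x86-#x9F are permitted but discouraged there.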

int sowhat = Files.lines(path, StandardCharsets.ISO_8859_1)
    .mapToInt(line -> {
        int no = lineno.incrementAndGet();
        byte[] b = line.getBytes(StandardCharsets.ISO_8859_1);
        if (!legal(b)) {
            System.out.printf("[%d] %s%n", no, line);
            throw new IllegalStateException("Illegal XML character in line " + no);
        }
        return b.length; // Must throw or return int
    }).sum();

static boolean legal(byte[] bytes) {
    String s = new String(bytes, StandardCharsets.UTF_8);
    for (char ch : s.toCharArray()) {
        int x = ch;
        if ((0 <= x && x <= 8)               // ASCII control chars
                || (0xB <= x && x <= 0xC)
                || (0xE <= x && x <= 0x1F)
                || (0x7f <= x && x <= 0x84)  // DEL + Unicode control chars
                || (0x86 <= x && x <= 0x9F)) {
            return false;
        }
    }
    return true;
}

If this does not work, I have kept you long enough. Split the file and validate the parts.
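As a possible shortcut to splitting by hand, a strict decoder can report the byte offset at which decoding first fails. This is a sketch under assumptions: the class `Utf8Offset` and the method `firstInvalidOffset` are hypothetical names, and the whole input is assumed to fit in memory as a byte array (for example from `Files.readAllBytes`).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class Utf8Offset {

    /** Returns the offset of the first byte that is not valid UTF-8, or -1 if everything decodes. */
    static int firstInvalidOffset(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder(); // reports errors by default
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On an error the input position marks the start of the bad bytes.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: drop the decoded chars and keep going
        }
    }

    public static void main(String[] args) {
        // 0xE9 is "é" in ISO-8859-1 but starts an incomplete sequence in UTF-8.
        byte[] latin1 = {'a', 'b', (byte) 0xE9, 'c', 'd', 'e'};
        byte[] utf8 = "ab\u00e9cde".getBytes(StandardCharsets.UTF_8);
        System.out.println("latin1 input: " + firstInvalidOffset(latin1));
        System.out.println("utf8 input:   " + firstInvalidOffset(utf8));
    }
}
```

Looking at the bytes around the reported offset in a hex viewer usually shows whether the file is really in a single-byte encoding or merely contains a few damaged characters.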

I converted the file to UTF-8 using the following code:

File source = new File("C:\\Users\\cc\\eclipse-workspace\\data\\file.xml");
String srcEncoding = "ISO-8859-1";
File target = new File("C:\\Users\\cc\\eclipse-workspace\\data\\file2.xml");
String tgtEncoding = "UTF-8";
try (
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(source), srcEncoding));
    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(target), tgtEncoding))) {
    char[] buffer = new char[16384];
    int read;
    while ((read = br.read(buffer)) != -1) {
        bw.write(buffer, 0, read);
    }
}

After that I used file2. Thanks to: Java: how to convert a file to UTF-8
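If writing a converted copy is not wanted, an alternative that may work is to hand the parser a character stream that is already decoded with the real encoding. Under the SAX `InputSource` contract a character stream takes precedence over the encoding named in the XML declaration. The sketch below is hedged: `ReaderSourceDemo` and `evaluate` are made-up names, the data really being ISO-8859-1 is an assumption, and it has been reasoned through only for the JDK's built-in parser, not for the Oracle parser seen in the stack trace.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class ReaderSourceDemo {

    /** Evaluates an XPath expression over bytes that are decoded as ISO-8859-1. */
    static String evaluate(byte[] xmlBytes, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), StandardCharsets.ISO_8859_1)) {
            // The parser receives chars, so it does not decode bytes itself.
            return xpath.evaluate(expression, new InputSource(reader));
        } catch (IOException | XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The declaration claims UTF-8, but the bytes are ISO-8859-1.
        String doc = "<?xml version='1.0' encoding='UTF-8'?>"
                + "<pers><header>555</header><trailer>caf\u00e9</trailer></pers>";
        byte[] xml = doc.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(evaluate(xml, "pers/trailer/text()"));
    }
}
```

For a real file the `ByteArrayInputStream` would be replaced by a `FileInputStream`; the trade-off is that the declared encoding and the actual bytes stay inconsistent, so other tools reading the file will still fail.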

