
Message: Invalid byte 1 of 1-byte UTF-8 sequence in hadoop

I'm parsing XML using Hadoop, and I got the code from here.

But I'm getting the following error:

FINISH_TIME="1385387129970" HOSTNAME="DEV140" ERROR="java.io.IOException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[18,3] Message: Invalid byte 1 of 1-byte UTF-8 sequence.

But my XML is encoded in UTF-8 only. So how can I handle it?

I suspect this is the problem, or at least a problem:

XMLStreamReader reader =
    XMLInputFactory.newInstance().createXMLStreamReader(new
        ByteArrayInputStream(document.getBytes()));

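That no-argument call to getBytes() uses the platform default encoding, which is not necessarily UTF-8. If the default is something else, the parser receives bytes that are not valid UTF-8 even though the original XML file was.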
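To see why the encoding matters, here is a minimal sketch that encodes the same string two ways and prints the bytes in hex. The class name GetBytesDemo, the helper hex, and the sample document are made up for illustration, and ISO-8859-1 stands in for whatever the platform default happens to be on the failing machine.

```java
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Hypothetical helper: renders a byte array as upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample document containing one non-ASCII character (U+00E9).
        String document = "<a>\u00e9</a>";

        // Assumed stand-in for a non-UTF-8 platform default encoding.
        byte[] latin1 = document.getBytes(StandardCharsets.ISO_8859_1);
        // What a parser that assumes UTF-8 actually expects.
        byte[] utf8 = document.getBytes(StandardCharsets.UTF_8);

        System.out.println(hex(latin1)); // 3C613EE93C2F613E
        System.out.println(hex(utf8));   // 3C613EC3A93C2F613E
    }
}
```

In the first output the character is the single byte E9 followed directly by 3C, which is not a valid UTF-8 sequence, so a parser that assumes UTF-8 would be expected to reject input like this.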

You could specify "utf-8" as the encoding name, but it would be simpler to create a StringReader:

XMLStreamReader reader = XMLInputFactory.newInstance()
    .createXMLStreamReader(new StringReader(document));

Of course that may not be the only error, but it's at least something to look at.
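Both options can be tried outside Hadoop with a small standalone sketch. The class XmlReaderDemo, its method names, and the sample document are hypothetical; only the XMLInputFactory and XMLStreamReader calls come from the standard javax.xml.stream API. It assumes the XML is already held in a String called document, as in the question's code.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XmlReaderDemo {
    // Hypothetical helper: concatenates all character data in the stream.
    static String readText(XMLStreamReader reader) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                text.append(reader.getText());
            }
        }
        reader.close();
        return text.toString();
    }

    // Option 1: hand the parser characters, so no byte encoding is involved.
    static String parseFromString(String document) {
        try {
            return readText(XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(document)));
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    // Option 2: encode explicitly as UTF-8 and tell the parser the encoding.
    static String parseFromUtf8Bytes(String document) {
        byte[] bytes = document.getBytes(StandardCharsets.UTF_8);
        try {
            return readText(XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(bytes), "UTF-8"));
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String document = "<name>caf\u00e9</name>"; // sample input, not from the question
        System.out.println(parseFromString(document));
        System.out.println(parseFromUtf8Bytes(document));
    }
}
```

The StringReader variant avoids the String-to-bytes-to-String round trip entirely, which is why it is the simpler choice when the document is already in memory as a String.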
