
Apache Camel to handle encoding declared in XML file

I'm trying to parse a UTF-16 encoded document using the Apache Camel Splitter with xtokenize, which delegates to Woodstox (com.ctc.wstx.sr.BasicStreamReader). I cannot know the encoding of a file before I read it; currently some files are UTF-16 and others are UTF-8:

.split().xtokenize(getToken(), 'w', NAMESPACES)

The problem I encounter is that Camel tells Woodstox which encoding to use:

String charset = IOHelper.getCharsetName(exchange);

It sets the default, UTF-8, as the encoding, so BasicStreamReader tries to read the BOM bytes as UTF-8 and fails with:

com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '�' (code 65533 / 0xfffd) in prolog; expected '<'
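The character code in that message can be reproduced without Camel or Woodstox. The sketch below (class and method names are made up for illustration) encodes a small document with Java's UTF-16 charset, which writes a big-endian BOM, and then decodes the bytes as UTF-8. The BOM bytes 0xFE 0xFF are not valid UTF-8, so the decoder substitutes U+FFFD, the same code 65533 that Woodstox reports.

```java
import java.nio.charset.StandardCharsets;

final class BomDecodedAsUtf8 {

    /** Decodes the given bytes as UTF-8 and returns the first code point. */
    static int firstCodePointAsUtf8(byte[] bytes) {
        // Malformed input is replaced with U+FFFD by String's decoder.
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        // Java's "UTF-16" charset encodes big-endian and prepends the BOM 0xFE 0xFF.
        byte[] utf16 = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><root/>"
                .getBytes(StandardCharsets.UTF_16);
        int first = firstCodePointAsUtf8(utf16);
        System.out.printf("first char when read as UTF-8: 0x%04x (%d)%n", first, first);
    }
}
```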

As specified in https://www.w3.org/TR/xml/#sec-guessing, an XML parser (Woodstox) should be able to autodetect the file encoding, if only Camel would let it do the work.
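To illustrate what "letting the parser do the work" looks like, here is a minimal sketch that hands a raw InputStream to a StAX parser without naming a charset. It uses whatever StAX implementation XMLInputFactory finds (the JDK's built-in one unless Woodstox is on the classpath), so it is a stand-in for the Camel/Woodstox setup rather than a reproduction of it; the class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxAutodetect {

    /** Parses raw bytes with no charset hint and returns the root element's text. */
    static String rootText(byte[] xmlBytes) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // No encoding argument: the parser inspects the BOM / XML declaration itself.
        XMLStreamReader reader =
                factory.createXMLStreamReader(new ByteArrayInputStream(xmlBytes));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    /** Builds a small test document encoded in, and declaring, the given charset. */
    static byte[] sample(String charsetName) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charsetName + "\"?>"
                + "<root>gr\u00fc\u00dfe</root>";
        return xml.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(rootText(sample("UTF-8")));
        System.out.println(rootText(sample("UTF-16")));
    }
}
```

Both documents are read through the same code path; only the bytes differ.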

Is there a way to avoid implementing the encoding detection myself?

Okay, I can see that the current source code falls back to the platform encoding. So your use case, with the encoding provided in the XML declaration, is not supported.

I am not sure whether Camel really needs to fall back to a default platform encoding, as it uses java.util.Scanner in the splitter, which supports scanning without a specific encoding.

Maybe you can try to patch the source code in XMLTokenExpressionIterator, test it locally, and report back here.

We can then likely take a look at making the fallback encoding optional in Apache Camel.

And in your current version of Apache Camel you can always extend XMLTokenExpressionIterator, override the doEvaluate method, and call the createIterator method without a charset parameter. Then use your custom iterator with the Camel splitter.

Created a Camel JIRA ticket: https://issues.apache.org/jira/browse/CAMEL-11846. From my comments you can see that there is no easy solution for splitting UTF-16 XML with Camel without knowing in advance that it is UTF-16.

Although subclassing XMLTokenExpressionIterator (which is an ExpressionAdapter) and switching to InputStream works at first, there are several other places, involving xslt, xpath, and conversion to StaxSource, where it breaks for the same reason.

As a workaround, I consider it easier to let XmlStreamReader find out the encoding in advance (this happens at initialization) and to set the Exchange.CHARSET_NAME header or property.
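For readers who want to see what "finding out the encoding in advance" amounts to, here is a dependency-free sketch of the byte sniffing described in the XML spec section linked in the question. It is not the XmlStreamReader mentioned above (a class of that name ships with Apache Commons IO; assuming that is the one meant, it does this more thoroughly), and it only covers the UTF-8 and UTF-16 cases from the question. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class XmlEncodingSniffer {

    /**
     * Guesses a Java charset name from the first bytes of an XML document.
     * Simplification: UTF-32 and other encodings named only in the XML
     * declaration are not handled; anything unrecognised is treated as UTF-8.
     */
    static String sniff(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";      // UTF-8 BOM
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16";           // big-endian BOM
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16";           // little-endian BOM
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE"; // "<?" without BOM
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE"; // "<?" without BOM
        return "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><root/>";
        System.out.println(sniff(xml.getBytes(StandardCharsets.UTF_16)));   // BOM present
        System.out.println(sniff(xml.getBytes(StandardCharsets.UTF_16LE))); // no BOM
        System.out.println(sniff(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In a route, the detected name would then be stored under Exchange.CHARSET_NAME (as a header or property, as described above) before the message reaches the splitter. "UTF-16" is returned when a BOM is present because Java's UTF-16 decoder reads the BOM to pick the byte order.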
