Saxon input encoding not recognized?

Question

I get weird characters in the UTF-8 text output of the Saxon XSLT processor.

The input XML begins with

<?xml version="1.0" encoding="windows-1252"?>

It contains strings like this (shown in Notepad++, with Windows-1252 indicated as the encoding at the bottom right):

“abc”

The transformation stylesheet contains

<xsl:output method="text" encoding="utf-8" />

but the output contains (shown in Notepad++, with UTF-8 indicated as the encoding at the bottom right)

ï¿½abcï¿½

instead of the UTF-8 encoded

“abc”

Any idea what I missed?

PS: When I use Notepad++ to convert the XML input from Windows-1252 to UTF-8, the output is encoded correctly, and that is my workaround. However, I'd like to understand whether I missed something or whether some software should be improved regarding character sets.

Answer 1

I suspect that although the input is labelled as windows-1252, it isn't actually encoded in Windows-1252.

First, try to find out whether the problem is on input or on serialization. You can do that by using string-to-codepoints() within the XSLT code to see which codepoints are actually present in the parsed node tree.
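
The codepoint check described above could be sketched roughly as follows. This is only a minimal diagnostic stylesheet, not part of the original transformation: it assumes an XSLT 2.0 processor (which Saxon provides) and simply dumps the codepoints of every non-blank text node in the input, one node per line.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Diagnostic sketch (assumption: run against the same input XML, in place of
     the real stylesheet). Prints the decimal codepoints of each text node. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="utf-8"/>

  <xsl:template match="/">
    <!-- Visit every text node that is not purely whitespace -->
    <xsl:for-each select="//text()[normalize-space()]">
      <!-- string-to-codepoints() returns a sequence of integers;
           the separator attribute joins them with spaces -->
      <xsl:value-of select="string-to-codepoints(.)" separator=" "/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

If the input was decoded correctly, the string “abc” should appear as 8220 97 98 99 8221 (U+201C and U+201D are the curly quotes). If 65533 (U+FFFD, the replacement character) shows up instead, the quotes were already lost when the input was parsed, which points to the input side rather than to serialization.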

If it's an input problem, then that's the responsibility of the XML parser rather than Saxon itself, so it depends on which XML parser you are using.

Answer 1: 0, accepted, 2020-09-18 09:29:41
