How can I convert a Java string to XML entities for versions of Unicode beyond 3.0?

To convert Java characters to XML entities, I can do the following for each char in a String:

buf.append("&#x"+ Integer.toHexString(c | 0x10000).substring(1) +";");

However, according to other Stack Overflow questions, this only works for Unicode 3.0.

If I use a UTF-8 Reader to read in a String, then presumably that String contains the characters in a format that works up through Unicode 6.0 (because Java 7 supports Unicode 6.0, according to the Javadoc).

Once I have that String, how can I write it out as XML entities? Ideally I'd use some API that would continue working as new versions of Unicode come out.

Either you are not using correct terminology, or there is a great deal of confusion here.

The &#x character reference notation just specifies a numeric codepoint; it is independent of the version of Unicode used by any reader or parser.

Your code is actually only compatible with Unicode 1.x, because it assumes a character's numeric value is less than 2^16. As of Unicode 2.0 that is not a correct assumption. Some characters are represented by a single Java char, while other characters are represented by two Java chars (known as surrogates).
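To make the difference concrete, here is a small sketch that runs the question's per-char formatting next to a per-code-point version. The class and method names are illustrative only, and the input is assumed to be U+10308, written with escapes as the surrogate pair \uD800\uDF08.

```java
class PerCharVersusCodePoint {
    // The question's approach: one character reference per 16-bit char.
    static String perChar(String s) {
        StringBuilder buf = new StringBuilder();
        for (char c : s.toCharArray()) {
            buf.append("&#x" + Integer.toHexString(c | 0x10000).substring(1) + ";");
        }
        return buf.toString();
    }

    // One character reference per Unicode code point.
    static String perCodePoint(String s) {
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
            buf.append("&#x").append(Integer.toHexString(s.codePointAt(i))).append(';');
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String s = "\uD800\uDF08"; // U+10308 stored as a high/low surrogate pair
        System.out.println(perChar(s));      // &#xd800;&#xdf08;
        System.out.println(perCodePoint(s)); // &#x10308;
    }
}
```

The per-char output refers to the two surrogate values separately. XML does not allow references to surrogate code points, so only the second form is usable.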

I'm not sure what a "UTF-8 Reader" is. A Reader just reads char values, and does not know about UTF-8 or any other charset, except for InputStreamReader, which uses a CharsetDecoder to translate bytes to chars using the UTF-8 encoding (or whatever encoding a particular CharsetDecoder uses).
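As a rough illustration of that decoding step, the sketch below feeds the four UTF-8 bytes of U+10308 through an InputStreamReader and collects the resulting chars. The class name, the method name, and the choice of U+10308 are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8ReaderDemo {
    // Decodes UTF-8 bytes into a String by reading one char at a time.
    static String decodeUtf8(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // F0 90 8C 88 is the UTF-8 encoding of U+10308.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x90, (byte) 0x8C, (byte) 0x88};
        String s = decodeUtf8(utf8);
        System.out.println(s.length());                           // 2
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 10308
    }
}
```

The Reader hands back two chars for the one character; nothing in this path looks up whether the code point is assigned.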

In any event, no Reader will parse the XML &#x character reference notation. You must use an XML parser for that.

No Reader or XML parser is affected by the Unicode version known to Java, because no Reader or XML parser consults a Unicode database in any way. The characters are just treated as numeric values as they are parsed. Whether they correspond to assigned codepoints in any Unicode version is never considered.

Finally, to write out a String as XML, you can use a Formatter:
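A minimal sketch of that point, using the JDK's built-in DOM parser: the helper below (its name and the sample documents are made up for illustration) parses a character reference and returns the element text. The second sample, U+50000, is assumed here to be an unassigned code point that is still a legal XML character.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefParsingDemo {
    // Parses a small XML document and returns the root element's text content.
    static String parseText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // The parser turns the numeric reference into chars without
        // checking whether the code point is assigned.
        String assigned = parseText("<a>&#x10308;</a>");
        String unassigned = parseText("<a>&#x50000;</a>");
        System.out.println(Integer.toHexString(assigned.codePointAt(0)));
        System.out.println(Integer.toHexString(unassigned.codePointAt(0)));
    }
}
```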

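import java.util.Formatter;

static String toXML(String s) {
    Formatter formatter = new Formatter();
    int len = s.length();
    // Advance one code point at a time, so a surrogate pair counts as one character.
    for (int i = 0; i < len; i = s.offsetByCodePoints(i, 1)) {
        int c = s.codePointAt(i);
        if (c < 32 || c > 126 || c == '&' || c == '<' || c == '>') {
            // Control, non-ASCII, and markup characters become hex character references.
            formatter.format("&#x%x;", c);
        } else {
            formatter.format("%c", c);
        }
    }
    return formatter.toString();
}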

As you can see, there is no code that depends on the Unicode version, because the characters are just numeric values. Whether each numeric value is an assigned Unicode codepoint is not relevant.

(My first inclination was to use the XMLStreamWriter class, but it turns out an XMLStreamWriter that uses a non-Unicode encoding such as ISO-8859-1 or US-ASCII does not properly output surrogate pairs as single character entities, as of Java 1.8.0_05.)

Originally, Java supported Unicode 1.0 by making the char type 16 bits long. Unicode 2.0 then introduced a surrogate mechanism to support more characters than 16 bits allow, so Java strings became UTF-16 encoded. That means some characters need two Java chars to be represented: the high surrogate char and the low surrogate char.

To know which chars in a String are actually high/low surrogate pairs, you can use the utility methods in Character:

Character.isHighSurrogate(myChar); // returns true if myChar is a high surrogate
Character.isLowSurrogate(myChar); // same for low surrogate

Character.isSurrogate(myChar); // just to know if myChar is a surrogate

Once you know which chars are high or low surrogates, you need to convert each pair to a Unicode codepoint with this method:

int codePoint = Character.toCodePoint(highSurrogate, lowSurrogate);
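Put together, the detection and conversion steps might look like the sketch below. The helper firstCodePoint is hypothetical and only inspects the start of a non-empty string; the sample values assume U+10308, whose surrogate pair is \uD800\uDF08.

```java
class SurrogateDemo {
    // Returns the code point at the start of a non-empty string,
    // combining a leading high/low surrogate pair when one is present.
    static int firstCodePoint(String s) {
        char first = s.charAt(0);
        if (Character.isHighSurrogate(first)
                && s.length() > 1
                && Character.isLowSurrogate(s.charAt(1))) {
            return Character.toCodePoint(first, s.charAt(1));
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(firstCodePoint("\uD800\uDF08"))); // 10308
        System.out.println(Integer.toHexString(firstCodePoint("\u00D1")));       // d1
        // charCount reports how many chars a code point needs in UTF-16.
        System.out.println(Character.charCount(0x10308)); // 2
    }
}
```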

Since a piece of code is worth a thousand words, here is an example method that replaces the non-US-ASCII chars in a string with XML character references:

public static String replaceToCharEntities(String str) {
    StringBuilder result = new StringBuilder(str.length());

    char surrogate = 0;
    for(char c: str.toCharArray()) {

        // if char is a high surrogate, keep it to match it
        // against the next char (low surrogate)
        if(Character.isHighSurrogate(c)) {
            surrogate = c;
            continue;
        }

        // get codePoint
        int codePoint;
        if(surrogate != 0) {
            codePoint = Character.toCodePoint(surrogate, c);
            surrogate = 0;
        } else {
            codePoint = c;
        }

        // decide whether to use just a char or a character reference
        if(codePoint < 0x20 || codePoint > 0x7E || codePoint == '<'
                || codePoint == '>' || codePoint == '&' || codePoint == '"'
                || codePoint == '\'') {
            result.append(String.format("&#x%x;", codePoint));
        } else {
            result.append(c);
        }
    }

    return result.toString();
}

The following string is a good one to test with, as it contains a non-ASCII char that fits in a 16-bit value and also a character that needs a high/low surrogate pair:

String myString = "text with some non-US chars: 'Ñ' and '𐌈'";
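For a quick check, here is a compact variant of the same escaping rule built on String.codePoints() (Java 8 or later). It is a sketch rather than the method above: the class and method names are illustrative, and the test string is written with escapes on the assumption that its two special characters are U+00D1 and U+10308.

```java
class CharEntityCheck {
    // Escapes control, non-ASCII, markup, and quote characters as hex references.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp < 0x20 || cp > 0x7E || cp == '<' || cp == '>'
                    || cp == '&' || cp == '"' || cp == '\'') {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.append((char) cp); // plain printable ASCII
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String myString = "text with some non-US chars: '\u00D1' and '\uD800\uDF08'";
        System.out.println(escape(myString));
        // text with some non-US chars: &#x27;&#xd1;&#x27; and &#x27;&#x10308;&#x27;
    }
}
```

The apostrophes are escaped too, because this rule, like the method above, treats both quote characters as special.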
