Write 16 bits character to .xlsx file using Apache POI in Java

I have a problem with Apache POI. I am trying to write a 16-bit character value (such as a character from CJK Unified Ideographs Extension B) to an .xlsx file. However, the cell value turns into question marks (like ????) in the generated .xlsx file.

Does anyone know how to handle such 16-bit character values in Apache POI with the .xlsx format?

My POI version is 3.14.

Code sample as below:

XSSFWorkbook workbook = new XSSFWorkbook();
XSSFSheet sheet = workbook.createSheet("Test");

XSSFRow row1 = sheet.createRow(0);
XSSFCell r1c1 = row1.createCell(0);
r1c1.setCellValue("𤆕𤆕𤆕"); // value of CJK Unified Ideographs Extension B
XSSFCell r1c2 = row1.createCell(1);

FileOutputStream fos = new FileOutputStream("D:/temp/test.xlsx");
workbook.write(fos);
fos.close();

Thanks!

The problem exists, but not with 16-bit (2-byte) Unicode characters from 0x0000 to 0xFFFF. It occurs with characters that need more than 2 bytes, that is, more than one 16-bit code unit, in UTF-16. These are the supplementary characters, which the Java Character documentation addresses in terms of Unicode code points: "Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding." The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters (characters whose code points are greater than U+FFFF) are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
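To make the distinction between code points and code units concrete, here is a minimal sketch using only the standard `Character` and `String` APIs. The class name `SurrogateDemo` and the helper `fromCodePoint` are made up for illustration, and U+20000 (the first code point of CJK Unified Ideographs Extension B) stands in for the characters in the question.

```java
public class SurrogateDemo {
    /** Builds a String from one code point; supplementary code points yield two chars. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+20000 is an illustrative choice, not the exact character from the question.
        String s = fromCodePoint(0x20000);

        System.out.println(s.length());                             // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));        // 1 (code point)
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D840 DC00
    }
}
```

So any code that inspects such a string one `char` at a time only ever sees the two surrogate halves, never the value 0x20000 itself.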

The problem is in org.apache.xmlbeans.impl.store.Saver. It works with a private char[] _buf. But since the maximum value of a char is 0xFFFF, Unicode code points from 0x10000 to 0x10FFFF cannot be stored in a single char. So they are stored as a pair of char values.

There is a method:

    /**
     * Test if a character is valid in xml character content. See
     * http://www.w3.org/TR/REC-xml#NT-Char
     */

    private boolean isBadChar ( char ch )
    {
        return ! (
            (ch >= 0x20 && ch <= 0xD7FF ) ||
            (ch >= 0xE000 && ch <= 0xFFFD) ||
            (ch >= 0x10000 && ch <= 0x10FFFF) ||
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
            );
    }

That code is totally buggy, since it checks whether a char is between 0x10000 and 0x10FFFF. As mentioned, this is not possible at all.

It also treats the high-surrogates range (\uD800-\uDBFF) and the low-surrogates range (\uDC00-\uDFFF) as bad chars. So code points represented as a pair of char values are rejected.

So the problem results from a bug in org.apache.xmlbeans.impl.store.Saver.
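The effect can be reproduced outside xmlbeans. The sketch below copies the predicate quoted above into a hypothetical class `BadCharDemo` (the class name, the method name `isBadCharOriginal`, and the `static` modifier are additions for the demo) and applies it to both halves of the surrogate pair for U+20000, an illustrative Extension B code point.

```java
public class BadCharDemo {
    /** Copy of the original predicate, made static so it can be called directly. */
    static boolean isBadCharOriginal(char ch) {
        return !(
            (ch >= 0x20 && ch <= 0xD7FF) ||
            (ch >= 0xE000 && ch <= 0xFFFD) ||
            (ch >= 0x10000 && ch <= 0x10FFFF) || // never true: a char cannot exceed 0xFFFF
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
        );
    }

    public static void main(String[] args) {
        char[] pair = Character.toChars(0x20000); // { 0xD840, 0xDC00 }

        System.out.println(isBadCharOriginal(pair[0])); // true: high surrogate is rejected
        System.out.println(isBadCharOriginal(pair[1])); // true: low surrogate is rejected
        System.out.println(isBadCharOriginal('A'));     // false: ordinary BMP character
    }
}
```

Both halves are flagged as bad, which is consistent with the question marks seen in the generated file.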


Patch:

Goal: do not treat the high-surrogates range (\uD800-\uDBFF) and the low-surrogates range (\uDC00-\uDFFF) as bad chars. Then Unicode code points above U+FFFF, which are stored as two 16-bit chars, will no longer be excluded from the XML.

Download Saver.java. Change the method private boolean isBadChar ( char ch ) to

    /**
     * Test if a character is valid in xml character content. See
     * http://www.w3.org/TR/REC-xml#NT-Char
     */
    private boolean isBadChar ( char ch )
    {
        return ! (
            (ch >= 0x20 && ch <= 0xFFFD ) ||
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
            );
    }

in both static final class OptimizedForSpeedSaver and static final class TextSaver.

Compile Saver.java.

Store a backup of xmlbeans-2.6.0.jar somewhere outside the classpath.

Replace Saver$OptimizedForSpeedSaver.class and Saver$TextSaver.class in xmlbeans-2.6.0.jar -> /org/apache/xmlbeans/impl/store/ with the newly compiled ones.

Now Unicode code points above U+FFFF will be stored in sharedStrings.xml.
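The behavior of the patched predicate can be checked in isolation. This is a minimal sketch with a hypothetical class `PatchedBadCharDemo`; the method body is the patched version shown above, renamed and made static for the demo.

```java
public class PatchedBadCharDemo {
    /** Copy of the patched predicate, made static so it can be called directly. */
    static boolean isBadCharPatched(char ch) {
        return !(
            (ch >= 0x20 && ch <= 0xFFFD) ||
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
        );
    }

    public static void main(String[] args) {
        char[] pair = Character.toChars(0x20000); // { 0xD840, 0xDC00 }, illustrative code point

        System.out.println(isBadCharPatched(pair[0]));      // false: high surrogate now passes
        System.out.println(isBadCharPatched(pair[1]));      // false: low surrogate now passes
        System.out.println(isBadCharPatched((char) 0xFFFE)); // true: still rejected
        System.out.println(isBadCharPatched((char) 0x0));    // true: still rejected
    }
}
```

Note that this check works on single chars, so it also lets an unpaired surrogate through; it does not verify that a high surrogate is actually followed by a low surrogate.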


Disclaimer: This is not well tested, so don't use it in production. It is only shown here to describe the problem. Maybe some programmers at xmlbeans.apache.org will find the time to solve the problem with org.apache.xmlbeans.impl.store.Saver properly.


Update: There is an xmlbeans-2.6.2.jar available now. It already contains the patch.


Update: There is an xmlbeans-3.0.0.jar available now. It also contains the patch already.

It does:

/**
 * Test if a character is valid in xml character content. See
 * http://www.w3.org/TR/REC-xml#NT-Char
 */
static boolean isBadChar ( char ch )
{
    return ! (
        Character.isHighSurrogate(ch) ||
        Character.isLowSurrogate(ch) ||
        (ch >= 0x20 && ch <= 0xD7FF ) ||
        (ch >= 0xE000 && ch <= 0xFFFD) ||
        (ch >= 0x10000 && ch <= 0x10FFFF) ||
        (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
    );
}

So it checks whether char ch is a high surrogate or a low surrogate, and if so, it is not a bad char. OK.

But nevertheless it still checks whether char ch is greater than or equal to 0x10000. Again: this is not possible for a char! The maximum value of a char is 0xFFFF.
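For comparison, the range check against 0x10000 to 0x10FFFF only becomes meaningful when the test is done on int code points rather than on chars. The following is a minimal sketch of that idea, not xmlbeans code: the class `XmlCharCheck` and both method names are hypothetical, and the ranges are taken from the Char production referenced in the comment above (http://www.w3.org/TR/REC-xml#NT-Char).

```java
public class XmlCharCheck {
    /** True if the code point falls in one of the ranges of the XML 1.0 Char production. */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF); // reachable here, because cp is an int
    }

    /** True if every code point of the string is a valid XML character. */
    static boolean isValidXmlText(String s) {
        // codePoints() combines valid surrogate pairs into a single int;
        // an unpaired surrogate is reported as its own value and fails the check.
        return s.codePoints().allMatch(XmlCharCheck::isValidXmlCodePoint);
    }

    public static void main(String[] args) {
        String extB = new String(Character.toChars(0x20000)); // illustrative Extension B code point
        String lone = String.valueOf((char) 0xD840);          // unpaired high surrogate

        System.out.println(Character.MAX_VALUE == 0xFFFF); // true: a char never reaches 0x10000
        System.out.println(isValidXmlText(extB));          // true
        System.out.println(isValidXmlText(lone));          // false
    }
}
```

Working on code points accepts well-formed surrogate pairs and rejects unpaired surrogates, at the cost of iterating over the text as a String rather than testing one char at a time.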
