
Java Unicode to readable text conversion decoding

I am developing a Java application that consumes a web service. The web service is created using a SAP server, which automatically encodes the data in Unicode. I get a Unicode string from the web service.

" 倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2 "

Above is the response.

I want to convert it to a readable text format such as a String. I am using core Java.


That's a PDF file that has been interpreted as UTF-16LE.
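One way to check this diagnosis is to take the ASCII bytes of a PDF header and decode them with the wrong charset. The sketch below is only an illustration: the class and method names are made up, and it assumes the file starts with `%PDF-1.3`, which is what the first four characters of the response correspond to.

```java
import java.nio.charset.StandardCharsets;

final class MisdecodedPdfDemo {
    // Decode raw bytes with the wrong charset, as the receiving component appears to do.
    static String misdecode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Re-encode the garbled text with the same charset to get the original bytes back.
    static byte[] recover(String garbled) {
        return garbled.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.3".getBytes(StandardCharsets.US_ASCII);
        String garbled = misdecode(header);
        // Each pair of ASCII bytes becomes one UTF-16 code unit: '%' (0x25) and 'P' (0x50) form U+5025.
        System.out.println(garbled);
        System.out.println(new String(recover(garbled), StandardCharsets.US_ASCII));
    }
}
```

The round trip back to bytes is only reliable when every byte pair happened to form a valid UTF-16 code unit and the byte count was even. For a real PDF that is not guaranteed, so the safer fix is to avoid the wrong decoding in the first place.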

You need to look at which component is receiving the response and how it handles the input, so you can stop it being decoded as UTF-16LE. Ultimately, though, there isn't a 'readable' version of it as such, because it's a binary file. Extracting the document text out of a PDF file is a much bigger problem!

(Note: Unicode is a character set; UTF-16LE is an encoding of that set into bytes. Microsoft call the UTF-16LE encoding "Unicode" due to a historical accident, but that's misleading.)

If you have a byte[] or an InputStream (both binary data), you can get a String or a Reader (both text) with:

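import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

final String encoding = "UTF-8"; // or "UTF-16LE" or "UTF-16BE"

// Decode a byte array into a String
byte[] b = ...;
String s = new String(b, encoding);

// Decode a stream and read it line by line
InputStream is = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
String line;
while ((line = reader.readLine()) != null) {
    // process line; readLine() returns null at end of stream
}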

The reverse process uses:

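// Encode a String into bytes
byte[] b = s.getBytes(encoding);

// Write text to a stream in the chosen encoding
OutputStream os = ...;
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, encoding));
writer.write(s);
writer.newLine();
writer.close(); // flushes buffered text and closes the stream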

Unicode is a numbering system for all characters. The UTF variants encode those Unicode numbers (code points) as bytes.
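To make the difference between the variants concrete, here is a small sketch that encodes one character three ways and prints the bytes in hex. The class name and the choice of the character U+4E2D are just for illustration.

```java
import java.nio.charset.StandardCharsets;

final class UtfVariantsDemo {
    // Format bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // one code point, U+4E2D
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));    // E4 B8 AD
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
    }
}
```

The code point is the same in all three cases; only the byte layout changes, which is why decoding with the wrong variant produces unrelated characters.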


Your problem:

Normally, with a web service, you would already have received a String. You could write that string to a file using the Writer above, for instance, either to check it yourself with a full Unicode font or to pass the file on for a check.

You need (?) to check which UTF variant the text is in. For Asian scripts, UTF-16 (little-endian or big-endian) is usually more compact than UTF-8. In XML, the encoding would already be declared.
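If you have the raw bytes, one rough way to check the variant is to look for a byte order mark (BOM) at the start. This is only a sketch under the assumption that the sender writes a BOM at all, which many do not; the class and method names are hypothetical, and UTF-32 is ignored for brevity.

```java
final class BomSniffer {
    // Returns the charset name suggested by a leading BOM, or null if none is found.
    static String detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: the encoding has to come from elsewhere (headers, XML declaration)
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // BOM followed by 'A' in UTF-16LE
        System.out.println(detect(sample));
    }
}
```

Bytes that begin with `%PDF` have no BOM, so this check returns null for them, which is itself a hint that the payload is not UTF-16 text.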


Addition:

FileWriter writes to a file using the default encoding (taken from the operating system on your machine). Instead use:

new OutputStreamWriter(new FileOutputStream(new File("...")), "UTF-8")

If it is a binary PDF, as @bobince said, just use a FileOutputStream on the byte[] or InputStream.
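A minimal sketch of that binary path, assuming you can get at the response as an InputStream before anything decodes it to text. The class name, method name, and file name are placeholders, not part of any SAP or web service API.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class BinarySaver {
    // Copy the stream to the file byte for byte, with no charset involved.
    static void save(InputStream in, File target) throws IOException {
        try (OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the web service response body.
        byte[] fake = {0x25, 0x50, 0x44, 0x46}; // "%PDF"
        save(new ByteArrayInputStream(fake), new File("response.pdf"));
    }
}
```

Because no Reader, Writer, or String is involved, bytes that are not valid text in any encoding pass through unchanged.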

This is definitely not a valid string. It looks like mangled UTF-16.

UPDATE

Indeed, @Bobince is right: this is a PDF file (most probably UTF-8 or plain ASCII) displayed as UTF-16. When displayed as UTF-8, this string indeed shows PDF source code. Good catch.
