Converting a byte array to a String - Java

Question

I am trying to read the content of a file into any readable form. I am using a FileInputStream to read from the file into a byte array, and then trying to convert that byte array into a String.

So far, I have tried 3 different ways:

FileInputStream inputStream = new FileInputStream(file);
byte[] clearTextBytes = new byte[(int) file.length()];
inputStream.read(clearTextBytes);

String s = IOUtils.toString(inputStream); //first way

String str = new String(clearTextBytes, "UTF-8"); //second way

String string = Arrays.toString(clearTextBytes); //third way
String[] byteValue = string.substring(1, string.length() - 1).split(",");
byte[] bytes = new byte[byteValue.length];
for(int i=0, len=bytes.length; i<len; i++){
   bytes[i] = Byte.parseByte(byteValue[i].trim());
}
String newStr = new String(bytes);

When I print out each of the Strings, 1) prints nothing, and 2) and 3) print a lot of weird characters, such as: PK! Q [Content_Types].xml ( MO @ & f ] pP<* v ݏ ,_ i I (zi N }fڝ h 5) & 6Sf c| " d R d Eo r l :0Tɭ "Э p'䧘 tn & q(=X !. , _ WF L8W......

I would love any advice on how to properly convert my byte array to a String.

Answer 1

As others have noted, the data doesn't look like it contains any text, so it is quite possibly binary data rather than text. Note that files which start with PK could be in PKZIP format, and the randomness of your data does suggest it could be compressed. See http://www.garykessler.net/library/file_sigs.html. Try renaming the file to have .ZIP at the end and see if you can open it in a file explorer.

From the link above, the start of a DOCX file looks as follows.

50 4B 03 04 14 00 06 00 PK...... DOCX, PPTX, XLSX
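The same check can be done in code before attempting any text decoding. The following is a minimal sketch, assuming the file starts with the usual ZIP local file header signature 50 4B 03 04 ("PK" followed by 0x03 0x04); the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ZipSignatureCheck {
    // Signature that starts a ZIP local file header: 'P' 'K' 0x03 0x04.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Returns true if the given bytes begin with the ZIP signature. */
    static boolean looksLikeZip(byte[] head) {
        if (head.length < ZIP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZIP_MAGIC.length; i++) {
            if (head[i] != ZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the first four bytes of the file and checks them. */
    static boolean looksLikeZip(Path file) throws IOException {
        byte[] head = new byte[ZIP_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return filled == head.length && looksLikeZip(head);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZipSignatureCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(looksLikeZip(file) ? "ZIP-based, treat as binary" : "no ZIP signature");
    }
}
```

If this reports a ZIP signature, no choice of charset will turn the bytes into readable text; the file has to be opened as an archive first.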

 Microsoft Office Open XML Format (OOXML) Document NOTE: There is no subheader for MS OOXML files as there is with DOC, PPT, and XLS files. To better understand the format of these files, rename any OOXML file to have a .ZIP extension and then unZIP the file; look at the resultant file named [Content_Types].xml to see the content types. In particular, look for the <Override PartName= tag, where you will find word, ppt, or xl, respectively. Trailer: Look for 50 4B 05 06 (PK..) followed by 18 additional bytes at the end of the file.
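Instead of renaming and unzipping by hand, the archive entries can be listed from the byte array that was already read. This is only a sketch using `java.util.zip` from the standard library; the class name `ZipEntryLister` is hypothetical, and it assumes the whole file is in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ZipEntryLister {
    /** Returns the entry names of a ZIP archive held in memory, in archive order. */
    static List<String> entryNames(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            // getNextEntry() returns null once there are no more entries.
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

Called as `ZipEntryLister.entryNames(clearTextBytes)` with the array from the question, a list containing `[Content_Types].xml` together with names under `word/`, `ppt/`, or `xl/` would point to a DOCX, PPTX, or XLSX file respectively.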

Assuming you do have text data, most likely the character encoding is neither your default nor UTF-8. You need to a) check what the encoding is, and b) check that the corruption happens when you read the input, not when you output the string.

You can try brute force to find a character set which doesn't produce any unknown characters.
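One way to tell an input problem from an output problem is to print the decoded string's code points in plain ASCII, so the console's own encoding cannot distort the result. A minimal sketch follows; `CodePointDump` is a made-up helper, not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class CodePointDump {
    /** Formats each code point of the string as U+XXXX, separated by spaces. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // "é" encoded as UTF-8
        // Decoded with the right charset: prints U+00E9
        System.out.println(dump(new String(bytes, StandardCharsets.UTF_8)));
        // Decoded with the wrong charset: prints U+00C3 U+00A9
        System.out.println(dump(new String(bytes, StandardCharsets.ISO_8859_1)));
    }
}
```

If the dumped code points are the expected ones but the console still shows garbage, the decoding is fine and the output side is at fault. If the dump contains U+FFFD or unexpected values, the wrong charset was used when reading.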

public static Set<Charset> possibleCharsets(byte[] bytes) {
    Set<Charset> charsets = new LinkedHashSet<>();
    for (Charset charset : Charset.availableCharsets().values()) {
        if (!new String(bytes, charset).contains("�"))
            charsets.add(charset);
    }
    return charsets;
}
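A stricter variant of the check above, offered as a sketch rather than a replacement: instead of searching the decoded string for the replacement character, ask the decoder to report malformed or unmappable input. This also avoids rejecting text that legitimately contains U+FFFD. The name `StrictDecode` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecode {
    /** Returns true if the bytes decode in the given charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is invalid in this charset.
            return false;
        }
    }
}
```

Either way, a clean decode only shows that a charset is possible, not that it is the right one. Single-byte charsets such as ISO-8859-1 accept every byte sequence, so they pass for binary data too.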

Answer 2

UTF-8 can encode about 2,097,152 different values (21 bits in its four-byte form, although Unicode defines only 1,114,112 code points). For those that have no glyph, you see a question mark. Try the classic DOS code page instead:

new String(clearTextBytes, "IBM437"); // code page 437, the classic US DOS code page

Answer 3

To get the text contents of a Word file, try the following. You'll need the Apache POI libraries.

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

[...]

   XWPFDocument docx = new XWPFDocument(new FileInputStream("file.docx"));       
   XWPFWordExtractor we = new XWPFWordExtractor(docx);
   System.out.println(we.getText());

Answer 4

I've written a very basic program to read the contents of a file and print each string on a new line in the console. Here is the content of the file:

Here is the program I wrote:

import java.io.*;
import java.util.*;

class Test {
    public static void main(String args[]) throws FileNotFoundException {
        File file = new File("File1.txt");
        Scanner input = new Scanner(file);

        while (input.hasNext()) {
            System.out.println(input.next());
        }

        input.close();

    } // main()
} // class Test

This is the output to the console:

apples
pears
1
2
3
oranges
carrots
bananas
pineapples
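For a genuine text file like this one, the byte-array route from the question also works, provided the whole file is read and the charset is named explicitly. A minimal sketch, assuming the file is UTF-8 (that is an assumption, not something stated above) and using a hypothetical class name:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ReadTextFile {
    /** Reads the entire file into a byte array and decodes it with the given charset. */
    static String readAll(Path file, Charset charset) throws IOException {
        // readAllBytes reads to end of file, unlike a single InputStream.read(byte[]) call,
        // which is allowed to return before the array is full.
        byte[] bytes = Files.readAllBytes(file);
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws IOException {
        // "File1.txt" is the file name from the program above; UTF-8 is assumed.
        System.out.println(readAll(Paths.get("File1.txt"), StandardCharsets.UTF_8));
    }
}
```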

4 answers

Answer 1
4 2015-12-01 13:17:19

Answer 2
0 2015-12-01 13:24:04

Answer 3
0 2015-12-01 13:24:31

Answer 4
0 2015-12-01 13:51:16
