
Convert byte array to String - java

I am trying to read in the content of a file to any readable form. I am using a FileInputStream to read from the file to a byte array, and then am trying to convert that byte array into a String.

So far, I have tried 3 different ways:

FileInputStream inputStream = new FileInputStream(file);
byte[] clearTextBytes = new byte[(int) file.length()];
inputStream.read(clearTextBytes);

String s = IOUtils.toString(inputStream); //first way

String str = new String(clearTextBytes, "UTF-8"); //second way

String string = Arrays.toString(clearTextBytes); //third way
String[] byteValue = string.substring(1, string.length() - 1).split(",");
byte[] bytes = new byte[byteValue.length];
for(int i=0, len=bytes.length; i<len; i++){
   bytes[i] = Byte.parseByte(byteValue[i].trim());
}
String newStr = new String(bytes);

When I print out each of the Strings: 1) prints out nothing, and 2 & 3) print out a lot of weird characters, such as: PK! Q [Content_Types].xml ( MO @ & f ] pP<* v ݏ ,_ i I (zi N }fڝ h 5) & 6Sf c| " d R d Eo r l :0Tɭ "Э p'䧘 tn & q(=X !. , _ WF L8W......

I would love any advice on how to properly convert my byte array to a String.
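A side note on the first attempt: it most likely prints nothing because `inputStream.read(clearTextBytes)` has already consumed the stream, so `IOUtils.toString(inputStream)` finds nothing left to read. For a file that really is text, a minimal sketch of reading and decoding it in one step could look like the following. The class name `TextFileReader` and the choice of UTF-8 are assumptions for illustration, not something taken from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, named here only for illustration.
final class TextFileReader {
    // Reads the whole file into memory and decodes it as UTF-8.
    // Assumption: the file really is UTF-8 text; binary data will still decode to garbage.
    static String readUtf8(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Calling `TextFileReader.readUtf8(file.toPath())` would replace both the manual `read` call and the conversion, but it cannot make a binary file readable.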

As others have noted, the data doesn't look like it contains any text, so it is quite possibly binary data rather than text. Note that files which start with PK could be in PKZIP format, and the randomness of your data suggests it could be compressed. http://www.garykessler.net/library/file_sigs.html Try renaming the file to have .ZIP at the end and see if you can open it in a file explorer.
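The same check can be done in code before attempting any String conversion. This is a rough sketch that only compares the first four bytes against the ZIP local-file-header signature 50 4B 03 04; the class and method names are made up for the example, and a match does not prove the rest of the file is a valid archive.

```java
// Hypothetical helper, named here only for illustration.
final class FileSignature {
    // Signature that starts most ZIP archives: 50 4B 03 04 ("PK" plus two control bytes).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // True if the array begins with the ZIP signature; shorter arrays never match.
    static boolean looksLikeZip(byte[] bytes) {
        if (bytes.length < ZIP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZIP_MAGIC.length; i++) {
            if (bytes[i] != ZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

If `FileSignature.looksLikeZip(clearTextBytes)` returns true, treating the bytes as text in any encoding is unlikely to give readable output.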

From the link above, the start of a DOCX file looks as follows.

50 4B 03 04 14 00 06 00 PK...... DOCX, PPTX, XLSX

 Microsoft Office Open XML Format (OOXML) Document NOTE: There is no subheader for MS OOXML files as there is with DOC, PPT, and XLS files. To better understand the format of these files, rename any OOXML file to have a .ZIP extension and then unZIP the file; look at the resultant file named [Content_Types].xml to see the content types. In particular, look for the <Override PartName= tag, where you will find word, ppt, or xl, respectively. Trailer: Look for 50 4B 05 06 (PK..) followed by 18 additional bytes at the end of the file. 
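Instead of renaming and unzipping by hand, the entries can also be listed with the JDK's own `java.util.zip` classes. A minimal sketch, assuming the whole file is already in a byte array; `ZipLister` is a made-up name for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, named here only for illustration.
final class ZipLister {
    // Returns the entry names in the order they appear in the archive.
    static List<String> entryNames(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

For an OOXML file the list should include [Content_Types].xml, plus entries under word, ppt, or xl depending on the document type.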

Assuming you do have text data, the character encoding is most likely neither your platform default nor UTF-8. You need to a) check what the encoding actually is, and b) check that the corruption is not introduced when you print the string (for example by the console's encoding) rather than when you read the input.
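Both checks can be sketched with the standard library. For (a), a `CharsetDecoder` configured to report errors fails on bytes that are invalid in the given charset instead of silently replacing them. For (b), printing code points as U+XXXX shows what the String really contains, independent of the console. The class and method names below are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper, named here only for illustration.
final class StrictDecoder {
    // Returns the decoded text, or empty if the bytes are not valid in the given charset.
    static Optional<String> decode(byte[] bytes, Charset charset) {
        try {
            return Optional.of(charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    // Renders each code point as U+XXXX so the result does not depend on console encoding.
    static String codePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }
}
```

If `decode` succeeds but the console still shows odd characters, comparing the `codePoints` output with what you expect helps tell an input problem from an output problem.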

You can try brute force to find a character set which doesn't produce any unknown characters.

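import java.nio.charset.Charset;
import java.util.LinkedHashSet;
import java.util.Set;

// Returns every installed charset that decodes the bytes without a replacement character.
public static Set<Charset> possibleCharsets(byte[] bytes) {
    Set<Charset> charsets = new LinkedHashSet<>();
    for (Charset charset : Charset.availableCharsets().values()) {
        // U+FFFD is what decoders typically substitute for bytes they cannot map.
        if (!new String(bytes, charset).contains("\uFFFD"))
            charsets.add(charset);
    }
    return charsets;
}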

UTF-8 can encode a very large number of characters (a four-byte sequence has room for 2,097,152 values); those for which your font has no glyph are displayed as a question mark. Try the classic DOS code page instead:

new String(clearTextBytes, "IBM437"); // IBM437 (alias Cp437) is the US DOS code page

To get the text contents of a Word file, check this out. You'd need the Apache POI libraries.

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

[...]

   XWPFDocument docx = new XWPFDocument(new FileInputStream("file.docx"));       
   XWPFWordExtractor we = new XWPFWordExtractor(docx);
   System.out.println(we.getText());

I've written a very basic program to read the contents of a file and to print each string on a new line in the console. Here is the content of the file:

FILE1.TXT

Here is the program I wrote:

import java.io.*;
import java.util.*;

class Test {
    public static void main(String args[]) throws FileNotFoundException {
        File file = new File("File1.txt");
        Scanner input = new Scanner(file);

        while (input.hasNext()) {
            System.out.println(input.next());
        }

        input.close();

    } // main()
} // class Test

This is the output to the console:

apples
pears
1
2
3
oranges
carrots
bananas
pineapples


 