简体   繁体   中英

All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"?

I'm creating a simple wordcount program in Java that reads through a directory's text-based files.

However, I keep on getting the error:

java.nio.charset.MalformedInputException: Input length = 1

from this line of code:

BufferedReader reader = Files.newBufferedReader(file,Charset.forName("UTF-8"));

I know I probably get this because I used a Charset that didn't include some of the characters in the text files, some of which included characters of other languages. But I want to include those characters.

I later learned at the JavaDocs that the Charset is optional and only used for a more efficient reading of the files, so I changed the code to:

BufferedReader reader = Files.newBufferedReader(file);

But some files still throw the MalformedInputException . I don't know why.

I was wondering if there is an all-inclusive Charset that will allow me to read text files with many different types of characters ?

Thanks.

You probably want to have a list of supported encodings. For each file, try each encoding in turn, maybe starting with UTF-8. Every time you catch the MalformedInputException , try the next encoding.

Creating BufferedReader from Files.newBufferedReader

Files.newBufferedReader(Paths.get("a.txt"), StandardCharsets.UTF_8);

when running the application it may throw the following exception:

java.nio.charset.MalformedInputException: Input length = 1

But

new BufferedReader(new InputStreamReader(new FileInputStream("a.txt"),"utf-8"));

works well.

The different is that, the former uses CharsetDecoder default action.

The default action for malformed-input and unmappable-character errors is to report them.

while the latter uses the REPLACE action.

cs.newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE)

ISO-8859-1 is an all-inclusive charset, in the sense that it's guaranteed not to throw MalformedInputException. So it's good for debugging, even if your input is not in this charset. So:-

req.setCharacterEncoding("ISO-8859-1");

I had some double-right-quote/double-left-quote characters in my input, and both US-ASCII and UTF-8 threw MalformedInputException on them, but ISO-8859-1 worked.

I also encountered this exception with error message,

java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at sun.nio.cs.StreamEncoder.implWrite(Unknown Source)
at sun.nio.cs.StreamEncoder.write(Unknown Source)
at java.io.OutputStreamWriter.write(Unknown Source)
at java.io.BufferedWriter.flushBuffer(Unknown Source)
at java.io.BufferedWriter.write(Unknown Source)
at java.io.Writer.write(Unknown Source)

and found that some strange bug occurs when trying to use

BufferedWriter writer = Files.newBufferedWriter(Paths.get(filePath));

to write a String "orazg 54" cast from a generic type in a class.

//key is of generic type <Key extends Comparable<Key>>
writer.write(item.getKey() + "\t" + item.getValue() + "\n");

This String is of length 9 containing chars with the following code points:

111 114 97 122 103 9 53 52 10

However, if the BufferedWriter in the class is replaced with:

FileOutputStream outputStream = new FileOutputStream(filePath);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(outputStream));

it can successfully write this String without exceptions. In addition, if I write the same String create from the characters it still works OK.

String string = new String(new char[] {111, 114, 97, 122, 103, 9, 53, 52, 10});
BufferedWriter writer = Files.newBufferedWriter(Paths.get("a.txt"));
writer.write(string);
writer.close();

Previously I have never encountered any Exception when using the first BufferedWriter to write any Strings. It's a strange bug that occurs to BufferedWriter created from java.nio.file.Files.newBufferedWriter(path, options)

ISO_8859_1 Worked for me! I was reading text file with comma separated values

I wrote the following to print a list of results to standard out based on available charsets. Note that it also tells you what line fails from a 0 based line number in case you are troubleshooting what character is causing issues.

public static void testCharset(String fileName) {
    SortedMap<String, Charset> charsets = Charset.availableCharsets();
    for (String k : charsets.keySet()) {
        int line = 0;
        boolean success = true;
        try (BufferedReader b = Files.newBufferedReader(Paths.get(fileName),charsets.get(k))) {
            while (b.ready()) {
                b.readLine();
                line++;
            }
        } catch (IOException e) {
            success = false;
            System.out.println(k+" failed on line "+line);
        }
        if (success) 
            System.out.println("*************************  Successs "+k);
    }
}

try this.. i had the same issue, below implementation worked for me

Reader reader = Files.newBufferedReader(Paths.get(<yourfilewithpath>), StandardCharsets.ISO_8859_1);

then use Reader where ever you want.

foreg:

CsvToBean<anyPojo> csvToBean = null;
    try {
        Reader reader = Files.newBufferedReader(Paths.get(csvFilePath), 
                        StandardCharsets.ISO_8859_1);
        csvToBean = new CsvToBeanBuilder(reader)
                .withType(anyPojo.class)
                .withIgnoreLeadingWhiteSpace(true)
                .withSkipLines(1)
                .build();

    } catch (IOException e) {
        e.printStackTrace();
    }

Well, the problem is that Files.newBufferedReader(Path path) is implemented like this :

public static BufferedReader newBufferedReader(Path path) throws IOException {
    return newBufferedReader(path, StandardCharsets.UTF_8);
}

so basically there is no point in specifying UTF-8 unless you want to be descriptive in your code. If you want to try a "broader" charset you could try with StandardCharsets.UTF_16 , but you can't be 100% sure to get every possible character anyway.

UTF-8适合我使用波兰语字符

Adding an additional answer for quarkus mailer and qute templates, as this is always the first result in google no matter what parts of the stack trace I searched for:

If you're using quarkus mailer and a qute template and get this MalformedInputException check if your templates folder contains other files than template files. In my case I had a .png file that I wanted to include in the mail and that was automatically read as template, therefore this encoding issue appeared.

you can try something like this, or just copy and past below piece.

boolean exception = true;
Charset charset = Charset.defaultCharset(); //Try the default one first.        
int index = 0;

while(exception) {
    try {
        lines = Files.readAllLines(f.toPath(),charset);
          for (String line: lines) {
              line= line.trim();
              if(line.contains(keyword))
                  values.add(line);
              }           
        //No exception, just returns
        exception = false; 
    } catch (IOException e) {
        exception = true;
        //Try the next charset
        if(index<Charset.availableCharsets().values().size())
            charset = (Charset) Charset.availableCharsets().values().toArray()[index];
        index ++;
    }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM