
Wrong file encoding when parsing a CSV file

I have a problem parsing a file. The input file is encoded as Windows-1250 (EE). When I try to parse it, I get this error:


    Exception in thread "main" java.lang.IllegalStateException: MalformedInputException reading next record: java.nio.charset.MalformedInputException: Input length = 1
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145)
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155)
        at com.test.converter.CsvConverter.processInputCSV(CsvConverter.java:148)
        at com.test.converter.CsvConverter.main(CsvConverter.java:249)
    Caused by: java.nio.charset.MalformedInputException: Input length = 1
        at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
    Caused by: java.nio.charset.MalformedInputException: Input length =

My method:

    public List<CSVRecord> collectAllEntries(Path path) throws IOException {
        List<CSVRecord> store = new ArrayList<>();
        try (
            Reader reader = Files.newBufferedReader(path);
            CSVParser csvParser = new CSVParser(reader, CSVFormat.EXCEL)
        ) {
            for (CSVRecord csvRecord : csvParser) {
                store.add(csvRecord);
            }
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        }
        return store;
    }

How can I fix this problem?

The problem is that you are reading a windows-1250 encoded file as UTF-8. Files.newBufferedReader(path) always decodes as UTF-8, and bytes that are valid in windows-1250 are often not valid UTF-8, hence the MalformedInputException.

When you open the file, pass the charset it was written in (windows-1250 in this case) so that the buffered reader decodes with it:

    Files.newBufferedReader(path, Charset.forName("windows-1250"));
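To see the difference in isolation, here is a minimal sketch that uses only the JDK (no Commons CSV). The class name `Windows1250ReadDemo`, its helper methods, and the sample text are made up for illustration; the only calls taken from the question and answer are `Files.newBufferedReader` and `Charset.forName("windows-1250")`. It writes a few Polish characters as windows-1250 bytes to a temporary file, then reads the file back once as UTF-8 and once as windows-1250.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of the original question.
class Windows1250ReadDemo {
    static final Charset WINDOWS_1250 = Charset.forName("windows-1250");

    // Reads the first line of the file, decoding its bytes with the given charset.
    static String readFirstLine(Path path, Charset charset) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            return reader.readLine();
        }
    }

    // Returns true if decoding the file with the given charset hits malformed input.
    static boolean failsToDecode(Path path, Charset charset) throws IOException {
        try {
            readFirstLine(path, charset);
            return false;
        } catch (MalformedInputException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path sample = Files.createTempFile("sample-1250", ".csv");
        try {
            // Two Polish city names with diacritics, stored as windows-1250 bytes.
            String text = "\u0141\u00f3d\u017a;Krak\u00f3w";
            Files.write(sample, text.getBytes(WINDOWS_1250));

            System.out.println("UTF-8 fails: " + failsToDecode(sample, StandardCharsets.UTF_8));
            System.out.println("windows-1250 reads: " + readFirstLine(sample, WINDOWS_1250));
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

In the method from the question, the same change should only touch the line that creates the reader; the CSVParser itself does not need to know about the encoding because it receives already-decoded characters.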

A good introduction to character encoding in Java: https://www.baeldung.com/java-char-encoding

You have a BufferedReader using the default encoding, which for Files.newBufferedReader(path) is UTF-8.

You have a file that you have said is encoded as code page 1250.

That mismatch is the reason. Your BufferedReader needs to be told which encoding to expect.

Use the two-argument form of newBufferedReader, the one that takes a Charset as the second argument. I'm not sure of the exact charset name for CP 1250, but that should be easy to find out.
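One way to find the name is to ask the JVM which charsets it has installed. The following is a small sketch, with a made-up class `CharsetLookupDemo` and helper `findCharsets`, that searches canonical names and aliases for a fragment such as "1250". Which charsets and aliases show up depends on the JDK in use.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for looking up charset names.
class CharsetLookupDemo {
    // Returns canonical names of installed charsets whose name or any alias contains the fragment.
    static List<String> findCharsets(String fragment) {
        String needle = fragment.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (Charset cs : Charset.availableCharsets().values()) {
            boolean hit = cs.name().toLowerCase(Locale.ROOT).contains(needle);
            for (String alias : cs.aliases()) {
                hit |= alias.toLowerCase(Locale.ROOT).contains(needle);
            }
            if (hit) {
                matches.add(cs.name());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Print each matching charset together with the aliases this JDK accepts for it.
        for (String name : findCharsets("1250")) {
            System.out.println(name + " aliases=" + Charset.forName(name).aliases());
        }
    }
}
```

Any name or alias printed this way can be passed to Charset.forName and then to the two-argument newBufferedReader.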

