
Apache Commons CSV: ignore corrupted or invalid records in a CSV file and continue parsing

I am trying to parse an almost valid CSV file whose data is 99.9% correct. However, halfway through there are a couple of invalid records (too many quotes), e.g.:

a,b,"c",d 
a,b,""c""",d

My code

    try (Reader reader = new BufferedReader(new FileReader(file), BUFFERED_READER_SIZE);
         CSVParser csvParser = new CSVParser(reader, CSVFormat.EXCEL)
    ) {
        Iterator<CSVRecord> iterator = csvParser.iterator();
        CSVRecord record;
        while (iterator.hasNext()) {
            try {
                record = iterator.next();
            } catch (IllegalStateException e) {
                // invalid record: ignore it and keep going
            }
        }
    } catch (IOException e) {
        // handle the I/O failure
    }

How do I parse a CSV so that when it encounters an invalid row/record it just skips it and moves on to the next line?

I don't think you can do much to work around it. CSVParser is a final class and does not allow controlling the way it parses the underlying stream. However, it is sort of possible to work around it with a custom iterator that does the trick.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Objects;
import java.util.stream.Stream;

import javax.annotation.Nonnull;

import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public final class Csv {

    private Csv() {
    }

    public interface ICsvParserFactory {

        @Nonnull
        CSVParser createCsvParser(@Nonnull Reader lineReader);

    }

    public static Stream<CSVRecord> tryParseLinesLeniently(final BufferedReader bufferedReader, final ICsvParserFactory csvParserFactory) {
        return bufferedReader.lines()
                .map(line -> {
                    try {
                        // Parse each physical line as its own one-record CSV document.
                        return csvParserFactory.createCsvParser(new StringReader(line))
                                .iterator()
                                .next();
                    } catch ( final IllegalStateException ex ) {
                        // The line is not valid CSV: drop it.
                        return null;
                    }
                })
                .filter(Objects::nonNull)
                .onClose(() -> {
                    try {
                        bufferedReader.close();
                    } catch ( final IOException ex ) {
                        throw new RuntimeException(ex);
                    }
                });
    }

}

However, I don't think it's a good idea in any case:

  • It cannot return a CSVParser instance.
  • It might return an Iterator<CSVRecord> instead of Stream<CSVRecord> (and save the filter operation), but I just find streams simpler to work with.
  • It creates a new CSV parser for each line, so this method creates many objects for a CSV document that contains "too many" lines. The string reader could probably be made reusable.
  • The whole idea of the method is that it, not being a real CSV parser, assumes that each record occupies exactly one line (CSV does allow multi-line records inside quoted fields; see the sketch after the client code below), so it violates CSV parsing rules by design. It does not support headers yet (but can easily be improved).
The client code (assuming reader is a Reader over the CSV document):

final Csv.ICsvParserFactory csvParserFactory = lineReader -> {
    try {
        return new CSVParser(lineReader, CSVFormat.EXCEL);
    } catch ( final IOException ex ) {
        throw new RuntimeException(ex);
    }
};
try ( final Stream<CSVRecord> csvRecords = Csv.tryParseLinesLeniently(new BufferedReader(reader), csvParserFactory) ) {
    csvRecords.forEachOrdered(System.out::println);
}
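
To illustrate the multi-line caveat above: RFC 4180 allows line breaks inside quoted fields, so a single record may legitimately span several physical lines. The following is a minimal, self-contained sketch (the input string and class name are made up for illustration) showing that CSVFormat.EXCEL handles such a record when it sees the whole stream, whereas splitting the input by physical lines cannot:

import java.io.IOException;
import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public final class MultilineRecordDemo {

    public static void main(final String[] args) throws IOException {
        // A quoted field containing a line break: one logical record spanning two physical lines.
        final String csv = "a,\"multi\nline\",c\n";

        try (CSVParser parser = new CSVParser(new StringReader(csv), CSVFormat.EXCEL)) {
            for (final CSVRecord record : parser) {
                // Prints 3 once: the parser treats the input as a single three-field record.
                System.out.println(record.size());
            }
        }
        // BufferedReader.lines() would split the same input into "a,\"multi" and "line\",c",
        // so the per-line workaround above would mangle or reject a perfectly valid record.
    }
}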

If possible, please fix your CSV documents so that the parser consumes valid input rather than resorting to workarounds like this one.


Edit 1

There is an implementation flaw in the code above: ALL records returned from the stream now have their recordNumber set to 1.

Now I do believe the request cannot be fulfilled using the Apache Commons CSV parser: the only CSVRecord constructor is package-private, so a CSVRecord cannot be instantiated outside that package without resorting to reflection or placing code into its declaring package.

Sorry, you have to either fix your CSV documents or use another parser that can parse "more leniently".
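
If only the original line number matters, one possible workaround is to carry it alongside each record instead of trying to correct CSVRecord itself. The sketch below is not from the original answer: NumberedCsv and NumberedRecord are hypothetical names, it reuses the Csv.ICsvParserFactory interface from above, and it assumes Java 16+ for the record type and a sequential stream (which BufferedReader.lines() returns):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

import org.apache.commons.csv.CSVRecord;

public final class NumberedCsv {

    /** Hypothetical pair of a physical line number and the record parsed from that line. */
    public record NumberedRecord(long lineNumber, CSVRecord record) {}

    private NumberedCsv() {
    }

    public static Stream<NumberedRecord> tryParseLinesLeniently(final BufferedReader bufferedReader,
                                                                final Csv.ICsvParserFactory csvParserFactory) {
        // Counts physical lines; correct only because lines() yields a sequential, ordered stream.
        final AtomicLong lineNumber = new AtomicLong();
        return bufferedReader.lines()
                .map(line -> {
                    final long currentLine = lineNumber.incrementAndGet();
                    try {
                        final CSVRecord record = csvParserFactory.createCsvParser(new StringReader(line))
                                .iterator()
                                .next();
                        return new NumberedRecord(currentLine, record);
                    } catch ( final IllegalStateException ex ) {
                        // Invalid line: drop the record, but the counter has already advanced.
                        return null;
                    }
                })
                .filter(Objects::nonNull)
                .onClose(() -> {
                    try {
                        bufferedReader.close();
                    } catch ( final IOException ex ) {
                        throw new RuntimeException(ex);
                    }
                });
    }

}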

I am using Apache Commons CSV version 1.9.0 and I am able to continue retrieving rows after the invalid rows by simply "absorbing" the exception and continuing. Keep in mind that the hasNext() method actually pre-fetches the next row, so it can throw the IllegalStateException just like the next() method.

If you absorb the exception, the next CSVRecord retrieved will be a mangled version of the invalid row, so you will want to skip it. I cannot post my code as it is the IP of my employer, but hopefully this helps.
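
Since the answer's code cannot be shared, the sketch below only illustrates the described approach; LenientCsvReader and process are hypothetical names, and it assumes (as stated above) that the first record retrieved after an absorbed exception is the mangled remainder of the bad row and should be discarded:

import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public final class LenientCsvReader {

    private LenientCsvReader() {
    }

    public static void readSkippingInvalidRows(final Reader reader) throws IOException {
        try (CSVParser parser = new CSVParser(reader, CSVFormat.EXCEL)) {
            final Iterator<CSVRecord> iterator = parser.iterator();
            boolean skipNextRecord = false;
            while (true) {
                try {
                    // hasNext() pre-fetches the next row, so it can throw IllegalStateException too.
                    if (!iterator.hasNext()) {
                        break;
                    }
                    final CSVRecord record = iterator.next();
                    if (skipNextRecord) {
                        // The first record after an absorbed exception is the mangled
                        // remainder of the invalid row: drop it and carry on.
                        skipNextRecord = false;
                        continue;
                    }
                    process(record);
                } catch (final IllegalStateException ex) {
                    // Invalid row encountered: absorb the exception and flag the follow-up record.
                    skipNextRecord = true;
                }
            }
        }
    }

    private static void process(final CSVRecord record) {
        System.out.println(record);
    }
}

This relies on the recovery behaviour the answer reports for version 1.9.0; other versions may behave differently.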
