简体   繁体   中英

readNext() function of CSVReader not looping through all rows of csv [EDIT: How to handle erroneous CSV (remove unescaped quotes)]

// Reader for the input CSV: ',' as separator, '"' as quote character.
FileReader fr = new FileReader(inp);
CSVReader reader = new CSVReader(fr, ',', '"');

// writer
File writtenFromWhile = new File(dliRootPath + writtenFromWhilePath);
writtenFromWhile.createNewFile();
CSVWriter writeFromWhile = new CSVWriter(new FileWriter(writtenFromWhile), ',', '"');

// Copy every CSV record to the output, counting loop iterations.
int insideWhile = 0;
String[] currRow = null;
while ((currRow = reader.readNext()) != null) {
    insideWhile++;
    writeFromWhile.writeNext(currRow);
}
// NOTE(review): getLinesRead() counts physical lines consumed from the
// underlying Reader, NOT CSV records. A record containing embedded
// newlines inside quoted cells spans several physical lines, so
// getLinesRead() can legitimately exceed the number of loop iterations.
System.out.println("inside While: " + insideWhile);
System.out.println("lines read (acc.to CSV reader): " + reader.getLinesRead());

The output is:

inside While: 162199
lines read (acc.to CSV reader): 256865

Even though all lines are written to the output CSV (when viewed in a text editor; Excel shows a much smaller number of rows), the while loop does not iterate the same number of times as there are rows in the input CSV. My main objective is to implement some other logic inside the while loop for each line. I have been trying to debug for two whole days (in a bigger code base) without any results.

Please explain how I can make the while loop run 256865 times, i.e. once per row of the input CSV.


Reference data, complete picture :

Here is the CSV I am reading in the above snippet.

My complete program tries to separate out those records from this CSV which are not present in this CSV, based on the fields title and author (i.e. if author and title are the same in 2 records, even if other fields are different, they are counted as duplicates and should not be written to the output file). Here is my complete code (the difference should be around 300000, but I get only ~210000 in the output file with my code):

//TODO ask id
/*(*
 * id is also among the fields getting matched (thisRow[0] is the id);
 * you can replace it with thisRow[fieldAndColumn.get(0)] to eliminate the id
 */

package mainOne;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

/**
 * Separates records of the "new" DLI list that are not present in the "old"
 * DLI list. Two records are considered duplicates when all of the configured
 * fields (author, title) match; duplicates are NOT written to the output.
 *
 * <p>Input CSVs are pipe-delimited for the new list and comma-delimited for
 * the splitted old-list files. Records are keyed by their first column (id).
 */
public class Diff_V3 {
    // Root directory containing all input/output files.
    static String dliRootPath = "/home/gurnoor/Incoming/Untitled Folder 2/";
    static String dli = "new-dli-IITG.csv";      // the "new" list (pipe-delimited)
    static String oldDli = "dli-iisc.csv";       // the "old" list
    static String newFile = "newSampleFile.csv"; // not used
    static String unqFile = "UniqueFileFinal.csv"; // records unique to the new list
    static String log = "Diff_V3_log.txt";
    static String splittedNewDliDir = "/home/gurnoor/Incoming/Untitled Folder 2/splitted new file";
    static String splittedOldDliDir = "/home/gurnoor/Incoming/Untitled Folder 2/splitted old file";

    // debug
    static String testFilePath = "testFile.csv";
    static int insidepopulateMapFromSplittedCSV = 0;

    public static void main(String[] args) throws IOException, CustomException {
        File logFile = new File(dliRootPath + log);
        logFile.createNewFile();
        new File(dliRootPath + testFilePath).createNewFile();

        // Records are considered duplicates when BOTH of these fields match.
        List<String> fieldsToBeMatched = new ArrayList<>();
        fieldsToBeMatched.add("dc.contributor.author[]");
        fieldsToBeMatched.add("dc.title[]");
        filterUniqueFileds(new File(splittedNewDliDir), new File(splittedOldDliDir), fieldsToBeMatched);
    }

    /**
     * Writes to {@code unqFile} every record of the new list that is not a
     * duplicate of an old-list record. A record is a duplicate when its id is
     * present in the old map AND every field in {@code fieldsToBeMatched} is
     * equal. Rows whose column count differs from the heading row are skipped
     * and logged for manual inspection.
     *
     * NOTE: might remove the row where fieldToBeMatched is null.
     *
     * @param newDir  directory of the splitted new list (currently unused; the
     *                new list is read directly from dliRootPath + dli — TODO confirm)
     * @param oldDir  directory containing the splitted old-list CSV files
     * @param fieldsToBeMatched column headings used for duplicate detection
     * @throws IOException     on any file error
     * @throws CustomException when a matched field is missing from a heading row
     */
    private static void filterUniqueFileds(File newDir, File oldDir, List<String> fieldsToBeMatched)
            throws IOException, CustomException {

        CSVReader reader = new CSVReader(new FileReader(new File(dliRootPath + dli)), '|');
        // writer for the unique records
        File unqFileOp = new File(dliRootPath + unqFile);
        unqFileOp.createNewFile();
        CSVWriter writer = new CSVWriter(new FileWriter(unqFileOp), '|');

        // logWriter for rows that need manual inspection
        BufferedWriter logWriter = new BufferedWriter(new FileWriter(new File(dliRootPath + log)));

        String[] headingRow = reader.readNext();
        writer.writeNext(headingRow);
        int headingLen = headingRow.length;

        // old list: keyed by id; the old heading row is stored under key "id"
        // (its first cell) by populateMapFromCSV().
        System.out.println("[INFO] reading old list...");
        Map<String, List<String>> oldMap = new HashMap<>();
        oldMap = populateMapFromSplittedCSV(oldMap, oldDir);
        System.out.println("[INFO] Read old List. Size = " + oldMap.size());
        printMapToCSV(oldMap, dliRootPath + testFilePath);

        // map of fieldName -> column index in each heading row.
        // BUGFIX: (String[]) List.toArray() throws ClassCastException because
        // toArray() returns Object[]; use toArray(new String[0]) instead.
        Map<String, Integer> fieldAndColumnNoInNew = new HashMap<>(getColumnNo(fieldsToBeMatched, headingRow));
        Map<String, Integer> fieldAndColumnNoInOld = new HashMap<>(
                getColumnNo(fieldsToBeMatched, oldMap.get("id").toArray(new String[0])));
        // error check: did columnNo get populated?
        if (fieldAndColumnNoInNew.isEmpty() || fieldAndColumnNoInOld.isEmpty()) {
            reader.close();
            writer.close();
            logWriter.close();
            throw new CustomException("field to be matched not present in input CSV");
        }

        int noOfRecordsInOldList = 0, noOfRecordsWritten = 0, checkManually = 0;
        String[] thisRow;
        while ((thisRow = reader.readNext()) != null) {

            // Skip malformed rows (wrong column count) and log them.
            if (thisRow.length != headingLen) {
                String error = "Line no: " + reader.getLinesRead() + " in file: " + dliRootPath + dli
                        + " not read. Check manually";

                System.err.println(error);
                logWriter.append(error + "\n");
                logWriter.flush();
                checkManually++;
                continue;
            }

            // Unknown id -> definitely unique, write it out.
            if (!oldMap.containsKey(thisRow[0])) {
                writer.writeNext(thisRow);
                writer.flush();
                noOfRecordsWritten++;
            } else {
                // Same id exists in the old list: only a duplicate if every
                // configured field also matches.
                List<String> twinRow = oldMap.get(thisRow[0]);
                boolean writtenToOp = false;
                List<String> newFields = new ArrayList<>(fieldAndColumnNoInNew.keySet());
                List<String> oldFields = new ArrayList<>(fieldAndColumnNoInOld.keySet());
                // sanity check: both maps were built from the same field list
                if (newFields.size() != oldFields.size()) {
                    reader.close();
                    writer.close();
                    logWriter.close();
                    throw new CustomException("something is really wrong");
                }
                for (int m = 0; m < newFields.size(); m++) {
                    int columnInNew = fieldAndColumnNoInNew.get(newFields.get(m)).intValue();
                    int columnInOld = fieldAndColumnNoInOld.get(oldFields.get(m)).intValue();
                    String currFieldTwin = twinRow.get(columnInOld);
                    String currField = thisRow[columnInNew];
                    if (!areEqual(currField, currFieldTwin)) {
                        // At least one field differs -> not a duplicate.
                        writer.writeNext(thisRow);
                        writer.flush();
                        writtenToOp = true;
                        noOfRecordsWritten++;
                        System.out.println(noOfRecordsWritten);
                        break;
                    }
                }
                if (!writtenToOp) {
                    noOfRecordsInOldList++;
                }
            }
        }
        System.out.println("--------------------------------------------------------\nDebug info");
        System.out.println("old File: " + oldMap.size());
        System.out.println("new File:" + reader.getLinesRead());

        System.out.println("no of records in old list (present in both old and new) = " + noOfRecordsInOldList);
        System.out.println("checkManually: " + checkManually);
        System.out.println("noOfRecordsInOldList+checkManually = " + (noOfRecordsInOldList + checkManually));
        System.out.println("no of records written = " + noOfRecordsWritten);
        System.out.println();
        System.out.println("inside populateMapFromSplittedCSV() " + insidepopulateMapFromSplittedCSV + "times");

        logWriter.close();
        reader.close();
        writer.close();
    }

    /**
     * Dumps {@code oldMap}'s rows to a pipe-delimited CSV for debugging.
     *
     * @param oldMap        id -> full row
     * @param testFilePath2 output path
     */
    private static void printMapToCSV(Map<String, List<String>> oldMap, String testFilePath2) throws IOException {
        int i = 0;
        CSVWriter writer = new CSVWriter(new FileWriter(new File(testFilePath2)), '|');
        try {
            for (String key : oldMap.keySet()) {
                List<String> row = oldMap.get(key);
                String[] tempRow = row.toArray(new String[row.size()]);
                writer.writeNext(tempRow);
                writer.flush();
                i++;
            }
        } finally {
            writer.close();
        }
        System.out.println("[hello from line 210 ( inside printMapToCSV() ) of ur code] wrote " + i + " lines");
    }

    /**
     * Reads every CSV file in {@code oldDir} into {@code oldMap}, keyed by the
     * first column (record id). The heading row of the first file ends up
     * under the key "id". Defective rows are written to defectiveOldFiles.csv.
     */
    private static Map<String, List<String>> populateMapFromSplittedCSV(Map<String, List<String>> oldMap, File oldDir)
            throws IOException {

        File defective = new File(dliRootPath + "defectiveOldFiles.csv");
        defective.createNewFile();
        CSVWriter defectWriter = new CSVWriter(new FileWriter(defective));

        // BUGFIX: listFiles() returns null for a non-directory; fail clearly
        // instead of with a NullPointerException.
        File[] oldFiles = oldDir.listFiles();
        if (oldFiles == null) {
            defectWriter.close();
            throw new IOException("Not a readable directory: " + oldDir);
        }
        for (File oldFile : oldFiles) {
            insidepopulateMapFromSplittedCSV++;
            CSVReader reader = new CSVReader(new FileReader(oldFile), ',', '"');
            try {
                oldMap = populateMapFromCSV(oldMap, reader, defectWriter);
            } finally {
                reader.close();
            }
            System.out.println(oldMap.size());
        }
        defectWriter.close();
        System.out.println("inside populateMapFromSplittedCSV() " + insidepopulateMapFromSplittedCSV + "times");
        return new HashMap<String, List<String>>(oldMap);
    }

    /**
     * Maps each field name in {@code fieldsToBeMatched} to the index of the
     * matching heading cell. Fields with no matching heading are omitted.
     */
    private static Map<String, Integer> getColumnNo(List<String> fieldsToBeMatched, String[] headingRow) {
        Map<String, Integer> fieldAndColumnNo = new HashMap<>();
        for (String field : fieldsToBeMatched) {
            for (int i = 0; i < headingRow.length; i++) {
                String heading = headingRow[i];
                if (areEqual(field, heading)) {
                    fieldAndColumnNo.put(field, Integer.valueOf(i));
                    break;
                }
            }
        }
        return fieldAndColumnNo;
    }

    /**
     * Reads all records from {@code oldReader} into {@code oldMap}, keyed by
     * the first column. The heading row (whose first cell is "id") is stored
     * like any other row, under the key "id". Duplicate ids whose rows differ
     * are logged and skipped.
     */
    private static Map<String, List<String>> populateMapFromCSV(Map<String, List<String>> oldMap, CSVReader oldReader,
            CSVWriter defectWriter) throws IOException {
        // BUGFIX: the original checked oldReader.getLinesRead() > 1, which is
        // always 0 on a freshly opened reader, so headingRow stayed null and
        // the duplicate-id branch below could NPE. The heading row (if any)
        // lives in the map under the key "id" from a previous file.
        List<String> headingRow = oldMap.get("id"); // null on the very first file

        String[] thisRow;
        int insideWhile = 0, addedInMap = 0, doesNotContainKey = 0, containsKey = 0;
        while ((thisRow = oldReader.readNext()) != null) {

            insideWhile++;
            if (!oldMap.containsKey(thisRow[0])) {
                doesNotContainKey++;
                List<String> fullRow = Arrays.asList(thisRow);
                fullRow = oldMap.put(thisRow[0], fullRow);
                if (fullRow == null) {
                    addedInMap++;
                }
            } else {
                // Same id seen before: if any field differs it is a
                // conflicting duplicate -> log and skip.
                List<String> twinRow = oldMap.get(thisRow[0]);
                // BUGFIX: bound the comparison by both rows' lengths instead
                // of a possibly-null headingRow.
                for (int m = 0; m < twinRow.size() && m < thisRow.length; m++) {
                    String currFieldTwin = twinRow.get(m);
                    String currField = thisRow[m];
                    if (!areEqual(currField, currFieldTwin)) {
                        System.err.println("do something!!!!!!  DUPLICATE ID in old file");
                        containsKey++;
                        // BUGFIX: open the log in append mode — the original
                        // truncated the whole log on every duplicate.
                        FileWriter logWriter = new FileWriter(dliRootPath + log, true);
                        System.err.println("[Skipped record] in old file. Row no: " + oldReader.getLinesRead()
                                + "\nRecord: " + Arrays.toString(thisRow));
                        logWriter.append("[Skipped record] in old file. Row no: " + oldReader.getLinesRead()
                                + "\nRecord: " + Arrays.toString(thisRow));
                        logWriter.close();
                        break;
                    }
                }
            }
        }
        System.out.println("inside while:      " + insideWhile);
        System.out.println("oldMap size =      " + oldMap.size());
        System.out.println("addedInMap:        " + addedInMap);
        System.out.println("doesNotContainKey: " + doesNotContainKey);
        System.out.println("containsKey:       " + containsKey);

        return new HashMap<String, List<String>>(oldMap);
    }

    /** Null-safe, whitespace-insensitive string equality. */
    private static boolean areEqual(String field, String heading) {
        if (field == null || heading == null) {
            return field == heading; // both null -> equal; one null -> not
        }
        return field.trim().equals(heading.trim());
    }

    /**
     * Returns the first duplicate (or null/empty) ID, OR the string "unique",
     * OR (rarely) "totalLinesInCSV != totaluniqueIDs".
     *
     * NOTE(review): inpCSV is ignored; the method always reads
     * dliRootPath + dli — confirm before relying on the parameter.
     *
     * @param inpCSV          input path (currently unused)
     * @param totalLinesInCSV expected number of lines, for the final check
     */
    private static String areIDsunique(String inpCSV, int totalLinesInCSV) throws IOException {
        CSVReader reader = new CSVReader(new FileReader(new File(dliRootPath + dli)), '|');
        List<String[]> allRows = new ArrayList<>(reader.readAll());
        reader.close();
        Set<String> id = new HashSet<>();
        for (String[] thisRow : allRows) {
            // BUGFIX: the original condition was inverted (non-negated checks
            // joined with ||), so it returned the very first row's id
            // unconditionally. A problem exists when the id is null/empty or
            // Set.add() reports it was already present.
            if (thisRow[0] == null || thisRow[0].isEmpty() || !id.add(thisRow[0])) {
                return thisRow[0];
            }
        }
        if (id.size() == totalLinesInCSV) {
            return "unique";
        } else {
            return "totalLinesInCSV != totaluniqueIDs";
        }
    }

    /**
     * Writes the first 20 rows of the input CSV into the output file.
     *
     * NOTE(review): parameters are ignored; reads dliRootPath + dli and
     * writes dliRootPath + newFile — confirm before relying on them.
     */
    public static void _readSample(String input, String output) throws IOException {
        File opFile = new File(dliRootPath + newFile);
        opFile.createNewFile();
        CSVWriter writer = new CSVWriter(new FileWriter(opFile));

        CSVReader reader = new CSVReader(new FileReader(new File(dliRootPath + dli)), '|');
        for (int i = 0; i < 20; i++) {
            String[] row = reader.readNext();
            // BUGFIX: stop when the input has fewer than 20 rows instead of
            // passing null to writeNext().
            if (row == null) {
                break;
            }
            writer.writeNext(row);
        }
        reader.close();
        writer.flush();
        writer.close();
    }

}

RC's comment nailed it!

If you check the java docs you will see that there are two methods in the CSVReader: getLinesRead and getRecordsRead. And they both do exactly what they say. getLinesRead returns the number of lines that was read using the FileReader. getRecordsRead returns the number of records that the CSVReader read. Keep in mind that if you have embedded new lines in the records of your file then it will take multiple line reads to get one record. So it is very conceivable to have a csv file with 100 records but taking 200 line reads to read them all.

Unescaped quotes inside a CSV cell can mess up your whole data. This might happen in a CSV if the data you are working with has been created manually. Below is a function I wrote a while back for this situation. Let me know if this is not the right place to share it.

/**
 * removes quotes inside a cell/column puts curated data in
 * "../CuratedFiles"
 * 
 * @param curateDir
 * @param del Csv column delimiter
 * @throws IOException
 */
/**
 * Removes stray (unescaped) quotes inside a cell/column and puts curated data
 * in "../CuratedFiles". A cell boundary is assumed to be the two-character
 * sequence quote+delimiter ("\"" + del); any extra quote found inside a cell
 * is stripped.
 *
 * @param curateDir directory whose CSV files should be curated
 * @param del       Csv column delimiter
 * @throws IOException on any file error
 */
public static void curateCsvRowQuotes(File curateDir, String del) throws IOException {
    File parent = curateDir.getParentFile();
    File curatedDir = new File(parent.getAbsolutePath() + "/CuratedFiles");
    curatedDir.mkdir();

    // BUGFIX: the log writer used to be re-created (truncating the log) once
    // per input file and was never closed; open it once for the whole run.
    File logFile = new File(curatedDir.getAbsolutePath() + "/CurationLogs.txt");
    logFile.createNewFile();
    BufferedWriter logWriter = new BufferedWriter(new FileWriter(logFile));

    for (File file : curateDir.listFiles()) {
        BufferedReader bufRead = new BufferedReader(new FileReader(file));

        // output file mirrors the input file's name
        File fOp = new File(curatedDir.getAbsolutePath() + "/" + file.getName());
        fOp.createNewFile();
        BufferedWriter bufW = new BufferedWriter(new FileWriter(fOp));

        // heading row is copied through untouched
        String heading = bufRead.readLine();
        if (heading != null) { // BUGFIX: don't write "null" for empty files
            bufW.append(heading + "\n");
        }

        String thisLine = null;
        int lineCount = 0;
        while ((thisLine = bufRead.readLine()) != null) {

            String opLine = "";
            int endIndex = thisLine.indexOf("\"" + del);
            if (endIndex == (-1)) {
                // BUGFIX: substring(0, -1) used to throw
                // StringIndexOutOfBoundsException for lines containing no
                // quote+delimiter at all; pass such lines through untouched.
                bufW.append(thisLine + "\n");
                bufW.flush();
                lineCount++;
                continue;
            }
            String str = thisLine.substring(0, endIndex);
            opLine += str + "\"" + del;
            while (endIndex != (-1)) {
                // leave out first " in a cell
                int tempIndex = thisLine.indexOf("\"" + del, endIndex + 2);
                if (tempIndex == (-1)) {
                    break;
                }
                str = thisLine.substring(endIndex + 2, tempIndex);
                int indexOfQuote = str.indexOf("\"");
                opLine += str.substring(0, indexOfQuote + 1);

                // remove all remaining " inside the cell
                str = str.substring(indexOfQuote + 1);
                str = str.replace("\"", "");
                opLine += str + "\"" + del;
                endIndex = thisLine.indexOf("\"" + del, endIndex + 2);
            }
            // keep any run of trailing (empty-cell) delimiters
            str = thisLine.substring(thisLine.lastIndexOf("\"" + del) + 2);
            if (str.matches("[" + del + "]+")) { // substring never returns null
                opLine += str;
            }

            System.out.println(opLine);
            bufW.append(opLine + "\n");
            bufW.flush();
            lineCount++;
        }
        System.out.println(lineCount + " no of lines  in " + file.getName());
        bufRead.close();
        bufW.close();
    }
    logWriter.close();
}

In my case, I've used csvReader.readAll() before the readNext().

Like

 List<String[]> myData =csvReader.readAll();


            while ((nextRecord = csvReader.readNext()) != null) {
}

So my csvReader.readNext() always returned null, because all the records had already been consumed by readAll() into myData — a CSVReader is a forward-only stream and cannot be rewound.

Please be cautious when combining readNext() and readAll() on the same reader.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM