Simplest way to check if a List<CSVRecord> has duplicates in Java

I want to create a function that returns true if the passed-in List of CSVRecord objects (from org.apache.commons.csv) contains at least one duplicate record.

public static boolean hasDuplicate(List<CSVRecord> csvRecords){
    //...
}

So far, I've tried building a Set from the input, hoping to get a collection without duplicates whose size I could compare with the size of the original List, but the Set did not remove the duplicates.

    Set<CSVRecord> unique = new HashSet<CSVRecord>(csvRecords);

This is because each CSVRecord has a unique recordNumber for its row in the .csv, but I only care about comparing values. So below I would consider records 2 and 3 to be duplicates, and the function should return true.

CSVRecord [comment='null', recordNumber=1, values=[BOB, JACKSON]]
CSVRecord [comment='null', recordNumber=2, values=[JANE, DOE]]
CSVRecord [comment='null', recordNumber=3, values=[JANE, DOE]]

Is there an efficient way to check if the List has duplicates based on values without iterating over each entry, grabbing the values, and storing them? I want the method to be able to work on any CSVRecord, regardless of the shape of the values list.

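  // requires: import java.util.stream.Collectors;
  public boolean hasDuplicate(List<CSVRecord> csvRecords){
    // fewer distinct strings than records means at least one duplicate
    return csvRecords.size() != csvRecords.stream().map(this::stringify).collect(Collectors.toSet()).size();
  }

  // note: toMap() is empty unless the parser was configured with a header
  private String stringify(CSVRecord item) {
    return String.join(",", item.toMap().values());
  }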
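One caveat with joining the values into a single string is that two different records can produce the same key when a field itself contains the delimiter. The sketch below illustrates this with plain lists of strings standing in for the record values; the class and method names are hypothetical and not part of Commons CSV.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; plain lists stand in for CSVRecord values.
class JoinCollisionDemo {
    // Mirrors the joined-string key used in stringify above.
    static String joinedKey(List<String> values) {
        return String.join(",", values);
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("JANE,DOE", "X"); // first field contains a comma
        List<String> b = Arrays.asList("JANE", "DOE,X"); // second field contains a comma

        // Both lists join to "JANE,DOE,X", so the string keys collide.
        System.out.println(joinedKey(a).equals(joinedKey(b))); // true

        // Comparing the lists element by element keeps them distinct.
        System.out.println(a.equals(b)); // false
    }
}
```

Using a List<String> as the set element avoids the problem, because List.equals() compares the fields one by one.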

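Putting the whole list into a set and comparing the sizes does more work than necessary, because every record is processed even when a duplicate appears early. It is better to add the records one by one and return as soon as the first duplicate is found: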

public static boolean hasDuplicate(List<CSVRecord> csvRecords) {
  Set<List<String>> set = new HashSet<>();
  for (CSVRecord rec : csvRecords) {
    List<String> lst = new ArrayList<>();
    for (String str : rec) lst.add(str);
    if (!set.add(lst)) return true; // add returns false if the entry is present
  }
  return false;
}
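The same early exit can also be sketched with a stream, since allMatch stops evaluating at the first element that fails the predicate. The example below is only an illustration under the assumption that each record has already been converted to a List<String> (as the loop above does); the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; each inner list stands in for one record's values.
class EarlyExitDemo {
    static boolean hasDuplicate(List<List<String>> rows) {
        Set<List<String>> seen = new HashSet<>();
        // seen.add returns false for a row that is already in the set;
        // allMatch short-circuits on that row, so later rows are not visited.
        return !rows.stream().allMatch(seen::add);
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("BOB", "JACKSON"),
                Arrays.asList("JANE", "DOE"),
                Arrays.asList("JANE", "DOE"));
        System.out.println(hasDuplicate(rows)); // true
    }
}
```

The predicate is stateful, so this form is only appropriate for a sequential stream; the plain loop is arguably easier to read.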

Unfortunately, the String array inside the CSVRecord class is not publicly accessible, so we either copy it via the iterator or access it directly through reflection. There is also a third, most efficient way: if our class is declared in the same package as CSVRecord (org.apache.commons.csv), its package-private method values() becomes available. We then wrap the returned array with Arrays.asList() (which does not copy the contents, it only holds a reference to the original array) so that we can rely on the returned list's equals() and hashCode() methods and therefore put it into the set:

package org.apache.commons.csv;

import java.util.*;

public class CSVRecordUtils {
  public static boolean hasDuplicate(List<CSVRecord> csvRecords) {
    Set<List<String>> set = new HashSet<>();
    for (CSVRecord rec : csvRecords) {
      if (!set.add(Arrays.asList(rec.values()))) {
        return true;
      }
    }
    return false;
  }
}
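To see why the array is wrapped rather than added to the set directly, here is a small sketch that uses plain String arrays as stand-ins for the values of two records. The class and method names are hypothetical. Arrays inherit identity-based equals() and hashCode() from Object, while the list view returned by Arrays.asList() compares by content.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; each String[] stands in for one record's values.
class ArrayViewDemo {
    // Raw arrays: every array object is distinct, whatever it contains.
    static int distinctArrays(String[][] rows) {
        Set<String[]> set = new HashSet<>(Arrays.asList(rows));
        return set.size();
    }

    // List views: equal contents collapse to a single set entry.
    static int distinctViews(String[][] rows) {
        Set<List<String>> set = new HashSet<>();
        for (String[] row : rows) {
            set.add(Arrays.asList(row)); // wraps the array without copying it
        }
        return set.size();
    }

    public static void main(String[] args) {
        String[][] rows = { { "JANE", "DOE" }, { "JANE", "DOE" } };
        System.out.println(distinctArrays(rows)); // 2
        System.out.println(distinctViews(rows));  // 1
    }
}
```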

A further option compares the header-to-value maps returned by toMap():

public static boolean hasDuplicate(List<CSVRecord> csvRecords){
    // set of header-to-value maps, one entry per distinct record
    // (toMap() is empty unless the parser was configured with a header)
    Set<Map<String, String>> unique = new HashSet<>();
    for(CSVRecord record : csvRecords){
        // if an equal map already exists it won't be added
        unique.add(record.toMap());
    }
    // fewer unique maps than records means at least one duplicate
    return csvRecords.size() != unique.size();
}
