
Finding duplicates in List<string>

In a list with some hundred thousand entries, how does one go about comparing each entry with the rest of the list for duplicates? For example, List<string> fileNames contains both "00012345.pdf" and "12345.pdf", and these are considered duplicates. What is the best strategy for flagging this kind of duplicate?

Thanks

Update: The naming of files is restricted to numbers. They are padded with zeros. Duplicates are where the padding is missing. Thus, "123.pdf" & "000123.pdf" are duplicates.

You probably want to implement your own equality comparer that treats two filenames as equal when their numeric parts have the same value.

This isn't necessarily optimised, but it will work. You could also consider using Parallel LINQ (PLINQ) if you are using .NET 4.0.

EDIT: Answer updated to reflect refined question after it was edited

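// Needs System.Collections.Generic, System.Linq and System.Text.RegularExpressions.
void Main()
{
    List<string> stringList = new List<string> { "00012345.pdf", "12345.pdf", "notaduplicate.jpg", "3453456363234.jpg" };

    IEqualityComparer<string> comparer = new NumericFilenameEqualityComparer();

    // Group by numeric value; any group with more than one member is a set of duplicates.
    var duplicates = stringList.GroupBy(s => s, comparer).Where(grp => grp.Count() > 1);

    // do something with grouped duplicates...
}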

// Not safe for nulls!
// NB: do your own parameter / null checks / string-case options etc.!
public class NumericFilenameEqualityComparer : IEqualityComparer<string> {

   private static readonly Regex digitFilenameRegex = new Regex(@"\d+", RegexOptions.Compiled);

   // Numeric value of the first run of digits in the name.
   // Names without digits all map to long.MaxValue, so they compare equal to each other.
   private static long NumericValue(string value) {
        Match digitsMatch = digitFilenameRegex.Match(value);
        return digitsMatch.Success ? long.Parse(digitsMatch.Value) : long.MaxValue;
   }

   public bool Equals(string left, string right) {
        return NumericValue(left) == NumericValue(right);
   }

   public int GetHashCode(string value) {
        // Hash the parsed number so that equal names share a hash code.
        // base.GetHashCode() would hash the comparer itself and put every name in one bucket.
        return NumericValue(value).GetHashCode();
   }

}
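
As a rough illustration of the Parallel LINQ suggestion, the sketch below groups on a numeric key instead of using a custom comparer. The names ParallelDuplicateFinder, NumericKey and FindDuplicates are made up for this example, and it assumes that names which are not purely digits (or do not fit in a long) should simply be skipped. The `out long` syntax needs C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public static class ParallelDuplicateFinder
{
    // Hypothetical helper: numeric value of the name without its extension,
    // or null when the name is not purely digits or overflows a long.
    public static long? NumericKey(string fileName)
    {
        string stem = Path.GetFileNameWithoutExtension(fileName);
        return long.TryParse(stem, NumberStyles.None, CultureInfo.InvariantCulture, out long value)
            ? value
            : (long?)null;
    }

    // Returns one list per numeric value that occurs more than once.
    // The order of the groups is not guaranteed, because the query runs in parallel.
    public static List<List<string>> FindDuplicates(IEnumerable<string> fileNames)
    {
        return fileNames
            .AsParallel()
            .Select(name => new { Name = name, Key = NumericKey(name) })
            .Where(x => x.Key.HasValue)
            .GroupBy(x => x.Key.Value, x => x.Name)
            .Where(g => g.Count() > 1)
            .Select(g => g.OrderBy(n => n, StringComparer.Ordinal).ToList())
            .ToList();
    }

    public static void Main()
    {
        var names = new List<string> { "00012345.pdf", "12345.pdf", "notaduplicate.jpg", "3453456363234.jpg" };
        foreach (var group in FindDuplicates(names))
            Console.WriteLine(string.Join(", ", group));
    }
}
```

Whether the parallel version is actually faster for a few hundred thousand short strings is worth measuring; the per-item work is small, so the overhead of partitioning may dominate.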

I understand you are looking for duplicates in order to remove them?

One way to go about it could be the following:

Create a class MyString which takes care of duplication rules. That is, it overrides Equals and GetHashCode to recreate exactly the duplication rules you are considering. (I'm understanding from your question that 00012345.pdf and 12345.pdf should be considered duplicates?)

Make this class explicitly or implicitly convertible to string (or override ToString() for that matter).

Create a HashSet<MyString> and fill it up by iterating through your original List<string>, checking for duplicates as you go.

Might be dirty but it will do the trick. The only "hard" part here is correctly implementing your duplication rules.
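
A minimal sketch of those steps, assuming the duplication rule is "same name once leading zeros are removed" as described in the question's update. MyString is the class name suggested above; HashSetDuplicateFinder and FindDuplicates are names invented for the example.

```csharp
using System;
using System.Collections.Generic;

// Wrapper whose equality follows the duplication rule (an assumption:
// two names are duplicates when they match after stripping leading zeros).
public sealed class MyString : IEquatable<MyString>
{
    private readonly string original;
    private readonly string normalized;

    public MyString(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        original = value;
        normalized = value.TrimStart('0');
    }

    public bool Equals(MyString other)
    {
        return other != null && normalized == other.normalized;
    }

    public override bool Equals(object obj) { return Equals(obj as MyString); }

    // Must be derived from the same normalized value that Equals compares.
    public override int GetHashCode() { return normalized.GetHashCode(); }

    public override string ToString() { return original; }
}

public static class HashSetDuplicateFinder
{
    // Returns every name whose normalized form was already seen earlier in the list.
    public static List<string> FindDuplicates(IEnumerable<string> fileNames)
    {
        var seen = new HashSet<MyString>();
        var duplicates = new List<string>();
        foreach (string name in fileNames)
        {
            // HashSet<T>.Add returns false when an equal element is already present.
            if (!seen.Add(new MyString(name)))
                duplicates.Add(name);
        }
        return duplicates;
    }

    public static void Main()
    {
        var names = new List<string> { "00012345.pdf", "12345.pdf", "1234567.pdf", "12.pdf" };
        Console.WriteLine(string.Join(", ", FindDuplicates(names)));
    }
}
```

This makes a single pass over the list, so it avoids comparing each entry with every other entry.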

Here is a simple way to find duplicate words and duplicate characters in a string (in Java). For words:

import java.util.HashMap;

public class Test {
    public static void main(String[] args) {
        findDuplicateWords("i am am a a learner learner learner");
    }

    // Counts how often each space-separated word occurs and prints the counts.
    private static void findDuplicateWords(String string) {
        HashMap<String, Integer> hm = new HashMap<>();
        String[] s = string.split(" ");
        for (String tempString : s) {
            if (hm.get(tempString) != null) {
                hm.put(tempString, hm.get(tempString) + 1);
            } else {
                hm.put(tempString, 1);
            }
        }
        // Words with a count above 1 are the duplicates.
        System.out.println(hm);
    }
}

For characters, loop over the string's length, read each character with charAt(), and count it the same way.

Maybe something like this:

List<string> theList = new List<string>() { "00012345.pdf", "00012345.pdf", "12345.pdf", "1234567.pdf", "12.pdf" };

theList.GroupBy(txt => txt)
        .Where(grouping => grouping.Count() > 1)
        .ToList()
        .ForEach(groupItem => Console.WriteLine("{0} duplicated {1} times with these values {2}",
                                                 groupItem.Key,
                                                 groupItem.Count(),
                                                 string.Join(" ", groupItem.ToArray())));
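
Note that grouping on the text itself only finds exact repeats. To also catch the missing-padding case from the question, one option is to group on the name with its leading zeros stripped. The sketch below assumes that rule; TrimmedGrouping and GroupDuplicates are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrimmedGrouping
{
    // Maps each normalized name (leading zeros removed) to the original
    // names that share it, keeping only keys with more than one name.
    public static Dictionary<string, List<string>> GroupDuplicates(IEnumerable<string> fileNames)
    {
        return fileNames
            .GroupBy(name => name.TrimStart('0'))
            .Where(grouping => grouping.Count() > 1)
            .ToDictionary(grouping => grouping.Key, grouping => grouping.ToList());
    }

    public static void Main()
    {
        var theList = new List<string> { "00012345.pdf", "00012345.pdf", "12345.pdf", "1234567.pdf", "12.pdf" };
        foreach (var pair in GroupDuplicates(theList))
            Console.WriteLine("{0} duplicated {1} times with these values {2}",
                              pair.Key, pair.Value.Count, string.Join(" ", pair.Value));
    }
}
```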

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 