
remove duplicates in a datatable

I have the following implementation to find duplicates in a DataTable. It is highly inefficient and takes forever on about 20K rows. I only need to find duplicate values in the second column:

private List<string> checkForDuplicates(DataTable results)
{
    List<string> duplicateLists = new List<string>();
    for (int i = 0; i < results.Rows.Count; i++ )
    {
        string cellvalue = results.Rows[i][1].ToString();
        for (int j = 0; j < results.Rows.Count; j++)
        {
            if (i != j)
            {
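                // Compare string to string; Equals(object) is false for non-string cell values
                if (cellvalue.Equals(results.Rows[j][1].ToString()))
                {
                    // Duplicate found; parentheses make the offsets add instead of concatenate
                    duplicateLists.Add(cellvalue + "_" + (i + 2) + "_" + (j + 2));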
                }
            }
        }

    }
    return duplicateLists;
}

One optimisation you could make is to do the de-duplication on a sorted data set. Define a DataView which sorts the data on the relevant column, then simply check that the current row's value is not the same as the previous row's value.

Mark Sowul's answer might be a better idea if you aren't bothered about physically removing the rows however.

The problem you've got is that every row has to check every other row, so the number of checks grows quadratically with the number of rows (20K rows means roughly 400 million comparisons). The quickest way to handle it is to make it linear: only do as many checks as there are rows.

One way to do this is to sort the data table by column2. This will put any duplicates in adjacent rows, so then you just need to run through the table checking that one row doesn't match the next one.
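The sort-then-scan idea could be sketched roughly as follows. The class and method names (SortedDuplicateFinder, FindDuplicates) are made up for illustration, and the sketch assumes the column name contains no "]" character, since it is wrapped in brackets for the Sort expression. Note that DataTable.CaseSensitive defaults to false, so values differing only in case may be interleaved by the sort; set it to true on the table if exact matching matters.

```csharp
using System.Collections.Generic;
using System.Data;

static class SortedDuplicateFinder
{
    // Returns each value of the given column that occurs in more than one row.
    public static List<string> FindDuplicates(DataTable table, int columnIndex)
    {
        var duplicates = new List<string>();
        string columnName = table.Columns[columnIndex].ColumnName;

        // Sorting the view places equal values in adjacent rows.
        var view = new DataView(table) { Sort = "[" + columnName + "] ASC" };

        string previous = null;
        bool alreadyReported = false;
        foreach (DataRowView row in view)
        {
            string current = row[columnIndex].ToString();
            if (previous != null && current == previous)
            {
                // Report a repeated value only once, however long the run is.
                if (!alreadyReported)
                {
                    duplicates.Add(current);
                    alreadyReported = true;
                }
            }
            else
            {
                alreadyReported = false;
            }
            previous = current;
        }
        return duplicates;
    }
}
```

For the table in the question this would be called as SortedDuplicateFinder.FindDuplicates(results, 1). The sort itself is not free, but it replaces the row-against-every-row comparison with a single pass over the sorted view.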

The other way is to get things at source and make sure the rows are distinct before you read them.

Use a Dictionary and iterate once over all values, counting the occurrences of each value: the Dictionary key is the column value and the Dictionary value is the count. Then return all keys whose count is more than one.
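A minimal sketch of the counting approach, assuming the values are compared by their ToString() form as in the question's code (so DBNull cells are counted as empty strings). DictionaryDuplicateFinder and FindDuplicates are hypothetical names chosen for the example.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

static class DictionaryDuplicateFinder
{
    // Returns every value of the given column that appears more than once.
    public static List<string> FindDuplicates(DataTable table, int columnIndex)
    {
        var counts = new Dictionary<string, int>();
        foreach (DataRow row in table.Rows)
        {
            string value = row[columnIndex].ToString();
            int seen;
            counts.TryGetValue(value, out seen); // seen stays 0 when the key is absent
            counts[value] = seen + 1;
        }

        // Keep only the keys whose count is greater than one.
        return counts.Where(pair => pair.Value > 1)
                     .Select(pair => pair.Key)
                     .ToList();
    }
}
```

This touches each row once, so 20K rows need about 20K dictionary lookups rather than hundreds of millions of comparisons. If the row numbers are needed as well, the dictionary value could hold a list of indices instead of a count.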

From: http://social.msdn.microsoft.com/Forums/en/adodotnetdataset/thread/ed9c6a6a-a93e-4bf5-a892-d8471b84aa3b

DataTable distinctTable = originalTable.DefaultView.ToTable( /*distinct*/ true);

For your purposes you could make a DataView that includes only the column(s) you're interested in.
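One way to restrict the distinct check to a single column is the DataView.ToTable overload that also takes column names. The sketch below is illustrative only; DistinctColumnExample and DistinctValues are invented names, and it assumes the table's DefaultView has no RowFilter applied.

```csharp
using System.Data;

static class DistinctColumnExample
{
    // Builds a one-column table holding each distinct value of the given column.
    public static DataTable DistinctValues(DataTable originalTable, int columnIndex)
    {
        string columnName = originalTable.Columns[columnIndex].ColumnName;

        // distinct: true drops repeated rows; only the named column is copied.
        return originalTable.DefaultView.ToTable(true, columnName);
    }
}
```

Comparing the row count of the result with originalTable.Rows.Count shows whether any duplicates exist, but it does not say which rows they are; the Dictionary approach is a better fit for that.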

SQL would be a much more efficient way of doing this than pulling the entire dataset and scanning it twice.

You can do it very quickly if you have an index on the column you are referring to.

Just do

SELECT t1.id AS matchID, t1.column1 FROM table1 t1 WHERE t1.column1 IN (SELECT t2.column1 FROM table1 t2 WHERE t2.id <> t1.id)

or something like that

Cheers, Niko
