
Remove duplicates in a DataTable

I have the following implementation to find duplicates in a DataTable. It is highly inefficient and takes forever on about 20K rows. I only need to find duplicate entries among the second column's values:

private List<string> checkForDuplicates(DataTable results)
{
    List<string> duplicateLists = new List<string>();
    for (int i = 0; i < results.Rows.Count; i++ )
    {
        string cellvalue = results.Rows[i][1].ToString();
        for (int j = 0; j < results.Rows.Count; j++)
        {
            if (i != j)
            {
                if (cellvalue.Equals(results.Rows[j][1].ToString()))
                {
                    // Duplicate found; parenthesise the offsets so they are added, not concatenated
                    duplicateLists.Add(cellvalue + "_" + (i + 2) + "_" + (j + 2));
                }
            }
        }

    }
    return duplicateLists;
}

One optimisation you could make is to do the de-duplication on a sorted data set. Define a DataView which sorts the data on the relevant column, then simply check that the current row's value is not the same as the previous row's value.
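
A minimal sketch of the sorted-view idea, assuming the column can be addressed by name (the question uses column index 1) and that comparing the values as strings is acceptable. The class name, method name and `columnName` parameter are made up for illustration. It only reports the duplicated values; it does not remove any rows.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class SortedDuplicateFinder
{
    // Returns each value that occurs more than once in the given column.
    // columnName is a hypothetical parameter, not part of the original code.
    public static List<string> FindDuplicates(DataTable table, string columnName)
    {
        var duplicates = new List<string>();

        // Sorting the view places equal values in adjacent rows.
        var view = new DataView(table) { Sort = columnName };

        string previous = null;
        bool alreadyReported = false;
        foreach (DataRowView row in view)
        {
            string current = row[columnName].ToString();
            if (previous != null && current == previous)
            {
                // Report a run of equal values only once.
                if (!alreadyReported)
                {
                    duplicates.Add(current);
                    alreadyReported = true;
                }
            }
            else
            {
                alreadyReported = false;
            }
            previous = current;
        }
        return duplicates;
    }
}
```

One caveat: DataView sorts string columns according to the table's CaseSensitive and Locale settings, so values that differ only by case may need extra handling.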

Mark Sowul's answer might be a better idea, however, if you aren't bothered about physically removing the rows.

The problem you've got is that every row has to be checked against every other row, so the number of checks grows quadratically with the number of rows (roughly 400 million comparisons for 20K rows). The quickest way to handle it is to make it linear: only do as many checks as there are rows.

One way to do this is to sort the data table by column2. This will put any duplicates in adjacent rows, so then you just need to run through the table checking whether each row matches the next one.

The other way is to get things at source and make sure the rows are distinct before you read them.

Use a Dictionary and iterate once over all values, counting the occurrences of each value: the Dictionary key is the column value and the Dictionary value is the count. Then return all keys where the count is more than one.
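
Something along these lines should work as a sketch of the Dictionary approach, assuming the values are compared by their string form as in the question. `DuplicateCounter` and `FindDuplicates` are illustrative names, not from the original post.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DuplicateCounter
{
    // Returns every value that appears more than once in the given column.
    public static List<string> FindDuplicates(DataTable table, int columnIndex)
    {
        var counts = new Dictionary<string, int>();

        // Single pass: key is the cell value, value is how often it was seen.
        foreach (DataRow row in table.Rows)
        {
            string key = row[columnIndex].ToString();
            counts.TryGetValue(key, out int count); // count stays 0 if the key is new
            counts[key] = count + 1;
        }

        // Keep only the keys seen more than once.
        return counts.Where(pair => pair.Value > 1)
                     .Select(pair => pair.Key)
                     .ToList();
    }
}
```

For the table in the question this would be called as `DuplicateCounter.FindDuplicates(results, 1)`. Unlike the original method it returns only the duplicated values, not the row numbers; storing a list of row indexes per key instead of a count would recover those.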

From: http://social.msdn.microsoft.com/Forums/en/adodotnetdataset/thread/ed9c6a6a-a93e-4bf5-a892-d8471b84aa3b

DataTable distinctTable = originalTable.DefaultView.ToTable( /*distinct*/ true);

For your purposes you could make a DataView that includes only the column(s) you're interested in.
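
As a sketch of restricting the distinct check to one column, the `DataView.ToTable(bool, params string[])` overload can be used. The wrapper class and the `columnName` parameter below are hypothetical.

```csharp
using System.Data;

public static class DistinctColumn
{
    // Returns a new single-column table holding the distinct values of columnName.
    public static DataTable DistinctValues(DataTable table, string columnName)
    {
        // First argument: distinct = true; remaining arguments: columns to keep.
        return table.DefaultView.ToTable(true, columnName);
    }
}
```

If the returned table has fewer rows than the original, the column contains duplicates. This tells you that duplicates exist but not which rows they are in.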

SQL would be a much more efficient way of doing this than pulling the entire dataset twice.

You can do it very quickly if you have an index on the column you are referring to.

Just do

SELECT t1.id AS matchID, t1.column1 FROM table1 t1 WHERE t1.column1 IN (SELECT t2.column1 FROM table1 t2 WHERE t2.id <> t1.id)

or something like that

Cheers, Niko
