What is the best way to remove duplicates from a DataTable?

Question

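I have checked this whole site and searched the web, but could not find a simple solution to this problem.

I have a DataTable with about 20 columns and 10K rows. I need to remove the duplicate rows in this DataTable based on 4 key columns. Doesn't .NET have a function that does this? The closest function to what I am looking for is datatable.DefaultView.ToTable(true, array of columns to display), but this function does a distinct over all of the listed columns.

If anyone could help me, that would be great.

EDIT: I am sorry for not being clear on this. This DataTable is created by reading a CSV file, not from a database, so using a SQL query is not an option.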

Answer 1

You can use LINQ to DataSet. Check this out. Something like this:

// Fill the DataSet.
DataSet ds = new DataSet();
ds.Locale = CultureInfo.InvariantCulture;
FillDataSet(ds);

List<DataRow> rows = new List<DataRow>();

DataTable contact = ds.Tables["Contact"];

// Get 100 rows from the Contact table.
IEnumerable<DataRow> query = (from c in contact.AsEnumerable()
                              select c).Take(100);

DataTable contactsTableWith100Rows = query.CopyToDataTable();

// Add 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

// Create duplicate rows by adding the same 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

DataTable table =
    System.Data.DataTableExtensions.CopyToDataTable<DataRow>(rows);

// Find the unique contacts in the table.
IEnumerable<DataRow> uniqueContacts =
    table.AsEnumerable().Distinct(DataRowComparer.Default);

Console.WriteLine("Unique contacts:");
foreach (DataRow uniqueContact in uniqueContacts)
{
    Console.WriteLine(uniqueContact.Field<Int32>("ContactID"));
}

Answer 2

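See "How can I remove duplicate rows?" (adjust the query there to join on your 4 key columns).

EDIT: With your new information, I believe the easiest way is to implement IEqualityComparer<T> and use Distinct on your data rows. Otherwise, if you are working with IEnumerable/IList instead of DataTable/DataRow, it is certainly possible with some LINQ-to-Objects kung-fu.

EDIT: sample IEqualityComparer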

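public class MyRowComparer : IEqualityComparer<DataRow>
{
    // Case-insensitive, matching the intent of string.Compare(..., true).
    private static readonly StringComparer NameComparer =
        StringComparer.CurrentCultureIgnoreCase;

    public bool Equals(DataRow x, DataRow y)
    {
        return x.Field<int>("ID") == y.Field<int>("ID") &&
            NameComparer.Equals(x.Field<string>("Name"), y.Field<string>("Name"));
            // extend this to include all your 4 keys...
    }

    public int GetHashCode(DataRow obj)
    {
        // Must agree with Equals, so the name is hashed case-insensitively too.
        return obj.Field<int>("ID").GetHashCode() ^
            NameComparer.GetHashCode(obj.Field<string>("Name") ?? string.Empty);
            // ...and so on for the remaining key columns
    }
}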

You can use it like this:

var uniqueRows = myTable.AsEnumerable().Distinct(new MyRowComparer());

Answer 3

I think this must be the best way to remove duplicates from a DataTable, using LINQ and MoreLinq code:

LINQ

RemoveDuplicatesRecords(yourDataTable);


private DataTable RemoveDuplicatesRecords(DataTable dt)
{
    var UniqueRows = dt.AsEnumerable().Distinct(DataRowComparer.Default);
    DataTable dt2 = UniqueRows.CopyToDataTable();
    return dt2;
}

Blog post: Remove duplicate row records from a DataTable in ASP.NET C#

MoreLinq

// Distinctby  column name ID 
var valueDistinctByIdColumn = yourTable.AsEnumerable().DistinctBy(row => new { Id = row["Id"] });
DataTable dtDistinctByIdColumn = valueDistinctByIdColumn.CopyToDataTable();

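Note: MoreLinq requires adding the library to your project.

MoreLinq provides a function called DistinctBy, which lets you specify the property on which to find distinct objects.

Blog post: Using the MoreLinq DistinctBy method to remove duplicate records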
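As a rough sketch of how DistinctBy could cover the four key columns from the question: the example below uses the DistinctBy method that is built into System.Linq from .NET 6 onward rather than MoreLinq's, so that it has no extra dependency. The column names Key1 to Key4 and the class and method names are placeholders, not part of the original answer.

```csharp
using System.Data;
using System.Linq;

public static class DistinctByDedupe
{
    // Keeps the first row seen for each combination of the four key columns.
    // "Key1".."Key4" are hypothetical column names.
    public static DataTable DistinctByFourKeys(DataTable source)
    {
        var unique = source.AsEnumerable()
            // A value tuple compares element by element, so it works as a composite key.
            .DistinctBy(row => (row["Key1"], row["Key2"], row["Key3"], row["Key4"]))
            .ToList();

        // CopyToDataTable throws on an empty sequence, so fall back to an empty clone.
        return unique.Count > 0 ? unique.CopyToDataTable() : source.Clone();
    }
}
```

If both MoreLinq and .NET 6 are referenced, the two DistinctBy extension methods can clash, so only one of them should be in scope.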

Answer 4

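If you have access to LINQ, I think you should be able to use the built-in grouping functionality on the in-memory collection and pick out the duplicate rows.

Search Google for "Linq Group by" for examples.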
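A minimal sketch of the grouping idea, assuming four string key columns with the placeholder names Key1 to Key4 (the class and method names are also made up for illustration): group the rows by an anonymous type built from the keys and keep the first row of each group.

```csharp
using System.Data;
using System.Linq;

public static class GroupByDedupe
{
    // Keeps the first row of every group of rows that share the four key values.
    // "Key1".."Key4" are hypothetical column names, assumed to hold strings.
    public static DataTable KeepFirstPerKey(DataTable source)
    {
        var firstRows = source.AsEnumerable()
            .GroupBy(r => new
            {
                K1 = r.Field<string>("Key1"),
                K2 = r.Field<string>("Key2"),
                K3 = r.Field<string>("Key3"),
                K4 = r.Field<string>("Key4")
            })
            .Select(g => g.First())
            .ToList();

        // CopyToDataTable throws on an empty sequence, so fall back to an empty clone.
        return firstRows.Count > 0 ? firstRows.CopyToDataTable() : source.Clone();
    }
}
```

Anonymous types compare by value, which is what makes them usable as a composite grouping key here. To list the duplicates instead of removing them, the Select could be swapped for a Where(g => g.Count() > 1).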

Answer 5

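Bear in mind that Table.AcceptChanges() must be called to complete the deletion. Otherwise the deleted rows are still present in the DataTable with their RowState set to Deleted, and Table.Rows.Count does not change after the deletion.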
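The behaviour can be checked with a small sketch like the one below (the class and method names are illustrative only). Note that it calls AcceptChanges() once up front: a row that is still in the Added state is removed immediately by Delete(), so the effect described above only shows up for rows that have already been accepted.

```csharp
using System.Data;

public static class DeleteDemo
{
    // Returns the row count right after Delete() and again after AcceptChanges().
    public static (int BeforeAccept, int AfterAccept) DeleteFirstRow()
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Rows.Add(1);
        table.Rows.Add(2);
        table.AcceptChanges();            // both rows are now Unchanged

        table.Rows[0].Delete();           // only marks the row as Deleted
        int before = table.Rows.Count;    // the deleted row is still counted

        table.AcceptChanges();            // physically removes the deleted row
        int after = table.Rows.Count;

        return (before, after);
    }
}
```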

Answer 6

I wasn't keen on using the LINQ solutions above, so I wrote this:

/// <summary>
/// Takes a datatable and a column index, and returns a datatable without duplicates
/// </summary>
/// <param name="dt">The datatable containing duplicate records</param>
/// <param name="ComparisonFieldIndex">The column index containing duplicates</param>
/// <returns>A datatable object without duplicated records</returns>
public DataTable duplicateRemoval(DataTable dt, int ComparisonFieldIndex)
{
    try
    {
        //Build the new datatable that will be returned
        DataTable dtReturn = new DataTable();
        for (int i = 0; i < dt.Columns.Count; i++)
        {
            dtReturn.Columns.Add(dt.Columns[i].ColumnName, System.Type.GetType("System.String"));
        }

        //Loop through each record in the datatable we have been passed
        foreach (DataRow dr in dt.Rows)
        {
            bool Found = false;
            //Loop through each record already present in the datatable being returned
            foreach (DataRow dr2 in dtReturn.Rows)
            {
                bool Identical = true;
                //Compare the column specified to see if it matches an existing record
                if (!(dr2[ComparisonFieldIndex].ToString() == dr[ComparisonFieldIndex].ToString()))
                {
                    Identical = false;
                }
                //If the record found identically matches one we already have, don't add it again
                if (Identical)
                {
                    Found = true;
                    break;
                }
            }
            //If we didn't find a matching record, we'll add this one
            if (!Found)
            {
                DataRow drAdd = dtReturn.NewRow();
                for (int i = 0; i < dtReturn.Columns.Count; i++)
                {
                    drAdd[i] = dr[i];
                }

                dtReturn.Rows.Add(drAdd);
            }
        }
        return dtReturn;
    }
    catch (Exception)
    {
        //Return the original datatable if something failed above
        return dt;
    }
}

Additionally, this version compares all columns rather than one specific column index:

/// <summary>
/// Takes a datatable and returns a datatable without duplicates
/// </summary>
/// <param name="dt">The datatable containing duplicate records</param>
/// <returns>A datatable object without duplicated records</returns>
public DataTable duplicateRemoval(DataTable dt)
{
    try
    {
        //Build the new datatable that will be returned
        DataTable dtReturn = new DataTable();
        for (int i = 0; i < dt.Columns.Count; i++)
        {
            dtReturn.Columns.Add(dt.Columns[i].ColumnName, System.Type.GetType("System.String"));
        }

        //Loop through each record in the datatable we have been passed
        foreach (DataRow dr in dt.Rows)
        {
            bool Found = false;
            //Loop through each record already present in the datatable being returned
            foreach (DataRow dr2 in dtReturn.Rows)
            {
                bool Identical = true;
                //Compare all columns to see if they match the existing record
                for (int i = 0; i < dt.Columns.Count; i++)
                {
                    if (!(dr2[i].ToString() == dr[i].ToString()))
                    {
                        Identical = false;
                    }
                }
                //If the record found identically matches one we already have, don't add it again
                if (Identical)
                {
                    Found = true;
                    break;
                }
            }
            //If we didn't find a matching record, we'll add this one
            if (!Found)
            {
                DataRow drAdd = dtReturn.NewRow();
                for (int i = 0; i < dtReturn.Columns.Count; i++)
                {
                    drAdd[i] = dr[i];
                }

                dtReturn.Rows.Add(drAdd);
            }
        }
        return dtReturn;
    }
    catch (Exception)
    {
        //Return the original datatable if something failed above
        return dt;
    }
}

Answer 7

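Here is some very simple code that needs neither LINQ nor individual columns to do the filtering. If all of the column values in a row are null or empty, the row is removed.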

public DataSet duplicateRemoval(DataSet dSet)
{
    bool flag;
    int ccount = dSet.Tables[0].Columns.Count;
    string[] colst = new string[ccount];
    int p = 0;

    DataSet dsTemp = new DataSet();
    DataTable Tables = new DataTable();
    dsTemp.Tables.Add(Tables);

    for (int i = 0; i < ccount; i++)
    {
        dsTemp.Tables[0].Columns.Add(dSet.Tables[0].Columns[i].ColumnName, System.Type.GetType("System.String"));
    }

    foreach (System.Data.DataRow row in dSet.Tables[0].Rows)
    {
        flag = false;
        p = 0;
        foreach (System.Data.DataColumn col in dSet.Tables[0].Columns)
        {
            colst[p++] = row[col].ToString();
            if (!string.IsNullOrEmpty(row[col].ToString()))
            {  //Display only if any of the data is present in column
                flag = true;
            }
        }
        if (flag == true)
        {
            DataRow myRow = dsTemp.Tables[0].NewRow();
            //Response.Write("<tr style=\"background:#d2d2d2;\">");
            for (int kk = 0; kk < ccount; kk++)
            {
                myRow[kk] = colst[kk];         

                // Response.Write("<td class=\"table-line\" bgcolor=\"#D2D2D2\">" + colst[kk] + "</td>");
            }
            dsTemp.Tables[0].Rows.Add(myRow);
        }
    }
    return dsTemp;
}

This can even be used to remove empty rows from Excel sheet data.

Answer 8

Use a query instead of a function:

DELETE FROM table1 AS tb1 INNER JOIN 
(SELECT id, COUNT(id) AS cntr FROM table1 GROUP BY id) AS tb2
ON tb1.id = tb2.id WHERE tb2.cntr > 1

Answer 9

Liggett78's answer is much better, not least because mine had errors! Corrected below...

DELETE TableWithDuplicates
    FROM TableWithDuplicates
        LEFT OUTER JOIN (
            SELECT PK_ID = Min(PK_ID), --Decide your method for deciding which rows to keep
                KeyColumn1,
                KeyColumn2,
                KeyColumn3,
                KeyColumn4
                FROM TableWithDuplicates
                GROUP BY KeyColumn1,
                    KeyColumn2,
                    KeyColumn3,
                    KeyColumn4
            ) AS RowsToKeep
            ON TableWithDuplicates.PK_ID = RowsToKeep.PK_ID
    WHERE RowsToKeep.PK_ID IS NULL

Answer 10

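Found this on bytes.com:

You can use the JET 4.0 OLE DB provider with the classes in the System.Data.OleDb namespace to access a comma-delimited text file (using a DataSet/DataTable).

Alternatively, you can use the Microsoft Text Driver for ODBC with the classes in the System.Data.Odbc namespace to access the file through the ODBC driver.

That would let you access your data through SQL queries, as others have proposed.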

Answer 11

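"This DataTable is created by reading a CSV file, not from a database."

So put a unique constraint on the four columns in the database, and duplicates inserted under your design will not get in. That is, unless the import decides to fail rather than continue when that happens, but that can surely be configured in your CSV import script.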

Answer 12

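For completeness, here is an example based on some of the answers already posted. This solution filters the table by fieldKey1 through fieldKeyn when the remaining columns may differ. Within each set of duplicates it keeps the first match, ordered by the minimum values of two other columns: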

return dt.AsEnumerable()
    .Distinct(DataRowComparer.Default)
    .GroupBy(r => new
    {
        fieldKey1 = r.Field<int>("fieldKey1"), 
        fieldKey2 = r.Field<string>("fieldKey2"), 
        fieldKeyn = r.Field<DateTime>("fieldKeyn")
    })
    .Select(g =>  
        g.OrderBy( dr => dr.Field<int>( "OtherField1" ) )
            .ThenBy( dr => dr.Field<int>( "OtherField2" ) )
                .First())
    .CopyToDataTable();

So the DataTable dt:

fieldKey1	fieldKey2	fieldKeyn	OtherField1	OtherField2	OtherField3
1	two	31-12-2020	4	3	xyz7
2	other	31-12-2021	4	3	xyz100
1	two	31-12-2020	2	2	xyz3
1	two	31-12-2020	2	3	xyz4
1	two	31-12-2020	1	2	xyz1
1	two	31-12-2020	1	4	xyz2
1	two	31-12-2020	3	3	xyz5
1	two	31-12-2020	3	3	xyz6

will return:

fieldKey1	fieldKey2	fieldKeyn	OtherField1	OtherField2	OtherField3
1	two	31-12-2020	1	2	xyz1
2	other	31-12-2021	4	3	xyz100

Answer 13

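Try this.

Let us say dtInput is the DataTable with duplicate records.

I have a new DataTable, dtFinal, into which I want the rows filtered of duplicates.

So my code will look something like this: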

DataTable dtFinal = dtInput.DefaultView.ToTable(true, 
                           new string[ColumnCount] {"Col1Name","Col2Name","Col3Name",...,"ColnName"});
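One thing worth checking with this approach is which columns survive. The sketch below, using the hypothetical column names Key1 and Key2 and an illustrative class name, passes only the key columns to ToTable: the result is distinct on those keys, but it also contains only those columns, which is the limitation the question describes.

```csharp
using System.Data;

public static class ToTableDemo
{
    // Returns the distinct combinations of the named columns.
    // Columns that are not listed are not copied into the result.
    public static DataTable DistinctKeys(DataTable source)
    {
        return source.DefaultView.ToTable(true, "Key1", "Key2");
    }
}
```

Listing every column instead keeps all of the data, but then rows are only treated as duplicates when they match on every column.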

13 answers

Answer 1: 8, accepted, 2008-12-04 11:19:23
Answer 2: 7, 2008-12-04 11:13:34
Answer 3: 1, 2013-05-09 10:29:03
Answer 4: 1, 2008-12-04 11:17:53
Answer 5: 1, 2011-10-01 11:41:30
Answer 6: 0, 2013-04-03 22:54:03
Answer 7: 0, 2010-05-27 09:31:09
Answer 8: 0, 2008-12-04 11:11:30
Answer 9: 0
Answer 10: 0, 2008-12-04 11:23:35
Answer 11: 0, 2008-12-04 11:26:11
Answer 12: 0, 2022-01-20 18:09:42
Answer 13: 0, 2011-12-16 07:47:52
