
Determining duplicates in a datatable

I have a data table I've loaded from a CSV file. I need to determine which rows are duplicates based on two columns (product_id and owner_org_id) in the datatable. Once I've determined that, I can use that information to build my result: one datatable containing only the rows that are not unique, and one datatable containing only the rows that are unique.

I've looked at other examples on here, and the code I've come up with so far does compile and execute, but it seems to think every row in the data is unique. In reality, the test data has 13 rows and only 6 are unique. So clearly I'm doing something wrong.

EDIT: I should note that rows that have duplicates should ALL be removed, not just the extra copies. For example, if a row appears 4 times, all 4 copies should be removed, rather than removing 3 and leaving one.

EDIT2: Alternatively, if I can select all duplicate rows (instead of trying to select unique rows), that is fine with me. Either way gets me to my end result.

The code in the processing method:

MyRowComparer myrc = new MyRowComparer();
var uniquerows = dtCSV.AsEnumerable().Distinct(myrc);

along with the following:

public class MyRowComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        //return ((string.Compare(x.Field<string>("PRODUCT_ID"),   y.Field<string>("PRODUCT_ID"),   true)) ==
        //        (string.Compare(x.Field<string>("OWNER_ORG_ID"), y.Field<string>("OWNER_ORG_ID"), true)));
        return
            x.ItemArray.Except(new object[] { x[x.Table.Columns["PRODUCT_ID"].ColumnName] }) ==
            y.ItemArray.Except(new object[] { y[y.Table.Columns["PRODUCT_ID"].ColumnName] }) &&
            x.ItemArray.Except(new object[] { x[x.Table.Columns["OWNER_ORG_ID"].ColumnName] }) ==
            y.ItemArray.Except(new object[] { y[y.Table.Columns["OWNER_ORG_ID"].ColumnName] });
    }

    public int GetHashCode(DataRow obj)
    {
        int y = int.Parse(obj.Field<string>("PRODUCT_ID"));
        int z = int.Parse(obj.Field<string>("OWNER_ORG_ID"));
        int c = y ^ z;
        return c;
    }
}

You could use LINQ-To-DataSet and Enumerable.Except / Intersect:

var tbl1ID = tbl1.AsEnumerable()
        .Select(r => new
        {
            product_id = r.Field<String>("product_id"),
            owner_org_id = r.Field<String>("owner_org_id"),
        });
var tbl2ID = tbl2.AsEnumerable()
        .Select(r => new
        {
            product_id = r.Field<String>("product_id"),
            owner_org_id = r.Field<String>("owner_org_id"),
        });


var unique = tbl1ID.Except(tbl2ID);
var both = tbl1ID.Intersect(tbl2ID);

var tblUnique = (from uniqueRow in unique
                join row in tbl1.AsEnumerable()
                on uniqueRow equals new
                {
                    product_id = row.Field<String>("product_id"),
                    owner_org_id = row.Field<String>("owner_org_id")
                }
                select row).CopyToDataTable();
var tblBoth = (from bothRow in both
              join row in tbl1.AsEnumerable()
              on bothRow equals new
              {
                  product_id = row.Field<String>("product_id"),
                  owner_org_id = row.Field<String>("owner_org_id")
              }
              select row).CopyToDataTable();

Edit: Obviously I misunderstood your requirement a little. You have only one DataTable and want to get all unique rows and all duplicate rows, which is even more straightforward. You can use Enumerable.GroupBy with an anonymous type containing both fields:

var groups = tbl1.AsEnumerable()
    .GroupBy(r => new
    {
        product_id = r.Field<String>("product_id"),
        owner_org_id = r.Field<String>("owner_org_id")
    });
var tblUniques = groups
    .Where(grp => grp.Count() == 1)
    .Select(grp => grp.Single())
    .CopyToDataTable();
var tblDuplicates = groups
    .Where(grp => grp.Count() > 1)
    .SelectMany(grp => grp)
    .CopyToDataTable();
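One caveat with the GroupBy approach above: CopyToDataTable throws an InvalidOperationException when the sequence contains no rows, which happens as soon as a table has no duplicates (or no uniques). The following is a minimal sketch that guards against that, assuming both key columns are strings (as they would be after a plain CSV load). The names DuplicateSplitter, ToTableOrEmpty, Split and BuildSample are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DuplicateSplitter
{
    // Hypothetical helper: copies the rows into a new table, or returns an
    // empty table with the same schema when there is nothing to copy.
    private static DataTable ToTableOrEmpty(IEnumerable<DataRow> rows, DataTable schemaSource)
    {
        List<DataRow> list = rows.ToList();
        return list.Count > 0 ? list.CopyToDataTable() : schemaSource.Clone();
    }

    // Splits the table into rows whose (product_id, owner_org_id) pair occurs
    // exactly once and rows whose pair occurs more than once.
    public static (DataTable Uniques, DataTable Duplicates) Split(DataTable source)
    {
        // Assumption: both columns are strings; the comparison is case-sensitive.
        var groups = source.AsEnumerable()
            .GroupBy(r => new
            {
                ProductId = r.Field<string>("product_id"),
                OwnerOrgId = r.Field<string>("owner_org_id")
            })
            .ToList(); // materialize once, since the groups are enumerated twice

        DataTable uniques = ToTableOrEmpty(
            groups.Where(g => g.Count() == 1).SelectMany(g => g), source);
        DataTable duplicates = ToTableOrEmpty(
            groups.Where(g => g.Count() > 1).SelectMany(g => g), source);
        return (uniques, duplicates);
    }

    // Made-up sample data: one unique pair, one pair twice, one pair three times.
    public static DataTable BuildSample()
    {
        var table = new DataTable();
        table.Columns.Add("product_id", typeof(string));
        table.Columns.Add("owner_org_id", typeof(string));
        table.Rows.Add("1", "10");
        table.Rows.Add("1", "10");
        table.Rows.Add("2", "20");
        table.Rows.Add("3", "30");
        table.Rows.Add("3", "30");
        table.Rows.Add("3", "30");
        return table;
    }
}
```

For the sample table, Split(BuildSample()) yields 1 row in Uniques and 5 rows in Duplicates. Every copy of a duplicated pair lands in Duplicates, which matches the requirement that all copies be removed from the unique set.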

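Your criterion is off. You are comparing the sets of values you are not interested in (Except removes the key value and keeps everything else). In addition, == on the two Except results compares object references rather than contents, so Equals returns false for any two rows and Distinct treats every row as unique.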

Instead, be as explicit as possible about the data type and keep it simple:

public bool Equals(DataRow x, DataRow y)
{
    // Usually you are dealing with INT keys
    return (x["PRODUCT_ID"] as int?) == (y["PRODUCT_ID"] as int?)
        && (x["OWNER_ORG_ID"] as int?) == (y["OWNER_ORG_ID"] as int?);

    // If you really are dealing with strings, this is the equivalent:
    // return (x["PRODUCT_ID"] as string) == (y["PRODUCT_ID"] as string)
    //     && (x["OWNER_ORG_ID"] as string) == (y["OWNER_ORG_ID"] as string);
}

Check for null if that is a possibility. Maybe you want to exclude rows that are equal only because their IDs are null.
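To put the Equals method above into a complete comparer, GetHashCode has to agree with it: rows that compare equal must return the same hash code. Below is a minimal sketch under a few assumptions that are not in the question: the key values are compared as strings, case-insensitively (as the commented-out string.Compare call in the question suggests), and database NULLs are treated as equal to each other. The names ProductOwnerComparer and ComparerUsage are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public sealed class ProductOwnerComparer : IEqualityComparer<DataRow>
{
    // Assumption: keys are compared as case-insensitive strings.
    private static readonly StringComparer Cmp = StringComparer.OrdinalIgnoreCase;

    // Reads a key column as a string; DBNull becomes null.
    private static string Key(DataRow row, string column) =>
        row.IsNull(column) ? null : Convert.ToString(row[column]);

    public bool Equals(DataRow x, DataRow y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        // Compare the two key columns only; all other columns are ignored.
        return Cmp.Equals(Key(x, "PRODUCT_ID"), Key(y, "PRODUCT_ID"))
            && Cmp.Equals(Key(x, "OWNER_ORG_ID"), Key(y, "OWNER_ORG_ID"));
    }

    public int GetHashCode(DataRow row)
    {
        // Hash exactly the values that Equals compares, with the same comparer.
        string p = Key(row, "PRODUCT_ID");
        string o = Key(row, "OWNER_ORG_ID");
        int hp = p == null ? 0 : Cmp.GetHashCode(p);
        int ho = o == null ? 0 : Cmp.GetHashCode(o);
        unchecked { return (hp * 397) ^ ho; }
    }
}

public static class ComparerUsage
{
    // Returns every row whose key pair occurs more than once.
    public static List<DataRow> DuplicateRows(DataTable table) =>
        table.AsEnumerable()
             .GroupBy(r => r, new ProductOwnerComparer())
             .Where(g => g.Count() > 1)
             .SelectMany(g => g)
             .ToList();
}
```

Note that Distinct with this comparer would still keep one row per key pair, which is not what the question asks for. Passing the comparer to GroupBy, as in DuplicateRows, selects all copies of each duplicated pair instead.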

Note the int?. This is not a typo. The question mark is required if you are dealing with database values from columns that can be NULL. The reason is that NULL values are represented by the type DBNull in C#. Using the as operator just gives you null in this case (instead of an InvalidCastException). If you are sure you are dealing with INT NOT NULL, cast with (int).

The same is true for strings. (string) asserts that you are expecting non-null DB values.
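The difference between the two kinds of cast can be sketched in a few lines. One extra point, which is an observation rather than part of the original answer: as int? also yields null for any value that is not a boxed int, including the string "5". If the CSV loader produced string columns, the int? version of Equals would therefore see null on both sides and treat every pair of rows as equal. The class CastDemo and its method names are hypothetical.

```csharp
using System;

public static class CastDemo
{
    // 'as' never throws: it returns null for DBNull and for any value
    // that is not a boxed int (for example the string "5").
    public static int? AsNullableInt(object value) => value as int?;

    // A hard cast throws InvalidCastException unless the value is a boxed int.
    public static int HardCast(object value) => (int)value;

    // Hypothetical alternative for string columns loaded from CSV:
    // parse the text, mapping null, DBNull and unparsable text to null.
    public static int? ParseNullableInt(object value)
    {
        if (value == null || value is DBNull) return null;
        return int.TryParse(Convert.ToString(value), out int n) ? n : (int?)null;
    }
}
```

With these helpers, AsNullableInt(DBNull.Value) and AsNullableInt("5") both return null, AsNullableInt(5) returns 5, ParseNullableInt("5") returns 5, and HardCast(DBNull.Value) throws an InvalidCastException.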

EDIT1:

Had the type wrong. ItemArray is not a hashtable. Use the row directly.

EDIT2:

Added a string example and some comments.

For a more straightforward way, see "How to select distinct rows in a datatable and store into an array".

EDIT3:

Some explanation regarding the casts.

The other link I suggested does the same as your code. I forgot your original intent ;-) I just saw your code and responded to the most obvious error I saw. Sorry.

Here is how I would solve the problem:

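using System.Data;   // DataTable, DataRow and the AsEnumerable() extension
using System.Linq;

var q = dtCSV
    .AsEnumerable()
    // Adjust the casts to the actual column type; (int) throws if the column holds strings.
    .GroupBy(r => new { ProductId = (int)r["PRODUCT_ID"], OwnerOrgId = (int)r["OWNER_ORG_ID"] })
    .Where(g => g.Count() > 1)   // keep only key pairs that occur more than once
    .SelectMany(g => g);         // flatten the groups back into rows

var duplicateRows = q.ToList();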

I don't know if this is 100% correct; I don't have an IDE at hand. You'll also need to adjust the casts to the appropriate type. See my addition above.
