
Avoiding duplicates in hierarchical parent-child relational collection

I am looking to write a LINQ statement for a simple scenario involving collections. I am trying to avoid duplicate items in a collection based on a parent-child relationship. The data structure and sample code are below.

public class Catalog
{
    public int CatalogId { get; set; }
    public int? ParentCatalogId { get; set; } // null for the root catalog
    public string CatalogName { get; set; }
}

public class Model
{
    public int CatalogId { get; set; }
    public string ItemName { get; set; }
    ...
}

List<Catalog> Catalogs: contains the complete list of parent-child relations, to any depth, for all catalogs; the root catalog has ParentCatalogId = null.

List<Model> CollectionA: contains all the items of the child catalog as well as of its parent catalogs, for a specific catalogId (up to its root).

I need to create a CollectionB from CollectionA that contains the items of the provided catalogId plus the items of all its parents, such that if an item is present in a child catalog, the same item in a parent catalog is ignored. This way there won't be any duplicate items when the same item is available in both the child and the parent.

In terms of code, I am trying to achieve something like this:

while (catalogId != null)
{
    // Walk from child to parent, skipping items that are already in CollectionB
    CollectionB.AddRange(
        CollectionA.Where(x => x.CatalogId == catalogId &&
                               !CollectionB.Select(y => y.ItemName).Contains(x.ItemName)));

    // Move up to the parent catalog (the key property on Catalog is CatalogId, not Id)
    catalogId = Catalogs.
        Where(x => x.CatalogId == catalogId).
        Select(x => x.ParentCatalogId).
        FirstOrDefault();
}

I know that the Contains clause in the LINQ statement above will not work; I only included it to explain what I am trying to do. I can do this with a foreach loop, but I would like to use LINQ. I am looking for the correct LINQ statement to do this. The sample data is given below, and I would really appreciate some help.

    Catalog

    ID   ParentId  CatalogName
    1    null      CatalogA
    2      1       Catalogb
    3      1       CatalogC
    4      2       CatalogD
    5      4       CatalogE

    CollectionA

    CatalogId    ItemName
    5            ItemA
    5            ItemB
    4            ItemA
    4            ItemC
    2            ItemA
    2            ItemC
    1            ItemD

    Expected output
    CollectionB
    5    ItemA
    5    ItemB
    4    ItemC
    1    ItemD

LINQ is not designed to traverse hierarchical data structures, as has already been discussed elsewhere.

But if you can get the hierarchy of catalogs ordered from child to root, the problem can be solved with a join followed by a distinct-by-property step (see "LINQ's Distinct() on a particular property"):

var modelsForE = (from catalog in flattenedHierarchyOfCatalogE
                  join model in models
                      on catalog.CatalogId equals model.CatalogId
                  select model).
                  GroupBy(model => model.ItemName).
                  // First() keeps the child-most model, provided the hierarchy is ordered child to root
                  Select(modelGroup => modelGroup.First());

Or, even better, adapt Jon Skeet's answer for distinct.
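On .NET 6 or later, the built-in Enumerable.DistinctBy can stand in for the GroupBy/First pair. The following is only a minimal sketch: the Catalog and Model records and the DedupSketch.ChildFirstItems helper are hypothetical stand-ins for the question's classes, and the catalog sequence is assumed to be ordered child to root already.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the question's classes.
public record Catalog(int CatalogId, int? ParentCatalogId, string CatalogName);
public record Model(int CatalogId, string ItemName);

public static class DedupSketch
{
    // hierarchyChildToRoot must list the starting catalog first and the root last.
    public static List<Model> ChildFirstItems(
        IEnumerable<Catalog> hierarchyChildToRoot, IEnumerable<Model> models)
    {
        return hierarchyChildToRoot
            // Join keeps the order of the outer sequence, so child items come first.
            .Join(models, c => c.CatalogId, m => m.CatalogId, (c, m) => m)
            // DistinctBy (.NET 6+): the current implementation yields the first
            // element seen for each key; the docs do not formally promise ordering.
            .DistinctBy(m => m.ItemName)
            .ToList();
    }
}
```

If the ordering caveat matters to you, the GroupBy/First form above is the safer choice, since GroupBy documents that groups and their elements keep source order.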

Either approach solves the duplicates problem but leaves us with another question: how do we get flattenedHierarchyOfCatalogE?

PURE LINQ SOLUTION:

It is not an easy task, but it is not exactly impossible with pure LINQ. Adapting "How to search Hierarchical Data with Linq", we get:

public static class LinqExtensions
{
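    public static IEnumerable<T> Flatten<T>(this T source, Func<T, IEnumerable<T>> selector)
    {
        // Yield the source first so that, when the selector returns the parent,
        // the result is ordered child to root. The dedup query relies on that
        // order: with the root first, First() would pick the parent's item.
        return new[] { source }
            .Concat(selector(source).SelectMany(c => Flatten(c, selector)));
    }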
}

//...    

var catalogs = new Catalog[] 
{
    new Catalog(1, 0, "CatalogA"),
    new Catalog(2, 1, "Catalogb"),
    new Catalog(3, 1, "CatalogC"),
    new Catalog(4, 2, "CatalogD"),
    new Catalog(5, 4, "CatalogE")
};

var models = new Model[]
{
    new Model(5, "ItemA"),
    new Model(5, "ItemB"),
    new Model(4, "ItemA"),
    new Model(4, "ItemC"),
    new Model(2, "ItemA"),
    new Model(2, "ItemC"),
    new Model(1, "ItemD")
};

var catalogE = catalogs.SingleOrDefault(catalog => catalog.CatalogName == "CatalogE");

var flattenedHierarchyOfCatalogE = catalogE.Flatten((source) =>
    catalogs.Where(catalog => 
        catalog.CatalogId == source.ParentCatalogId));

Then feed flattenedHierarchyOfCatalogE into the query from the beginning of this answer.

WARNING: I have added constructors to your classes, so the previous snippet may fail to compile in your project:

public Catalog(Int32 catalogId, Int32 parentCatalogId, String catalogName)
{
    this.CatalogId = catalogId;
    this.ParentCatalogId = parentCatalogId;
    this.CatalogName = catalogName;
} //...

SOMETHING TO CONSIDER

There is nothing wrong with the previous solution (though personally I would consider something that leans less heavily on LINQ, such as the approach in "Recursive Hierarchy - Recursive Query using Linq"). Whichever solution you choose, though, you may run into one problem: it works, but it does not use any optimized data structures; it is just a direct search and selection. If your catalogs grow and the queries execute more often, performance may become a problem.
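One possible way to address the performance concern is to build lookup structures once and walk the parent chain through them. This is only a sketch under assumptions: CatalogRow, ItemRow and CatalogIndex are hypothetical names, the root is assumed to have a null parent id, and the data is assumed to contain no cycles.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the question's classes.
public record CatalogRow(int CatalogId, int? ParentCatalogId, string CatalogName);
public record ItemRow(int CatalogId, string ItemName);

public sealed class CatalogIndex
{
    private readonly Dictionary<int, CatalogRow> _byId;
    private readonly ILookup<int, ItemRow> _itemsByCatalog;

    public CatalogIndex(IEnumerable<CatalogRow> catalogs, IEnumerable<ItemRow> items)
    {
        // Build both indexes once; later lookups avoid scanning the lists.
        _byId = catalogs.ToDictionary(c => c.CatalogId);
        _itemsByCatalog = items.ToLookup(i => i.CatalogId);
    }

    // Walks from the given catalog up to the root, yielding each item name once.
    // Assumes the parent chain has no cycles; a cycle would loop forever.
    public IEnumerable<ItemRow> ItemsFor(int catalogId)
    {
        var seenNames = new HashSet<string>();
        int? currentId = catalogId;
        while (currentId is int id && _byId.TryGetValue(id, out var catalog))
        {
            foreach (var item in _itemsByCatalog[id])
            {
                // HashSet.Add returns false when a child already supplied this name.
                if (seenNames.Add(item.ItemName))
                    yield return item;
            }
            currentId = catalog.ParentCatalogId;
        }
    }
}
```

With the sample data, new CatalogIndex(catalogs, items).ItemsFor(5) would be the call for CatalogE. The trade-off is that the index must be rebuilt (or updated) whenever the underlying lists change.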

But even if performance is not a problem, the ease of use of your classes is. IDs and foreign keys are good for relational databases but very unwieldy in OO systems. You may want to consider an object-relational mapping for your classes (or creating wrappers or mirrors of them) that will look something like:

public class Catalog
{
    public Catalog Parent { get; set; }

    public IEnumerable<Catalog> Children { get; set; }

    public string CatalogName { get; set; }
}

public class Model
{
    public Catalog Catalog { get; set; }
    public string ItemName { get; set; }
}

Such classes are far more self-contained and much easier to use, and their hierarchies are much easier to traverse. I don't know whether your system is database-driven or not, but you can nonetheless take a look at some object-relational mapping examples and technologies.
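To illustrate how traversal might look with reference-based classes, here is a rough sketch. The names CatalogNode, ItemNode, SelfAndAncestors and EffectiveItems are hypothetical, and the Items list on the catalog is an added assumption that is not part of the classes above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical reference-based variant of the classes above.
public class CatalogNode
{
    public CatalogNode Parent { get; set; }
    public string CatalogName { get; set; }

    // Assumption: each catalog keeps a list of its own items.
    public List<ItemNode> Items { get; } = new List<ItemNode>();

    // Yields this catalog first, then each ancestor up to the root.
    public IEnumerable<CatalogNode> SelfAndAncestors()
    {
        for (var node = this; node != null; node = node.Parent)
            yield return node;
    }
}

public class ItemNode
{
    public CatalogNode Catalog { get; set; }
    public string ItemName { get; set; }
}

public static class CatalogNodeExtensions
{
    // Child items shadow parent items with the same name, because the
    // child-to-root order means First() sees the child's item first.
    public static List<ItemNode> EffectiveItems(this CatalogNode start)
    {
        return start.SelfAndAncestors()
            .SelectMany(c => c.Items)
            .GroupBy(i => i.ItemName)
            .Select(g => g.First())
            .ToList();
    }
}
```

No id lookups are needed here: walking to the root is just following Parent references, which is the ease-of-use gain the paragraph above describes.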

PS: LINQ is not a universal tool in the .NET arsenal. No doubt it is a very useful tool, applicable in a multitude of situations, but not in every possible one. And if a tool cannot help you solve a problem, it should be either modified or put aside for a moment.

You are most likely looking for the SelectMany() extension method. A short example of how it can be used to select all the children for comparison (to avoid duplicates) is below:

var col = new[] { 
    new { name = "joe", children = new [] { 
        new { name = "billy", age=1 },
        new { name = "sally", age=4 }
    }},
    new { name = "bob", children = new [] {
        new { name = "megan", age=10 },
        new { name = "molly", age=7  }
    }}
};

// Dump() is a LINQPad extension method; outside LINQPad, enumerate the result instead.
col.SelectMany(c => c.children).Dump("kids");

For more information, there are a few questions on Stack Overflow about this extension method, and of course you can read the MSDN documentation.
