
Linq to compare 2 different lists and select the outer join

I have 2 different classes that represent 2 types of data. The first is the unposted raw format. The second is the posted format.

 public class SalesRecords
 {
    public long? RecordId { get; set; }
    public DateTime RecordDate { get; set; }
    public string RecordDesc { get; set; }

    // Other non-related properties and methods
 }

 public class PostedSalesRecords
 {
      public long? CorrelationId { get; set; }
      public DateTime RecordDate { get; set; }
      public DateTime? PostedDate { get; set; }
      public string RecordDesc { get; set; }

      // Other non-related properties and methods
 }

Our system has a list of sales records. These sales records are posted to a different system at a time determined by the users. I am creating a reconciliation screen that shows all of the posted sales records alongside the unposted ones. The data source for my grid will be a list of PostedSalesRecords.

What I need to do is find which records in the List<SalesRecords> are not in the List<PostedSalesRecords>, and then map those unposted SalesRecords to PostedSalesRecords. I am having trouble finding a way to compare them quickly. I tried this, and it was EXTREMELY slow:

 private List<SalesRecords> GetUnpostedSalesRecords(
      List<SalesRecords> allSalesRecords,
      List<PostedSalesRecords> postedSalesRecords)
 {
      return allSalesRecords.Where(x => !(postedSalesRecords.Select(y => y.CorrelationId.Value).Contains(x.RecordId.Value))).ToList();
 }

My biggest issue is that I am filtering through a lot of data. I am comparing ~55,000 total sales records to about 17,000 posted records. It takes about 2 minutes for me to process this. Any possible way to speed this up? Thanks!

You can try an outer join and see whether it helps with performance:

 var test = (from s in allSalesRecords
                join p in postedSalesRecords on s.RecordId equals p.CorrelationId into joined
                from j in joined.DefaultIfEmpty()
                where j == null
                select s).ToList();

Alternatively, keeping your implementation, you can build a dictionary containing only the Ids from postedSalesRecords and use it in your query. This should help considerably, because each lookup is O(1) on average instead of a scan through the whole posted collection for every sales record (up to 55,000 × 17,000 ≈ 935 million comparisons with your numbers).

 var postedIds = postedSalesRecords.Select(y => y.CorrelationId.Value)
                                   .Distinct()
                                   .ToDictionary(d => d);
 return allSalesRecords.Where(x => !postedIds.ContainsKey(x.RecordId.Value)).ToList();

Using a left outer join as described on MSDN should work much more efficiently:

private List<SalesRecords> GetUnpostedSalesRecords(
    List<SalesRecords> allSalesRecords,
    List<PostedSalesRecords> postedSalesRecords)
{
    return (from x in allSalesRecords
            join y in postedSalesRecords on x.RecordId.Value
                                     equals y.CorrelationId.Value into joined
            from z in joined.DefaultIfEmpty()
            where z == null
            select x).ToList();
}

In LINQ to Objects, the join builds a hash-based lookup over the inner sequence, which is why it avoids the repeated scans. You could also implement this yourself (arguably clearer that way): build a HashSet<long> of the ID values in one or both lists so that you don't need a repeated O(N) lookup for each item in the outer list.
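As a rough sketch of the HashSet<long> idea, something like the following might work. The two classes are trimmed copies of the ones in the question. The Reconciliation class and the ToUnpostedRow helper are hypothetical names introduced here for illustration, and treating a record with a null RecordId as unposted is an assumption, not something the question specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class SalesRecords
{
    public long? RecordId { get; set; }
    public DateTime RecordDate { get; set; }
    public string RecordDesc { get; set; }
}

public class PostedSalesRecords
{
    public long? CorrelationId { get; set; }
    public DateTime RecordDate { get; set; }
    public DateTime? PostedDate { get; set; }
    public string RecordDesc { get; set; }
}

// Hypothetical helper class; the name is not from the original post.
public static class Reconciliation
{
    public static List<SalesRecords> GetUnpostedSalesRecords(
        List<SalesRecords> allSalesRecords,
        List<PostedSalesRecords> postedSalesRecords)
    {
        // Build the set of posted ids once; duplicates collapse automatically
        // and posted rows without a CorrelationId are skipped.
        var postedIds = new HashSet<long>(
            postedSalesRecords
                .Where(p => p.CorrelationId.HasValue)
                .Select(p => p.CorrelationId.Value));

        // Assumption: a record with no RecordId cannot have been posted yet,
        // so it is kept. Each Contains call is a hash lookup, not a list scan.
        return allSalesRecords
            .Where(s => !s.RecordId.HasValue || !postedIds.Contains(s.RecordId.Value))
            .ToList();
    }

    // Hypothetical mapping of an unposted record into the grid's row type.
    // PostedDate stays null to mark the row as not yet posted.
    public static PostedSalesRecords ToUnpostedRow(SalesRecords s)
    {
        return new PostedSalesRecords
        {
            CorrelationId = s.RecordId,
            RecordDate = s.RecordDate,
            PostedDate = null,
            RecordDesc = s.RecordDesc
        };
    }
}
```

The grid's data source could then be assembled with something like postedSalesRecords.Concat(unposted.Select(Reconciliation.ToUnpostedRow)).ToList(), where unposted is the list returned by GetUnpostedSalesRecords.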


 