简体   繁体   中英

LINQ join to return dynamic column list

I'm looking for a way to return a dynamic column list from a LINQ join of two datatables.

First, this is not a duplicate. I have already studied and discarded:

C# LINQ list select columns dynamically from a joined dataset

Creating a LINQ select from multiple tables

How to do a LINQ join that behaves exactly like a physical database inner join?

(and many others)

Here is my starting point:

public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
   DataTable result = ( from dataRows1 in dt1.AsEnumerable()
                        join dataRows2 in dt2.AsEnumerable()
                        on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
                        [...I NEED HELP HERE with the SELECT....]).CopyToDataTable();
   return result;
}

A few notes and requirements:

  1. There is no database engine. The data sources are large CSV files (500K+ records) being read into c# DataTable s.
  2. Because the CSVs are large, looping through each record in the join is a bad solution for performance reasons. I've already tried record looping and it's just too slow. I get great performance on the join above, but I can't find a way to have it return just the columns I want (specified by the caller) without looping records.
  3. If I need to loop over columns in the join, that is perfectly fine, I just don't want to loop rows.
  4. I want to be able to pass in an array of column names and return just those columns in the resulting DataTable . If both datatables being passed in happen to have a column named the same, and if that column is in my array of column names, just pass back either column because the data will be the same between the 2 columns in that case.
  5. If I need to pass in 2 arrays (1 for each datatable's desired columns) that's fine, but 1 array of column names would be ideal.
  6. The column list cannot be static and hardcoded into the function. The reason is because my JoinDataTables() is called from many different places in my system in order to join a wide variety of CSVs-turned-datatables, and each CSV file has very different columns.
  7. I don't want all columns returned in the resulting DataTable -- just the columns I specify in the columns array.

So suppose, before calling JoinDataTables() , I have the following 2 datatables:

Table: T1
T1A T1B T1C T1D
==================
10  AA  H1  Foo1
11  AB  H1  Foo2
12  AA  H2  Foo1
13  AB  H2  Foo2

Table: T2
T2A T2X T2Y T2Z
==================
12  N1  O1  Yeah1
17  N2  O2  Yeah2
18  N3  O1  Yeah1
19  N4  O2  Yeah2

Now suppose we join these 2 tables like so: ON T1.T1A = T2.T2A

select * from [join]

and that yields this resultset:

T1A T1B T1C T1D   T2A T2X T2Y T2Z
====================================
12  AA  H2  Foo1  12  N1  O1  Yeah1

Notice that only 1 row is yielded by the join.

Now to the crux of my question. Suppose that for a given use case, I want to return only 4 columns from this join: T1A, T1D, T2A, and T2Y. So my resultset would then look like this:

T1A T1D   T2A  T2Y
==================
12  Foo1  12   O1

I'd like to be able to call my JoinDataTables function like so:

DataTable dt = JoinDataTables(dt1, dt2, "T1A", "T2A", new string[] {"T1A", "T1D", "T2A", "T2Y"});

Keeping in mind performance and the fact that I don't want to loop through records (because it's slow for large sets of data), how can this be accomplished? (The join is already working well, now I just need a correct select segment (whether via new{..} or whatever you think)).

I cannot accept a solution with a hardcoded column list inside the function. I have found examples of that approach all over SO .

Any ideas?

EDIT: I'd be ok getting ALL columns back every time, but every attempt I've made to include all columns has resulted in some kind of FULL OUTER JOIN or CROSS JOIN, returning orders of magnitude more records than it should. So, I'd be open to getting all columns back, as long as I don't get the cross join.

I'm not sure of the performance with 500k records, but here is an attempted solution.

Since you are combining two subsets of DataRow s from different tables, there are no easy operations that will create the subset or create a new DataTable from the subsets (though I have an extension method for flattening an IEnumerable<anon> where anon = new { DataRow1, DataRow2, ... } from a join, it would probably be slow for you).

Instead, I pre-create an answer DataTable with the columns requested and then use LINQ to build the value arrays to be added as the rows.

public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
    var rtnCols1 = dt1.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName)).ToList();
    var rc1 = rtnCols1.Select(dc => dc.ColumnName).ToList();
    var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1.Contains(dc.ColumnName)).ToList();
    var rc2 = rtnCols2.Select(dc => dc.ColumnName).ToList();

    var work = from dataRows1 in dt1.AsEnumerable()
               join dataRows2 in dt2.AsEnumerable()
               on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
               select (from c1 in rc1 select dataRows1[c1]).Concat(from c2 in rc2 select dataRows2[c2]).ToArray();

    var result = new DataTable();
    foreach (var rc in rtnCols1)
        result.Columns.Add(rc.ColumnName, rc.DataType);
    foreach (var rc in rtnCols2)
        result.Columns.Add(rc.ColumnName, rc.DataType);

    foreach (var rowVals in work)
        result.Rows.Add(rowVals);

    return result;
}

Since you were using query syntax, I did as well, but normally I would probably do the select like so:

select rc1.Select(c1 => dataRows1[c1]).Concat(rc2.Select(c2 => dataRows2[c2])).ToArray();

Updated: It is probably worthwhile to use the column ordinals instead of the names to index into each DataRow by replacing the definitions of rc1 and rc2 :

var rc1 = rtnCols1.Select(dc => dc.Ordinal).ToList();
var rc1Names = rtnCols1.Select(dc => dc.ColumnName).ToHashSet();
var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1Names.Contains(dc.ColumnName)).ToList();
var rc2 = rtnCols2.Select(dc => dc.Ordinal).ToList();

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM