
SqlBulkCopy WriteToServer with an IDataReader instead of DataTable and Programmatically Adjusted Field Values

We have working code in C# that uses SqlBulkCopy to insert records into a table from a stored procedure source. At a high level:

  1. Reads data from a stored procedure that puts the records into a DataTable. Essentially calls the SP and uses a DataAdapter to fill the DataTable. Let's call this srcDataTable.
  2. Dynamically maps the column names between source and destination through configuration, using a table similar to the following:
| TargetTableName | ColumnFromSource | ColumnInDestination | DefaultValue | Formatting |
|-----------------|------------------|---------------------|--------------|------------|
| TableA          | StudentFirstName | FirstName           | NULL         | NULL       |
| TableA          | StudentLastName  | LastName            | NULL         | NULL       |
| TableA          | Birthday         | Birthdate           | 1/1/1900     | dd/MM/yyyy |
  3. Based on the mapping from #2, set up new rows from srcDataTable using .NewRow() on another DataTable that matches the structure of the destination table (where ColumnInDestination applies). Let's call this targetDataTable. As you can see from the table, there may be cases where the value from the source is not specified, or needs to be formatted a certain way. This is the primary reason we have to add data rows on the fly to another data table; the adjustment and defaulting of the values is handled in code (see the sketch after this list).
  4. Call SqlBulkCopy to write all the rows in targetDataTable to the actual SQL table.
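
For illustration, here is a minimal sketch of step 3, assuming the configuration table has been loaded into a hypothetical mappings list with SourceColumn, DestColumn, DefaultValue and Format members; the actual production code differs:

// A minimal sketch of step 3, not the actual production code.
// "mappings" is a hypothetical list built from the configuration table
// (SourceColumn, DestColumn, DefaultValue, Format).
foreach (DataRow src in srcDataTable.Rows)
{
    DataRow dest = targetDataTable.NewRow();

    foreach (var map in mappings)
    {
        object value = map.SourceColumn != null ? src[map.SourceColumn] : DBNull.Value;

        // Default the value if the source didn't supply one and a default is configured
        if (value == DBNull.Value && map.DefaultValue != null)
            value = map.DefaultValue;

        // Apply a formatting rule, e.g. dd/MM/yyyy for dates
        if (map.Format != null && value is DateTime dt)
            value = dt.ToString(map.Format);

        dest[map.DestColumn] = value;
    }

    targetDataTable.Rows.Add(dest);
}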

This approach has been working all right in tandem with stored procedures that use FETCH and OFFSET, so they only return X rows at a time to deal with memory constraints. Unfortunately, as we take on more and more data sources north of 50 million rows, and as we have to share servers, we need a faster way to do this while keeping memory consumption in check. Researching options, it looks like using an IDataReader with SqlBulkCopy would let us limit the code's memory consumption without having to delegate fetching X records at a time to the stored procedure itself.
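
For reference, a hedged sketch of the streaming approach being considered: the SqlDataReader from the stored procedure is handed straight to SqlBulkCopy so rows are never buffered in a DataTable. The connection strings, procedure name and table name below are placeholders:

// Sketch only: stream rows from the source procedure straight into the destination table.
using var srcConn = new SqlConnection(sourceConnectionString);
using var cmd = new SqlCommand("dbo.usp_GetSourceRows", srcConn) { CommandType = CommandType.StoredProcedure };
srcConn.Open();
using var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);

using var bcp = new SqlBulkCopy(destinationConnectionString)
{
    DestinationTableName = "TableA",
    EnableStreaming = true,   // stream from the reader instead of buffering rows
    BatchSize = 10000         // commit in batches to keep memory and log growth in check
};
bcp.WriteToServer(reader);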

In terms of preserving current functionality, it looks like we can use SqlBulkCopy's column mappings (SqlBulkCopyColumnMapping) to keep the fields mapped even if they're named differently. What I can't confirm, however, is the defaulting or formatting of the values.
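
For example, a minimal sketch of mapping the differently named columns from the configuration table above; the connection and reader variables are assumed to already exist:

// Map source column names to differently named destination columns.
using var bcp = new SqlBulkCopy(connection);
bcp.DestinationTableName = "TableA";

bcp.ColumnMappings.Add("StudentFirstName", "FirstName");
bcp.ColumnMappings.Add("StudentLastName", "LastName");
bcp.ColumnMappings.Add("Birthday", "Birthdate");

bcp.WriteToServer(reader);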

Is there a way to extend the DataReader's Read() method so that we can introduce that same logic to revise whatever value gets written to the destination when configuration asks us to? That is: a) check whether the current row has a value populated from the source, b) default its value for the destination table if configured, and c) apply formatting rules as it gets written to the destination table.

You appear to be asking "can I make my own class that implements IDataReader and has some altered logic in the Read() method?"

The answer's yes; you can write your own data reader that does whatever it likes in Read(), even format the server's hard disk as soon as it's called. When you implement an interface you aren't "extend[ing] the DataReader's Read method"; you're providing your own implementation that externally appears to obey a specific contract, but the implementation detail is entirely up to you. If you want, upon every Read, to pull a row from db X into a temp array and zip through the array tweaking the values to apply defaults or other adjustments before returning true, that's fine.

..and if you wanted to do the value adjustment in the GetXXX methods, then that's also fine; you're writing the reader, so you decide. All the bulk copier is going to do is call Read until it returns false and write the data it gets from e.g. GetValue. (If it wasn't immediately clear: Read doesn't produce the data that will be written, GetValue does. Read is just an instruction to move to the next set of data that must be written, and it doesn't even have to do that. You could implement it as { return DateTime.Now.DayOfWeek == DayOfWeek.Monday; } and GetValue as { return Guid.NewGuid().ToString(); }, and your copy operation would spend until 23:59:59.999 filling the database with GUIDs, but only on Mondays.)
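
To make that concrete, here is a minimal sketch (not production code) of a pass-through IDataReader that applies per-column default/format rules before the bulk copier sees each value. The rules dictionary, keyed by column ordinal, stands in for whatever mapping configuration you actually have:

using System;
using System.Collections.Generic;
using System.Data;

public sealed class TransformingDataReader : IDataReader
{
    private readonly IDataReader inner;
    private readonly Dictionary<int, Func<object, object>> rules; // ordinal -> default/format rule

    public TransformingDataReader(IDataReader inner, Dictionary<int, Func<object, object>> rules)
    {
        this.inner = inner;
        this.rules = rules;
    }

    // SqlBulkCopy mainly calls Read, FieldCount, GetOrdinal and GetValue.
    public bool Read() => inner.Read();
    public int FieldCount => inner.FieldCount;
    public int GetOrdinal(string name) => inner.GetOrdinal(name);

    public object GetValue(int i)
    {
        object value = inner.GetValue(i);
        return rules.TryGetValue(i, out var rule) ? rule(value) : value;
    }

    public bool IsDBNull(int i) => GetValue(i) is null || GetValue(i) is DBNull;
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));
    public int GetValues(object[] values)
    {
        for (int i = 0; i < FieldCount; i++) values[i] = GetValue(i);
        return FieldCount;
    }

    // Everything else just forwards to the wrapped reader.
    public string GetName(int i) => inner.GetName(i);
    public Type GetFieldType(int i) => inner.GetFieldType(i);
    public string GetDataTypeName(int i) => inner.GetDataTypeName(i);
    public DataTable GetSchemaTable() => inner.GetSchemaTable();
    public bool NextResult() => inner.NextResult();
    public int Depth => inner.Depth;
    public bool IsClosed => inner.IsClosed;
    public int RecordsAffected => inner.RecordsAffected;
    public void Close() => inner.Close();
    public void Dispose() => inner.Dispose();
    public bool GetBoolean(int i) => (bool)GetValue(i);
    public byte GetByte(int i) => (byte)GetValue(i);
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) => inner.GetBytes(i, fieldOffset, buffer, bufferOffset, length);
    public char GetChar(int i) => (char)GetValue(i);
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) => inner.GetChars(i, fieldOffset, buffer, bufferOffset, length);
    public IDataReader GetData(int i) => inner.GetData(i);
    public DateTime GetDateTime(int i) => (DateTime)GetValue(i);
    public decimal GetDecimal(int i) => (decimal)GetValue(i);
    public double GetDouble(int i) => (double)GetValue(i);
    public float GetFloat(int i) => (float)GetValue(i);
    public Guid GetGuid(int i) => (Guid)GetValue(i);
    public short GetInt16(int i) => (short)GetValue(i);
    public int GetInt32(int i) => (int)GetValue(i);
    public long GetInt64(int i) => (long)GetValue(i);
    public string GetString(int i) => (string)GetValue(i);
}

A reader like this can wrap the SqlDataReader returned by the stored procedure and be passed straight to bcp.WriteToServer(...).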

The question is a bit unclear. It looks like the actual question is whether it's possible to transform data before using SqlBulkCopy with a data reader.

There are a lot of ways to do it, and the appropriate one depends on what the rest of the ETL code does. Does it only work with data readers? Or does it load batches of rows that can be modified in memory?

Use IEnumerable<> and ObjectReader

FastMember's ObjectReader class creates an IDataReader wrapper over any IEnumerable<T> collection. This means that both strongly-typed .NET collections and iterator results can be sent to SqlBulkCopy.

IEnumerable<string> lines = File.ReadLines(filePath);

using (var bcp = new SqlBulkCopy(connection))
using (var reader = ObjectReader.Create(lines, "FileName"))
{
    bcp.DestinationTableName = "SomeTable";
    bcp.WriteToServer(reader);
}

It's possible to create a transformation pipeline using LINQ queries and iterator methods this way, and feed the result to SqlBulkCopy using ObjectReader. The code is a lot simpler than trying to create a custom IDataReader.

In this example, Dapper can be used to return query results as an IEnumerable<>:

IEnumerable<Order> orders = connection.Query<Order>("select ... where category=@category",
                                                    new { category = "Cars" });

var ordersWithDate = orders.Select(ord => new OrderWithDate {
    ....
    SaleDate = DateTime.Parse(ord.DateString, CultureInfo.GetCultureInfo("en-GB"))
});

using var reader = ObjectReader.Create(ordersWithDate, "Id", "SaleDate", ...);
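
The wrapped enumerable can then be fed to SqlBulkCopy just like in the earlier snippet; the connection and the "Orders" destination table name here are assumed:

using var bcp = new SqlBulkCopy(connection) { DestinationTableName = "Orders" };
bcp.WriteToServer(reader);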

Custom transforming data readers

It's also possible to create custom data readers by implementing the IDataReader interface. Libraries like ExcelDataReader and CsvHelper provide such wrappers over their results; CsvHelper's CsvDataReader creates an IDataReader wrapper over the parsed CSV results. The downside is that IDataReader has a lot of methods to implement. GetSchemaTable will have to be implemented to provide column and type information to later transformation steps and to SqlBulkCopy.

IDataReader may be dynamic, but it requires adding a lot of hand-coded type information to work. In CsvDataReader most methods just forward the call to the underlying CsvReader, e.g.:

public long GetInt64(int i)
{
    return csv.GetField<long>(i);
}

public string GetName(int i)
{
    return csv.Configuration.HasHeaderRecord
        ? csv.HeaderRecord[i]
        : string.Empty;
}

But GetSchemaTable() is 70 lines, with defaults that aren't optimal. Why use string as the column type when the parser can already parse date and numeric data, for example?

One way to get around this is to create a new custom IDataReader using a copy of the previous reader's schema table and adding the extra columns. CsvDataReader's constructor accepts a DataTable schemaTable parameter to handle cases where its own GetSchemaTable isn't good enough. That DataTable could be modified to add extra columns:

    /// <param name="csv">The CSV.</param>
    /// <param name="schemaTable">The DataTable representing the file schema.</param>
    public CsvDataReader(CsvReader csv, DataTable schemaTable = null)
    {
        this.csv = csv;

        csv.Read();

        if (csv.Configuration.HasHeaderRecord)
        {
            csv.ReadHeader();
        }
        else
        {
            skipNextRead = true;
        }

        this.schemaTable = schemaTable ?? GetSchemaTable();
    }

A DerivedColumnReader could be created that does just that in its constructor:

public class DerivedColumnReader<TSource, TResult> : IDataReader
{
    public DerivedColumnReader(string sourceName, string targetName,
                               Func<TSource, TResult> func, DataTable schemaTable)
    {
        ...
        AddSchemaColumn(schemaTable, targetName);
        _schemaTable = schemaTable;
    }

void AddSchemaColumn(DataTable dt,string targetName)
{
    var row = dt.NewRow();
    row["AllowDBNull"] = true;
    row["BaseColumnName"] = targetName;
    row["ColumnName"] = targetName;
    row["ColumnMapping"] = MappingType.Element;              
    row["ColumnOrdinal"] = dt.Rows.Count+1;
    row["DataType"] = typeof(TResult);

    //20-30 more properties
    dt.Rows.Add(row);
}

That's a lot of boilerplate that LINQ eliminates.

Just providing closure to this. The main question really is how we can avoid running into out-of-memory exceptions when fetching data from SQL without employing FETCH and OFFSET in the stored procedure. The resolution didn't require getting fancy with a custom reader similar to SqlDataReader, but rather adding count checking and calling SqlBulkCopy in batches. The code is similar to what's written below:

using (var dataReader = sqlCmd.ExecuteReader(CommandBehavior.SequentialAccess))
{
    int rowCount = 0;

    while (dataReader.Read())
    {
        DataRow dataRow = SourceDataSet.Tables[source.ObjectName].NewRow();
        for (int i = 0; i < SourceDataSet.Tables[source.ObjectName].Columns.Count; i++)
        {
            dataRow[SourceDataSet.Tables[source.ObjectName].Columns[i]] = dataReader[i];
        }
        SourceDataSet.Tables[source.ObjectName].Rows.Add(dataRow);
        rowCount++;

        if (rowCount % recordLimitPerBatch == 0)
        {
            // Apply our field mapping
            ApplyFieldMapping();

            // Write it up
            WriteRecordsIntoDestinationSQLObject();

            // Remove from our dataset once we get to this point
            SourceDataSet.Tables[source.ObjectName].Rows.Clear();
        }
    }

    // Flush whatever is left over from the final partial batch
    if (SourceDataSet.Tables[source.ObjectName].Rows.Count > 0)
    {
        ApplyFieldMapping();
        WriteRecordsIntoDestinationSQLObject();
        SourceDataSet.Tables[source.ObjectName].Rows.Clear();
    }
}

Here ApplyFieldMapping() makes field-specific changes to the contents of the DataTable, and WriteRecordsIntoDestinationSQLObject() performs the SqlBulkCopy write into the destination table. This allowed us to call the stored procedure just once to fetch the data and let the loop keep memory in check by writing records out and clearing them whenever we hit the preset recordLimitPerBatch.
