
Read millions of records from SQL Server, process them, and insert them into another SQL Server

I have a table with the following structure:

CREATE TABLE [dbo].[case_waveform_data] (
    [case_id]                INT             NOT NULL,
    [channel_index]          INT             NOT NULL,
    [seconds_between_points] REAL            NOT NULL,
    [last_time_stamp]        DATETIME        NOT NULL,
    [value_array]            VARBINARY (MAX) NULL
);

This table will contain millions of records. I want to read the data by case_id and channel_index from one database, group roughly five minutes' worth of records together by decompressing each value_array and concatenating the values, compress the combined stream, and then insert the combined chunk into another database.

The code I have works fine up to about 100k records. Beyond that I get random errors such as an out-of-memory error in System.Data, a CRC mismatch on compressed/uncompressed data, or an invalid character in the compressed data. The failures happen at random whenever I go larger than 100k.

I was using LINQ to loop through the records but then switched to using SqlDataReader directly. To import the records I use SqlBulkCopy, but I have found I get the errors even with that part commented out. It appears that if I write each combined record out to a file as an INSERT statement the code will complete, but if I start collecting the combined records into a list that I pass to SqlBulkCopy for the insert, I get the random errors. Most of the time it is an out-of-memory error on reader.Read() (or on the foreach (var record in records) line when using LINQ). The memory usage of the process itself is around 80 MB for working set, private bytes, and commit.

Any ideas on what I am doing wrong? Is there a better way to accomplish this? If I go the file route, the file I write out can be up to ~300 MB; can I load a file that size?

Here is the whole function. It has been rewritten about 20 times so there might be some strange code in there:

using (LiveORDataContext dc = new LiveORDataContext(LiveORDataManager.ConnectionString))
{
    dc.Log = Console.Out;
    dc.ObjectTrackingEnabled = false;

    Stopwatch sw = Stopwatch.StartNew();

    int recordcount = 0;
    // Increase the timeout to 10 minutes for really big cases
    dc.CommandTimeout = 600;
    //Dictionary<int, int> channelindexes = dc.case_waveform_datas.Where(d => d.case_id == livecaseid).GroupBy(d => d.channel_index).ToDictionary(d => d.Key, d => d.Count());

    // get a distinct list of all the channel indexes we need to import for this case
    List<int> channelindexes = (from wd in dc.case_waveform_datas
                                where wd.case_id == livecaseid
                                group wd by wd.channel_index into grp
                                select grp.Key)
                               .ToList();

    // Loop through each channel's data for the case, combine it and compress it
    foreach (int channel in channelindexes)
    {
        List<case_waveform_data> wavedatalist = new List<case_waveform_data>();
        int warehouserecordcount = 0;
        float secondsbetweenpoints = float.NaN;
        DateTime lastaddedrecordtime = DateTime.MinValue;
        DateTime previoustime = DateTime.MinValue;
        List<float> wfpoints = new List<float>();

        string queryString = String.Format("SELECT case_id, channel_index, last_time_stamp, seconds_between_points, " +
                                           "value_array FROM case_waveform_data " +
                                           "WHERE case_id = {0} and channel_index = {1} " +
                                           "ORDER BY last_time_stamp", 
            livecaseid, channel);

        using (SqlConnection connection = new SqlConnection(LiveORDataManager.ConnectionString))
        {
            SqlCommand command = new SqlCommand(queryString, connection);
            connection.Open();

            SqlDataReader reader = command.ExecuteReader();

            // Call Read before accessing data. 
            while (reader.Read()) // Currently fails here
            {
                var item = new
                {
                   case_id = reader.GetInt32(0),
                   channel_index = reader.GetInt32(1),
                   last_time_stamp = reader.GetDateTime(2),
                   seconds_between_points = reader.GetFloat(3),
                   value_array = (byte[])reader["value_array"]
                };                    

            //var wdlist = from wfd in dc.case_waveform_datas
            //    where wfd.case_id == livecaseid && wfd.channel_index == channel
            //    orderby wfd.last_time_stamp
            //    select new
            //           {
            //               wfd.case_id,
            //               wfd.channel_index,
            //               wfd.last_time_stamp,
            //               wfd.seconds_between_points,
            //               wfd.value_array
            //           };

            // Loop through each channel and create floating point arrays that are larger than
            // per second groups.
            //foreach (var item in wdlist)
            //{
            // Get a record count for the info log
            recordcount++;

            if (float.IsNaN(secondsbetweenpoints))
            {
                secondsbetweenpoints = item.seconds_between_points > 0.0f
                    ? item.seconds_between_points
                    : 0.002f;
            } // assume .002 as a default if this is not set

            if (lastaddedrecordtime == DateTime.MinValue)
            {
                lastaddedrecordtime = item.last_time_stamp;
            }
            if (previoustime == DateTime.MinValue)
            {
                previoustime = item.last_time_stamp;
            }

            if ((secondsbetweenpoints != item.seconds_between_points && item.seconds_between_points > 0.0f) ||
                item.last_time_stamp > lastaddedrecordtime.AddMinutes(5))
            {
                // The seconds between points has changed so gzip the array of 
                // floats and insert the record.
                var ms = new MemoryStream();
                using (var gZipStream = new GZipStream(ms, CompressionMode.Compress))
                {
                    new BinaryFormatter().Serialize(gZipStream, wfpoints.ToArray());
                }

                // add the new combined record to a list that will be bulk inserted every 1000 records
                wavedatalist.Add(
                    //dcwarehouse.case_waveform_datas.InsertOnSubmit(
                    new case_waveform_data
                    {
                        case_id = warehousecaseid,
                        channel_index = channel,
                        seconds_between_points = secondsbetweenpoints,
                        last_time_stamp = previoustime,
                        value_array = ms.ToArray()
                    });
                if (writeFile) { writer.WriteLine("(@caseid, {0}, {1}, '{2}', 0x{3}),", channel, secondsbetweenpoints, previoustime, BitConverter.ToString(ms.ToArray()).Replace("-", string.Empty)); }
                ms.Close();
                wfpoints.Clear();
                secondsbetweenpoints = item.seconds_between_points;
                lastaddedrecordtime = item.last_time_stamp;

                // To keep memory down submit the changes to the warehouse database more often
                // than after the whole channel's data has been prepared. This handles cases
                // that have run for multiple days
                warehouserecordcount++;
                if (warehouserecordcount > 300)
                {
                    BulkInsertAll(wavedatalist);
                    wavedatalist.Clear();
                    warehouserecordcount = 0;
                    Console.WriteLine("Recordcount: {0}", recordcount);
                }
            }

            // Decompress the float values and append them
            var ms1 = new MemoryStream(item.value_array);
            using (var gZipStream = new GZipStream(ms1, CompressionMode.Decompress))
            {
                // Decompress the float array
                float[] wd = (float[])new BinaryFormatter().Deserialize(gZipStream);

                // determine the timestamp of the first float given the timestamp of the last float,
                // the number of elements and the seconds between floats
                var listfirsttimestamp =
                    item.last_time_stamp.AddSeconds((wd.Length - 1) * secondsbetweenpoints * -1);

                // if the last time of the previous list + the seconds between is still 
                // less than the new list's first time then add in NaNs
                while (previoustime.AddSeconds(secondsbetweenpoints) < listfirsttimestamp)
                {
                    wfpoints.Add(float.NaN);
                    previoustime = previoustime.AddSeconds(secondsbetweenpoints);
                }

                // now append the list
                wfpoints.AddRange(wd);
            }
            ms1.Close();
            previoustime = item.last_time_stamp;

        //}
            }

            // Call Close when done reading.
            reader.Close();
        }
        // If there are any points left for the channel add them here
        if (wfpoints.Any())
        {
            var ms = new MemoryStream();
            using (var gZipStream = new GZipStream(ms, CompressionMode.Compress))
            {
                new BinaryFormatter().Serialize(gZipStream, wfpoints.ToArray());
            }

            wavedatalist.Add(
                new case_waveform_data
                {
                    case_id = warehousecaseid,
                    channel_index = channel,
                    seconds_between_points = secondsbetweenpoints,
                    last_time_stamp = previoustime,
                    value_array = ms.ToArray()
                });
            if (writeFile) { writer.WriteLine("(@caseid, {0}, {1}, '{2}', 0x{3}),", channel, secondsbetweenpoints, previoustime, BitConverter.ToString(ms.ToArray()).Replace("-", string.Empty)); }
            ms.Close();
        }

        if (wavedatalist.Count > 0)
        {
            BulkInsertAll(wavedatalist);
            wavedatalist.Clear();
        }
        Console.WriteLine("Recordcount: {0}", recordcount);
    }

    sw.Stop();
    logger.Info("Livecase: [{0}], Warehouse Caseid: [{1}], Recordcount: [{2}]. Waveform data import took [{3}ms]",
        livecaseid, warehousecaseid, recordcount, sw.ElapsedMilliseconds);
}

if (writeFile)
{
    writer.Close();
}

EDIT: Here is one of the errors. It happens on this line:

 var item = new
               {
                   case_id = reader.GetInt32(0),
                   channel_index = reader.GetInt32(1),
                   last_time_stamp = reader.GetDateTime(2),
                   seconds_between_points = reader.GetFloat(3),
                   value_array = (byte[])reader["value_array"]
               };

and here is the stack trace:

System.InvalidOperationException - Internal connection fatal error.
at System.Data.SqlClient.TdsParserStateObject.TryProcessHeader()
at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
at System.Data.SqlClient.TdsParserStateObject.TryReadByteArray(Byte[] buff, Int32 offset, Int32 len, Int32& totalRead)
at System.Data.SqlClient.TdsParserStateObject.TryReadPlpBytes(Byte[]& buff, Int32 offst, Int32 len, Int32& totalBytesRead)
at System.Data.SqlClient.TdsParser.TryReadSqlValue(SqlBuffer value, SqlMetaDataPriv md, Int32 length, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlDataReader.TryReadColumnInternal(Int32 i, Boolean readHeaderOnly)
at System.Data.SqlClient.SqlDataReader.TryReadColumn(Int32 i, Boolean setTimeout, Boolean allowPartiallyReadColumn)
at System.Data.SqlClient.SqlDataReader.GetValueInternal(Int32 i)
at System.Data.SqlClient.SqlDataReader.GetValue(Int32 i)
at System.Data.SqlClient.SqlDataReader.get_Item(String name)
at LiveOR.Data.AccessLayer.LiveORDataManager.ImportWaveformDataLiveToWarehouse(Int32 livecaseid, Int32 warehousecaseid, String backupfilepath) in c:\SRC\LiveOR\LiveOR.Data\LiveORDataManager.cs:line 2416
at VisionSupport.Scheduler.Start() in c:\SRC\LiveOR\VisionSupport\Scheduler.cs:line 90

The OutOfMemoryException happens on the above line as well. Here is that stack trace:

at System.Data.SqlClient.TdsParserStateObject.TryReadPlpBytes(Byte[]& buff, Int32 offst, Int32 len, Int32& totalBytesRead)
at System.Data.SqlClient.TdsParser.TryReadSqlValue(SqlBuffer value, SqlMetaDataPriv md, Int32 length, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlDataReader.TryReadColumnInternal(Int32 i, Boolean readHeaderOnly)
at System.Data.SqlClient.SqlDataReader.TryReadColumn(Int32 i, Boolean setTimeout, Boolean allowPartiallyReadColumn)
at System.Data.SqlClient.SqlDataReader.GetValueInternal(Int32 i)
at System.Data.SqlClient.SqlDataReader.GetValue(Int32 i)
at System.Data.SqlClient.SqlDataReader.get_Item(String name)
at LiveOR.Data.AccessLayer.LiveORDataManager.ImportWaveformDataLiveToWarehouse(Int32 livecaseid, Int32 warehousecaseid, String backupfilepath) in c:\SRC\LiveOR\LiveOR.Data\LiveORDataManager.cs:line 2419

EDIT 2:

Here is another random one. I get these just by rerunning the same code.

Line:

float[] wd = (float[])new BinaryFormatter().Deserialize(gZipStream);

Exception:

SerializationException: Binary stream '75' does not contain a valid BinaryHeader. Possible causes are invalid stream or object version change between serialization and deserialization.

Stack Trace:

at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run()
at System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)
at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)
at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream)
at LiveOR.Data.AccessLayer.LiveORDataManager.ImportWaveformDataLiveToWarehouse(Int32 livecaseid, Int32 warehousecaseid, String backupfilepath) in c:\SRC\LiveOR\LiveOR.Data\LiveORDataManager.cs:line 2516

Try putting var ms = new MemoryStream(); in a using block

See the documentation on MemoryStream

MemoryStream

Closes the current stream and releases any resources (such as sockets and file handles) associated with the current stream. Instead of calling this method, ensure that the stream is properly disposed. (Inherited from Stream.)

Stream.Close

You can declare Stream objects within a using block (or Using block in Visual Basic) to ensure that the stream and all of its resources are disposed, or you can explicitly call the Dispose method.

I originally was going to suggest you weren't closing the Reader, but I see the Close() now.
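
As a minimal sketch of that suggestion (reusing the variable names from the question's code), the compress step with both streams in using blocks would look roughly like this:

    // Compress the accumulated float values; both streams are disposed deterministically.
    byte[] compressed;
    using (var ms = new MemoryStream())
    {
        using (var gZipStream = new GZipStream(ms, CompressionMode.Compress))
        {
            new BinaryFormatter().Serialize(gZipStream, wfpoints.ToArray());
        }
        // The gzip footer is only written when the GZipStream is closed,
        // so grab the buffer after the inner using block completes.
        compressed = ms.ToArray();
    }

    wavedatalist.Add(new case_waveform_data
    {
        case_id = warehousecaseid,
        channel_index = channel,
        seconds_between_points = secondsbetweenpoints,
        last_time_stamp = previoustime,
        value_array = compressed
    });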

There is a lot going on here, and the streams would definitely be the first place to look, but ADO.NET 4.5 has some new features that let you read the data columns sequentially within each row without buffering them and, additionally, let you read a byte array as a stream without buffering it all in memory.

Might be worth a read.
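
A minimal sketch of that approach, assuming the question's query and column order, using CommandBehavior.SequentialAccess and SqlDataReader.GetStream (requires .NET 4.5):

    using (SqlConnection connection = new SqlConnection(LiveORDataManager.ConnectionString))
    using (SqlCommand command = new SqlCommand(queryString, connection))
    {
        connection.Open();

        // SequentialAccess streams each row column-by-column instead of buffering the whole row.
        using (SqlDataReader reader = command.ExecuteReader(CommandBehavior.SequentialAccess))
        {
            while (reader.Read())
            {
                // With SequentialAccess, columns must be read in select-list order.
                int caseId = reader.GetInt32(0);
                int channelIndex = reader.GetInt32(1);
                DateTime lastTimeStamp = reader.GetDateTime(2);
                float secondsBetween = reader.GetFloat(3);

                // Stream the varbinary(max) column (ordinal 4) instead of materializing one big byte[].
                using (Stream blob = reader.GetStream(4))
                using (var gZipStream = new GZipStream(blob, CompressionMode.Decompress))
                {
                    float[] wd = (float[])new BinaryFormatter().Deserialize(gZipStream);
                    // ... append wd to wfpoints exactly as in the original loop ...
                }
            }
        }
    }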

The easier way is a single approach: SqlBulkCopy plus reflection with Entity Framework.

Start by filtering the data with Take and Skip in batches of 2000/3000/5000 records (see the sketch after these steps);

then use reflection to build a DataTable and pass it to SqlBulkCopy inside a transaction, using the transaction to prevent issues.

Log each transaction so you know which records have not been imported if a batch fails.

Proceed in that manner until you have completed your task; it will take very little time.
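
For example, the batching loop might look roughly like this (a C# sketch; dbContext, SourceRecords, Id, and BulkInsert are placeholder names, and the answer's own helpers below are VB.NET):

    const int batchSize = 2500;
    int skip = 0;
    while (true)
    {
        // Pull one page of records so only batchSize entities are in memory at a time.
        // A stable ORDER BY is required for Skip/Take paging to be deterministic.
        var batch = dbContext.SourceRecords
                             .OrderBy(r => r.Id)
                             .Skip(skip)
                             .Take(batchSize)
                             .ToList();

        if (batch.Count == 0)
            break;

        // Convert the batch to a DataTable (e.g. with the reflection helper below) and
        // bulk copy it inside a transaction; log the range so a failed batch can be retried.
        BulkInsert(batch.AsEnumerable());

        skip += batchSize;
    }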

Here is an example that builds a DataTable from a list of entities. Take care that the list of objects passed to the following function is IEnumerable, so once you have filtered your data and called .ToList, don't forget to call .AsEnumerable with a statement like this:

  Lstads.Select(Function(x) x).AsEnumerable

so that you can pass the result of your previous query to this function:

   Public Function EQToDataTable(ByVal parIList As System.Collections.IEnumerable) As System.Data.DataTable
    Dim ret As New System.Data.DataTable()
    Try
        Dim ppi As System.Reflection.PropertyInfo() = Nothing
        If parIList Is Nothing Then Return ret
        For Each itm In parIList
            If ppi Is Nothing Then
                ppi = DirectCast(itm.[GetType](), System.Type).GetProperties()
                For Each pi As System.Reflection.PropertyInfo In ppi
                    Dim colType As System.Type = pi.PropertyType

                    If (colType.IsGenericType) AndAlso
                       (colType.GetGenericTypeDefinition() Is GetType(System.Nullable(Of ))) Then colType = colType.GetGenericArguments()(0)

                    ret.Columns.Add(New System.Data.DataColumn(pi.Name, colType))
                Next
            End If
            Dim dr As System.Data.DataRow = ret.NewRow
            For Each pi As System.Reflection.PropertyInfo In ppi
                dr(pi.Name) = If(pi.GetValue(itm, Nothing) Is Nothing, DBNull.Value, pi.GetValue(itm, Nothing))
            Next
            ret.Rows.Add(dr)
        Next
        For Each c As System.Data.DataColumn In ret.Columns
            c.ColumnName = c.ColumnName.Replace("_", " ")
        Next
    Catch ex As Exception
        ret = New System.Data.DataTable()
        Dim lg As New EADSCORE.Helpers.CustomLogger(False)
        lg.WriteLog(ex)
    End Try
    Return ret
End Function

Here is an example of using SqlBulkCopy with a transaction:

   Public Sub BulkInserTest(ByVal list As System.Collections.IEnumerable)
    Dim hasElement = False
    For Each el In list
        hasElement = True
        Exit For
    Next
    If hasElement = True Then
        Dim dt As DataTable = EQToDataTable(list)

        Using cnn As New SqlClient.SqlConnection(ConfigurationManager.ConnectionStrings("BUCLCNN").ConnectionString)
            cnn.Open()
            Using tr As SqlClient.SqlTransaction = cnn.BeginTransaction
                Using sqlbulk As New SqlClient.SqlBulkCopy(cnn, SqlBulkCopyOptions.KeepIdentity, tr)
                    With sqlbulk
                        .DestinationTableName = "Ads"
                        .BatchSize = 2500
                        For Each el As DataColumn In dt.Columns
                            If el.ColumnName = "IDAds" Or el.ColumnName = "Province" Or el.ColumnName = "SubCategory" Or el.ColumnName = "AdsComments" Or el.ColumnName = "CarDetails" Or el.ColumnName = "HomeDetails" Or el.ColumnName = "Images" Or el.ColumnName = "Customer" Then
                                ' not mapped - skip these columns
                            Else
                                Dim map As New SqlBulkCopyColumnMapping(el.ColumnName, el.ColumnName)
                                .ColumnMappings.Add(map)
                            End If
                        Next
                        Try
                            If dt.Rows.Count > 0 Then
                                .WriteToServer(dt)
                                tr.Commit()
                            End If
                        Catch ex As Exception
                            tr.Rollback()
                            Dim lg As New EADSCORE.Helpers.CustomLogger(False)
                            lg.WriteLog(ex)
                        End Try
                    End With
                End Using
            End Using
            Dim cmd As New SqlCommand("Update Ads Set Article=replace(Article,'&amp;','&');Update Ads Set Title=replace(Article,'&amp;','&')", cnn)
            cmd.ExecuteNonQuery()
        End Using
    End If

End Sub

The code above will need some modification, because it contains additional filters (the If on the column names, and so on) specific to my needs, but it works just as well :)

Enjoy

NOTE: I don't know what your entity types are, so you will have to check the column mappings to be sure everything works fine :)

If it solves your issue, please mark it as the answer.
