How can I avoid losing the database connection when reading millions of rows with a DataReader?

Question

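I have a class library that reads data from a database table. The table lives in a client's database, and my application only has a connection string and a SQL query with which to open a connection, execute the query, read the data, and perform some operation on it. That operation is fairly complex (basically business rules).

The user submits the SQL query in a specific format, and my class library knows which columns to select from the query result.

I don't know how many records my class library will have to process. It could be 100, 200, or millions of rows.

Currently the class library is dealing with 90 million rows, which reside on Oracle. I am reading the data with SqlDataReader.

To avoid out-of-memory exceptions I read the data with a data reader. However, reading 90 million rows one by one and performing some operation on each record keeps the connection open for the whole run, and the problem I am currently facing is a lost connection:

ORA-03135: connection lost contact

One solution could be to read the data in chunks, but as I said, I don't know how many records I will have to process, and the SQL query is not in my hands: it is submitted by the user and picked up by my class library.

What can I do to avoid the connection issue?

Update: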

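using System;
using System.Data;
using System.Data.Common;

public class LongRunningTask : IDisposable
{
    public void Start(DbConnection connection, string sql)
    {
        using (var cmd = connection.CreateCommand())
        {
            cmd.CommandText = sql;
            cmd.CommandTimeout = 0; // 0 = wait indefinitely for the command
            connection.Open();
            // CloseConnection closes the connection when the reader is disposed
            using (var dr = cmd.ExecuteReader(CommandBehavior.CloseConnection))
            {
                while (dr.Read())
                {
                    // read records one by one and pass each to the algorithm for complex processing
                }
            }
        }
    }

    public void Dispose()
    {
        // nothing to release in this excerpt; the using blocks dispose the command and reader
    }
}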

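The algorithm is not slow; that is not the problem. The main issue is that the reading part is slow on Oracle, currently with 90 million rows.

I have tested against SQL Server with 100 million rows and did not run into this issue (apart from occasional transport-layer errors), even though the process took a long time. I only see the problem with Oracle.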

Answer 1

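Keeping a data reader open for hours is not a good idea. Even if everything is configured correctly, there can be a transient error somewhere on the wire (like the transport-layer error you mentioned).

You can add retry logic to your client code to make it more robust. One way to do this is to keep track of the last processed record and, when the connection fails, reconnect and "resume" from that position.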

private const int MAX_RETRY = 10;
private const int RETRY_INTERVAL_MS = 1000;
private string lastProcessedPosition = null; // updated by Process() as rows are handled

public void Start(string connString, string sql)
{
    var exceptions = new List<Exception>();
    for (var i = 0; i < MAX_RETRY; i++)
    {
        try
        {
            // Process() returns true once every row has been handled
            if (Process(connString, sql, lastProcessedPosition)) return;
        }
        catch (Exception ex)
        {
            // remember the failure, then retry from the last processed position
            exceptions.Add(ex);
        }
        System.Threading.Thread.Sleep(RETRY_INTERVAL_MS);
    }
    throw new AggregateException(exceptions);
}

Your Process() method would reconnect and skip the rows that have already been processed:

public bool Process(string connString, string sql, string resumeFromPosition = null)
{
    // CreateConnection is a placeholder for however you build your provider's DbConnection
    using (var connection = CreateConnection(connString))
    using (var cmd = connection.CreateCommand())
    {
        cmd.CommandText = sql;
        connection.Open();
        using (var dr = cmd.ExecuteReader())
        {
            if (resumeFromPosition != null)
            {
                while (dr.Read() && dr.ToPositionString() != resumeFromPosition)
                {
                    // skipping already processed records
                }
            }
            while (dr.Read())
            {
                // Do your complex processing

                // You can do this every N records if accuracy is not critical
                lastProcessedPosition = dr.ToPositionString();
            }
        }
    }
    return true;
}

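dr.ToPositionString() is an extension method that you write yourself to identify a row uniquely, based on your table schema. Note that resuming this way only works if the query returns its rows in the same order on every run.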
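As a rough illustration of such an extension method, here is a minimal sketch. It is not the answerer's code: the class name and the idea of passing the key column names explicitly are assumptions made for the example, so a call would look like dr.ToPositionString("ID") rather than the parameterless form used above.

```csharp
using System;
using System.Data;
using System.Globalization;
using System.Linq;

public static class DataRecordPositionExtensions
{
    // Hypothetical helper: builds a string that identifies the current row
    // from the columns that make it unique (for example the primary key).
    public static string ToPositionString(this IDataRecord record, params string[] keyColumns)
    {
        if (record == null) throw new ArgumentNullException(nameof(record));
        if (keyColumns == null || keyColumns.Length == 0)
            throw new ArgumentException("At least one key column is required.", nameof(keyColumns));

        var parts = keyColumns.Select(name =>
        {
            var ordinal = record.GetOrdinal(name);
            // Format with the invariant culture so the string does not depend on machine settings.
            return record.IsDBNull(ordinal)
                ? "<null>"
                : Convert.ToString(record.GetValue(ordinal), CultureInfo.InvariantCulture);
        });

        // Join the key values; pick a separator that cannot occur in your key data.
        return string.Join("|", parts);
    }
}
```

Because it only depends on IDataRecord, the same method should work with any ADO.NET reader, and it can be tried out against an in-memory DataTable reader without a database.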

Answer 2

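This is a solution I have used in the past to read large data sets from a database while processing them in chunks:

First, I chose to implement a method that gets the database connection. Note that I set ConnectionTimeout to 0 because I know this process will be long running.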

private static OracleConnection GetConnection()
{
    return new OracleConnection(new OracleConnectionStringBuilder
    {
        //TODO: Set other connection string properties
        ConnectionTimeout = 0
    }.ConnectionString);
}

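Next, I want a generic "GetData" method that reads data from the database. Note that its return type is explicitly "IEnumerable". You could strongly type it instead of keeping it generic, but it needs to keep returning IEnumerable in order to take advantage of "yield return".

Also note that I have set CommandTimeout = 0 because I know this process will be long running.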

public static IEnumerable<T> GetData<T>(string sql)
{
    using (var conn = GetConnection())
    {
        if (ConnectionState.Closed == conn.State) conn.Open();

        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandTimeout = 0;
            cmd.CommandType = CommandType.Text;
            cmd.CommandText = sql; //TODO: Make sure you do standard sql injection prevention

            using (var reader = cmd.ExecuteReader())
            {
                //We want to optimize the number of round trips to the DB our reader makes.
                //Setting the FetchSize this way will make the reader bring back 5000 records
                //with every trip to the DB
                reader.FetchSize = reader.RowSize * 5000;

                while (reader.Read())
                {
                    var values = new object[reader.FieldCount];
                    reader.GetValues(values);
                    //This assumes that type T has a constructor that takes in an object[]
                    //and the mappings of object[] to properties is done in that constructor
                    yield return (T)Activator.CreateInstance(typeof(T), new object[] { values });
                }
            }
        }
    }
}
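GetData assumes that T has a constructor taking an object[]. A minimal sketch of what such a type could look like follows; the CustomerRow name and the (ID, NAME) column order are made up for illustration and are not part of the original answer.

```csharp
using System;

// Hypothetical row type: the column order (ID, NAME) is an assumption about the query.
public sealed class CustomerRow
{
    public long Id { get; }
    public string Name { get; }

    // GetData<T> hands the raw column values to this constructor.
    public CustomerRow(object[] values)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));
        if (values.Length < 2) throw new ArgumentException("Expected at least two columns.", nameof(values));

        // Numeric columns may arrive as decimal, so convert instead of casting.
        Id = Convert.ToInt64(values[0]);
        // NULL columns arrive as DBNull.Value rather than null.
        Name = values[1] is DBNull ? null : Convert.ToString(values[1]);
    }
}
```

With a type like this, the call would be something along the lines of GetData<CustomerRow>("select id, name from customers"), where the table and column names are again only placeholders.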

Next, I want a method that implements the fancy business logic/algorithm:

public static void ProcessBusinessLogic<T>(IEnumerable<T> data)
{
    //TODO Implement fancy business logic here
}

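Finally, I need a method that wraps these two methods and uses them together. I also need a way to make sure I process the records in "chunks", so that I don't try to load millions of records into memory and crash the system.

To process the data in chunks, I use MoreEnumerable.Batch from the MoreLinq NuGet library. This lets me "chunk" the results of the GetData method into more appropriately sized "batches". With that in place, I can make sure I never load more records into memory than the batch size I set.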

public static void Main(string[] args)
{
    foreach (var batch in GetData<string>("hello world").Batch(50000))
    {
        ProcessBusinessLogic(batch);
    }
}

So, putting the whole thing together, this test application has 2 classes:

using System;
using System.Collections.Generic;
using System.Data;
using MoreLinq;
using Oracle.ManagedDataAccess.Client;

namespace ReadLargeDataset
{
    public class Program
    {
        public static void Main(string[] args)
        {
            foreach (var batch in GetData<string>("hello world").Batch(50000))
            {
                ProcessBusinessLogic(batch);
            }
        }

        public static void ProcessBusinessLogic<T>(IEnumerable<T> data)
        {
            //TODO Implement fancy business logic here
        }

        public static IEnumerable<T> GetData<T>(string sql)
        {
            using (var conn = GetConnection())
            {
                if (ConnectionState.Closed == conn.State) conn.Open();

                using (var cmd = conn.CreateCommand())
                {
                    cmd.CommandTimeout = 0;
                    cmd.CommandType = CommandType.Text;
                    cmd.CommandText = sql; //TODO: Make sure you do standard sql injection prevention

                    using (var reader = cmd.ExecuteReader())
                    {
                        //We want to optimize the number of round trips to the DB our reader makes.
                        //Setting the FetchSize this way will make the reader bring back 5000 records
                        //with every trip to the DB
                        reader.FetchSize = reader.RowSize * 5000;

                        while (reader.Read())
                        {
                            var values = new object[reader.FieldCount];
                            reader.GetValues(values);
                            //This assumes that type T has a constructor that takes in an object[]
                            //and the mappings of object[] to properties is done in that constructor
                            yield return (T)Activator.CreateInstance(typeof(T), new object[] { values });
                        }
                    }
                }
            }
        }

        private static OracleConnection GetConnection()
        {
            return new OracleConnection(new OracleConnectionStringBuilder
            {
                //TODO: Set other connection string properties
                ConnectionTimeout = 0
            }.ConnectionString);
        }
    }
}

And:

#region License and Terms
// MoreLINQ - Extensions to LINQ to Objects
// Copyright (c) 2009 Atif Aziz. All rights reserved.
// 
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// 
//     http://www.apache.org/licenses/LICENSE-2.0
// 
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#endregion

// ReSharper disable CheckNamespace
namespace MoreLinq
{
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    public static class MoreEnumerable
    {
        /// <summary>
        /// Batches the source sequence into sized buckets.
        /// </summary>
        /// <typeparam name="TSource">Type of elements in <paramref name="source"/> sequence.</typeparam>
        /// <param name="source">The source sequence.</param>
        /// <param name="size">Size of buckets.</param>
        /// <returns>A sequence of equally sized buckets containing elements of the source collection.</returns>
        /// <remarks> This operator uses deferred execution and streams its results (buckets and bucket content).</remarks>

        public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(this IEnumerable<TSource> source, int size)
        {
            return Batch(source, size, x => x);
        }

        /// <summary>
        /// Batches the source sequence into sized buckets and applies a projection to each bucket.
        /// </summary>
        /// <typeparam name="TSource">Type of elements in <paramref name="source"/> sequence.</typeparam>
        /// <typeparam name="TResult">Type of result returned by <paramref name="resultSelector"/>.</typeparam>
        /// <param name="source">The source sequence.</param>
        /// <param name="size">Size of buckets.</param>
        /// <param name="resultSelector">The projection to apply to each bucket.</param>
        /// <returns>A sequence of projections on equally sized buckets containing elements of the source collection.</returns>
        /// <remarks> This operator uses deferred execution and streams its results (buckets and bucket content).</remarks>

        public static IEnumerable<TResult> Batch<TSource, TResult>(this IEnumerable<TSource> source, int size,
            Func<IEnumerable<TSource>, TResult> resultSelector)
        {
            if (source == null) throw new ArgumentNullException(nameof(source));
            if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
            if (resultSelector == null) throw new ArgumentNullException(nameof(resultSelector));
            return BatchImpl(source, size, resultSelector);
        }

        private static IEnumerable<TResult> BatchImpl<TSource, TResult>(this IEnumerable<TSource> source, int size,
            Func<IEnumerable<TSource>, TResult> resultSelector)
        {
            Debug.Assert(source != null);
            Debug.Assert(size > 0);
            Debug.Assert(resultSelector != null);

            TSource[] bucket = null;
            var count = 0;

            foreach (var item in source)
            {
                if (bucket == null)
                {
                    bucket = new TSource[size];
                }

                bucket[count++] = item;

                // The bucket is fully buffered before it's yielded
                if (count != size)
                {
                    continue;
                }

                // Select is necessary so bucket contents are streamed too
                yield return resultSelector(bucket.Select(x => x));

                bucket = null;
                count = 0;
            }

            // Return the last bucket with all remaining elements
            if (bucket != null && count > 0)
            {
                yield return resultSelector(bucket.Take(count));
            }
        }
    }
}

Answer 3

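Short answer:

I have run into this before, and it was caused by firewall rules on my company's network.

Long answer and unsolicited advice:

I think the main problem you have is your application design. If you are dealing with millions of records, it can take a long time... a very long time, depending on what you have to do.
I worked on an application that encrypted 100 million card numbers at rest in a database, and it took 3 weeks to complete. Dealing with really big data is tricky; I ran into all sorts of problems. Here are some of my recommendations.

1) You will hear that your problem is the timeout settings. That may not be it. Where I work, we have firewall rules that kill database connections after a certain period (I can't remember whether it was 15 or 30 minutes), and it took us weeks to figure out why our connections were being dropped.

2) Pulling back millions of records at once is not a good idea.

3) You should put some SQL injection prevention in your code.

4) I recommend using an ORM such as Entity Framework, which makes looping and chunking easier.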

Answer 4

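Can't you fetch all the data, keep it in some in-memory object, and then release the connection to the database? After that, process your complex business rules, and once that is done and you need to write the data back, open the connection again and do a bulk update.

I hope that makes sense.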

Answer 5

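I would not advocate pulling that much data into memory, regardless of the connection state. For one thing, I am not sure how large a single entity is, or which data types were used when the columns were defined.

Please make sure of the following:

Do you really need the entire entity to execute the complex logic (i.e. to run the business rules)? Beyond that, how many columns does a complete entity actually have?
Is it possible to fetch only the data you need? (For example, only the columns that you map or that the business rules operate on. Try to avoid loading the whole database row unless you are sure every column is involved.)
Is there a way to tie this complex logic directly to the database records? If so, move those rules into the database and use stored procedures to execute and evaluate them in memory.

If I were in your place, I would do a mixture of things. Some parts of the problem would stay in the database, and where I felt it necessary to work in memory, I would do the work there.

Post some concrete logic and the data entity so that people can analyze the problem.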

Answer 6

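It looks like you either have a spotty connection to Oracle, or there is some policy that kills your process after a timeout of some kind (not on your side, but on the server). Both SQL Server and Oracle have the concept of a query governor, which can be configured in different ways.

There are strategies for getting past these obstacles... and they all revolve around batching your queries. The problem is how to make sure you pick up again where you left off.

Both Oracle and SQL Server have the idea of offset and fetch first n in a query. This was designed specifically for paged queries, which may help in your case.

The basic setup is: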

select 
  columns... 
from 
  data sources...
where 
  some conditions...
order by 
  some unique key...
offset @offset rows
fetch next @pageSize rows only

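...then you can arrange for your process to run the select repeatedly, passing new values for @offset and @pageSize as you go.

If your queries vary but are regular select-type statements, you can append the offset and fetch clauses to the query in your processing code.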
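One possible way to add the paging clauses in code is to wrap the submitted statement in an outer select, as in the sketch below. The helper name, its parameters, and the wrapping approach are assumptions made for illustration, not something from the original answer. It relies on the OFFSET ... ROWS FETCH NEXT ... ROWS ONLY syntax (Oracle 12c and later, SQL Server 2012 and later), and SQL Server will reject it if the user's query carries its own ORDER BY.

```csharp
using System;

public static class PagedQueryBuilder
{
    // Hypothetical helper: wraps a user-supplied SELECT in a paging query.
    // orderByColumns must name columns that give a stable, unique ordering;
    // without that, rows can be skipped or repeated between pages.
    // It is concatenated into the SQL, so it must come from trusted configuration.
    public static string BuildPagedQuery(string userSql, string orderByColumns,
        string offsetParameter = "@offset", string pageSizeParameter = "@pageSize")
    {
        if (string.IsNullOrWhiteSpace(userSql))
            throw new ArgumentException("A query is required.", nameof(userSql));
        if (string.IsNullOrWhiteSpace(orderByColumns))
            throw new ArgumentException("An ordering is required.", nameof(orderByColumns));

        // A trailing semicolon would break the derived table, so strip it.
        var inner = userSql.Trim().TrimEnd(';');

        return "select * from (" + inner + ") q" +
               " order by " + orderByColumns +
               " offset " + offsetParameter + " rows" +
               " fetch next " + pageSizeParameter + " rows only";
    }
}
```

The parameter names are configurable because Oracle providers usually expect bind variables written as :offset and :pageSize rather than @offset and @pageSize.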

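You will have to make your process more robust in the face of actual connection problems, but that is fairly straightforward.

You also want to keep your connections open for as short a time as possible. The reason is that most databases support connection pooling, so connections are cheap and easy to create if you let the pooler recycle them. Also, the reason DBAs put governors on their queries is that developers are always doing things like cmd.CommandTimeout = 0;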

using System;
using System.Data;
using System.Data.Common;

public class LongRunningTask
{
  const long pageSize = 100000L; //--> ...or whatever the market will bear
  const int retryLimit = 3;
  public void Start( ConnectionFactory factory, string sql )
  {
    //--> sql is expected to end with offset/fetch clauses that use @offset and @pageSize
    var done = false;
    var page = 0L;   //--> number of pages fully processed so far
    var index = 0L;  //--> rows processed within the current page
    var retries = 0;
    var retrying = false;
    while ( !done )
    {
      try
      {
        using ( var connection = factory.CreateConnection( ) )
        {
          using ( var cmd = connection.CreateCommand( ) )
          {
            cmd.CommandType = CommandType.Text;
            cmd.CommandText = sql;
            cmd.Parameters.Add( factory.CreateParameter( "@pageSize", SqlDbType.BigInt ) );
            cmd.Parameters.Add( factory.CreateParameter( "@offset", SqlDbType.BigInt ) );
            //--> on a retry, skip the rows of this page that were already processed
            cmd.Parameters[ "@pageSize" ].Value = pageSize - ( retrying ? index : 0 );
            //--> the offset is counted in rows, so convert the page number to a row count
            cmd.Parameters[ "@offset" ].Value = page * pageSize + ( retrying ? index : 0 );
            connection.Open( );
            using ( var dr = cmd.ExecuteReader( ) )
            {
              index = retrying ? index : 0;
              retrying = false;
              done = !dr.HasRows; //--> didn't get anything, we're done!
              while ( dr.Read( ) )
              {
                //read 1 by 1 record and pass it to algorithm to do some complex processing
                index++;
              }
            }
          }
        }
        page++;
      }
      catch ( Exception ex )
      {
        Console.WriteLine( ex );
        if ( retryLimit < retries++ ) throw;
        retrying = true;
      }
    }
  }
}

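public class ConnectionFactory
{
  public DbConnection CreateConnection( )
  {
    //--> placeholder: return a DbConnection for your provider
    throw new NotImplementedException( );
  }
  public DbParameter CreateParameter( string parameterName, SqlDbType type, int length = 0 )
  {
    //--> placeholder: return a DbParameter for your provider
    throw new NotImplementedException( );
  }
}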

Answer 7

You can set the timeout limit on the command like this:

command.CommandTimeout = 60; //The time in seconds to wait for the command to execute. The default is 30 seconds.

Answer 8

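I was reading this post while trying to find an answer to another question. You have an interesting problem here.

If you are not displaying the data you retrieve and are only interested in some end result, the best approach is to write Oracle server-side PL/SQL and implement your logic in a PL/SQL package.

I have previously written server-side PL/SQL code that contained complex business logic and produced different levels of summary information. The client program then only read that generated set.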

8 answers

Answer 1
4 2019-03-11 23:55:59

Answer 2
3 2019-03-13 14:36:31

Answer 3
1 2019-03-12 18:24:34

Answer 4
1 2019-03-13 10:42:04

Answer 5
1 2019-03-16 19:18:58

Answer 6
1 2019-03-17 15:35:38

Answer 7
0 2019-03-07 12:02:34

Answer 8
0 2021-03-17 03:37:40
