How to chunk an IEnumerable<T>, without losing/discarding items in case of failure?

Question

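I have a producer-consumer scenario where the producer is an enumerable sequence of items (IEnumerable<Item>). I want to process these items in chunks/batches of 10 items each. So I decided to use the new (.NET 6) Chunk LINQ operator, as suggested in this question: Create batches in LINQ.

My problem is that sometimes the producer fails, and in that case the consumer of the chunked sequence receives the error without first receiving a chunk with the last items that were produced before the error. For example, if the producer generates 15 items and then fails, the consumer gets a chunk with items 1-10 and then gets an exception. Items 11-15 are lost! Here is a minimal example that demonstrates this undesirable behavior: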

static IEnumerable<int> Produce()
{
    int i = 0;
    while (true)
    {
        i++;
        Console.WriteLine($"Producing #{i}");
        yield return i;
        if (i == 15) throw new Exception("Oops!");
    }
}

// Consume
foreach (int[] chunk in Produce().Chunk(10))
{
    Console.WriteLine($"Consumed: [{String.Join(", ", chunk)}]");
}

Output:

Producing #1
Producing #2
Producing #3
Producing #4
Producing #5
Producing #6
Producing #7
Producing #8
Producing #9
Producing #10
Consumed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Producing #11
Producing #12
Producing #13
Producing #14
Producing #15
Unhandled exception. System.Exception: Oops!
   at Program.<Main>g__Produce|0_0()+MoveNext()
   at System.Linq.Enumerable.ChunkIterator[TSource](IEnumerable`1 source, Int32 size)+MoveNext()
   at Program.Main()

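Online demo.

The desirable behavior would be to get a chunk with the values [11, 12, 13, 14, 15] before getting the exception.

My question is: is there any way to configure the Chunk operator so that it prioritizes emitting data over exceptions? If not, how can I implement a custom LINQ operator, named for example ChunkNonDestructive, with the desirable behavior?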

public static IEnumerable<TSource[]> ChunkNonDestructive<TSource>(
    this IEnumerable<TSource> source, int size);

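Note: Besides the Chunk operator from System.Linq, I also experimented with the Buffer operator from the System.Interactive package, and the Batch operator from the MoreLinq package. Apparently they all behave the same way (destructively).

Update: Here is the desirable output of the above example: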

Producing #1
Producing #2
Producing #3
Producing #4
Producing #5
Producing #6
Producing #7
Producing #8
Producing #9
Producing #10
Consumed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Producing #11
Producing #12
Producing #13
Producing #14
Producing #15
Consumed: [11, 12, 13, 14, 15]
Unhandled exception. System.Exception: Oops!
   at Program.<Main>g__Produce|0_0()+MoveNext()
   at System.Linq.Enumerable.ChunkIterator[TSource](IEnumerable`1 source, Int32 size)+MoveNext()
   at Program.Main()

The difference is the line Consumed: [11, 12, 13, 14, 15], which is not present in the actual output.

Answer 1

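First off, a matter of semantics. There is nothing destructive in Chunk or Buffer or anything else; it just reads items from a source enumerable until the source ends or throws an exception. The only destructive thing in your code is you throwing exceptions, which behaves as expected (that is, it unwinds the stack out of your generator, out of the Linq functions, and into a catch in your code, if one exists).

It should also be immediately obvious that every Linq function behaves the same way with regard to exceptions. That is simply how exceptions work, and working around them to support your use case is relatively expensive: you need to guard against an exception for every item you generate. In my opinion this is a very bad design, and if you worked for me and did that, you would be fired on the spot.

With all that out of the way, writing a BadDesignChunk like this is trivial (if expensive). Note that it swallows the producer's exception rather than rethrowing it after the last chunk: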

public static IEnumerable<IEnumerable<TSource>> BadDesignChunk<TSource>(this IEnumerable<TSource> source, int size)
{
    var chunk = new List<TSource>();
    using var enumerator = source.GetEnumerator();
    
    while(true)
    {
        while(chunk.Count < size)
        {
            try
            {
                if(!enumerator.MoveNext())
                {
                    // end of the stream, send what we have and finish
                    goto end;
                }
            }
            catch
            {
                // exception: swallow it, send what we have and finish
                goto end;
            }
            
            chunk.Add(enumerator.Current);
        }
        
        // chunk full, send it
        yield return chunk;
        // start a fresh list instead of clearing the old one,
        // so chunks the caller has kept a reference to are not mutated
        chunk = new List<TSource>();
    }
    
    end:
    if(chunk.Count > 0)
        yield return chunk;
}

See it in action here.

Answer 2

If you preprocess your source so that it stops when it encounters an exception, then you can use Chunk() as-is.

public static class Extensions
{
    public static IEnumerable<T> UntilFirstException<T>(this IEnumerable<T> source, Action<Exception>? exceptionCallback = null)
    {
        // dispose the source enumerator when iteration ends or is abandoned
        using var enumerator = source.GetEnumerator();
        while(true)
        {
            try
            {
                if(!enumerator.MoveNext())
                {
                    break;
                }
            } catch (Exception e) {
                // report the failure to the caller and end the sequence normally
                exceptionCallback?.Invoke(e);
                break;
            }
            yield return enumerator.Current;
        }
    }
}

    Exception? e = null;
    foreach (int[] chunk in Produce().UntilFirstException(thrown => e = thrown).Chunk(10))
    {
        Console.WriteLine($"Consumed: [{String.Join(", ", chunk)}]");
    }

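I feel like this keeps the responsibilities nicely separated. If you want a helper that throws an exception instead of having to capture it yourself, you can use this as a component to simplify writing that helper: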

    public static IEnumerable<T[]> ChunkUntilFirstException<T>(this IEnumerable<T> source, int size)
    {
        Exception? e = null;
        var result = source.UntilFirstException(thrown => e = thrown).Chunk(size);
        foreach (var element in result)
        {
            yield return element;
        }
        if (e != null)
        {
            throw new InvalidOperationException("source threw an exception", e);
        }
    }

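Note that this will throw a different exception than the one emitted by the producer. This lets you keep the stack trace associated with the original exception, whereas throw e would overwrite that stack trace.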
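The stack-trace point above can be checked with a small experiment. The following is a minimal sketch, not part of the original answer: StackTraceDemo, OriginalFailure, Capture, and TraceAfter are hypothetical names introduced for illustration. It captures an exception, rethrows it later in three different ways, and reports whether the resulting trace still mentions the method that originally failed. The third variant, ExceptionDispatchInfo, is the mechanism that Answer 4 relies on.

```csharp
#nullable enable
using System;
using System.Runtime.CompilerServices;
using System.Runtime.ExceptionServices;

public static class StackTraceDemo
{
    // Hypothetical failing method; NoInlining keeps its frame visible in the trace.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void OriginalFailure() => throw new Exception("Oops!");

    // Catches the failure now so that it can be rethrown later.
    static Exception Capture()
    {
        try { OriginalFailure(); }
        catch (Exception ex) { return ex; }
        throw new InvalidOperationException("OriginalFailure did not throw");
    }

    // Rethrows a captured exception with the given strategy and returns
    // the stack trace of the original exception object afterwards.
    public static string TraceAfter(Action<Exception> rethrow)
    {
        Exception captured = Capture();
        try { rethrow(captured); }
        catch (Exception ex) { return (ex.InnerException ?? ex).StackTrace ?? ""; }
        return "";
    }

    public static void Main()
    {
        string name = nameof(OriginalFailure);
        // Throwing the same object again resets its trace to the new throw site.
        Console.WriteLine(TraceAfter(e => throw e).Contains(name));
        // Wrapping keeps the original trace on InnerException.
        Console.WriteLine(TraceAfter(e =>
            throw new InvalidOperationException("source threw an exception", e)).Contains(name));
        // ExceptionDispatchInfo rethrows the same object with its trace preserved.
        Console.WriteLine(TraceAfter(e => ExceptionDispatchInfo.Capture(e).Throw()).Contains(name));
    }
}
```

The wrapping variant changes the exception type that callers have to catch, while ExceptionDispatchInfo keeps the original type, so the choice depends on which of the two matters more to the consumer.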

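You can tweak this according to your needs. If you need to catch a specific type of exception that you expect your producer to emit, it is easy enough to use the when contextual keyword with some pattern matching.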

    try
    {
        foreach (int[] chunk in Produce().ChunkUntilFirstException(10))
        {
            Console.WriteLine($"Consumed: [{String.Join(", ", chunk)}]");
        }
    }
    catch (InvalidOperationException e) when (e.InnerException is {Message: "Oops!"})
    {
        Console.WriteLine(e.InnerException.ToString());
    }

Answer 3

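I don't see how this could be done in a sensible way (without ignoring the exception).

The way I read your post, I believe you are asking for the impossible: preserve an exception but delay it until some later time, and still guarantee that the exception will happen. If you are OK with dropping the producer's exception from time to time, I'd say you should be OK with ignoring it all the time, and a simple wrapper that ignores the exception would be enough.

Here is how the exception can be dropped in a hypothetical implementation. To get a chunk you need to enumerate the items and save the exception raised after the last item. At that point there are two places where the saved exception could be thrown: while iterating a chunk (not guaranteed to happen if something like First is used, or if another exception is thrown while processing the existing items), or when getting the next chunk (not guaranteed to happen, for the same reasons). If you somehow guarantee (by code reviews and tests?) that all sequences are iterated to the end, you will still have a problem with the order of exceptions: the first exception (from the producer) will surface after any exceptions raised while processing the chunks.

One might ask: "what about just reading the items directly while iterating a chunk?", which would throw the exception at the appropriate time, when you are about to get the failing item. That is not possible with the regular "chunk" API. The API gives you multiple "pointers" into the source sequence that could be advanced at the same time, so the only option is to materialize each chunk (thus triggering the exception), so that there is only one live "pointer" into the source sequence (pointing just before the first element of the next chunk to read).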
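The first failure mode described above, a consumer that stops early and therefore never sees the deferred exception, can be sketched as follows. This is a minimal illustration under stated assumptions, not code from any of the answers: Produce, ChunkDeferred, and ObservesFailure are hypothetical names, and ChunkDeferred is a simplified stand-in for any chunker that saves the producer's exception and rethrows it (wrapped) after the last chunk.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public static class DroppedExceptionDemo
{
    // Hypothetical producer: yields 1..5, then fails.
    static IEnumerable<int> Produce()
    {
        for (int i = 1; i <= 5; i++) yield return i;
        throw new Exception("Oops!");
    }

    // Hypothetical deferring chunker: emits the buffered items first,
    // then rethrows the saved producer error after the last chunk.
    static IEnumerable<int[]> ChunkDeferred(IEnumerable<int> source, int size)
    {
        var buffer = new List<int>();
        Exception? saved = null;
        using var e = source.GetEnumerator();
        while (true)
        {
            bool moved;
            try { moved = e.MoveNext(); }
            catch (Exception ex) { saved = ex; moved = false; }
            if (moved) buffer.Add(e.Current);
            if (buffer.Count == size || (!moved && buffer.Count > 0))
            {
                yield return buffer.ToArray();
                buffer.Clear();
            }
            if (!moved) break;
        }
        if (saved != null) throw new InvalidOperationException("source failed", saved);
    }

    // Returns true when the consumer gets to observe the producer's failure.
    public static bool ObservesFailure(Func<IEnumerable<int[]>, object> consume)
    {
        try { consume(ChunkDeferred(Produce(), 10)); return false; }
        catch (InvalidOperationException) { return true; }
    }

    public static void Main()
    {
        // Stops after the first chunk: the saved exception is never thrown. Prints False.
        Console.WriteLine(ObservesFailure(chunks => chunks.First()));
        // Enumerates to the end: the saved exception surfaces after the data. Prints True.
        Console.WriteLine(ObservesFailure(chunks => chunks.ToList()));
    }
}
```

Whether this trade-off is acceptable depends on whether every consumer can be relied on to enumerate the chunked sequence to its end.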

Answer 4

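I was inspired by StriplingWarrior's answer, which is based on an idea that I didn't understand at first. Based on that idea, I wrote a generic method that robustifies all kinds of LINQ operators, or combinations of operators, according to the following rule:

In case the source sequence fails, the error is propagated after yielding all the elements of the output sequence.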

// Requires: using System.Runtime.ExceptionServices;
private static IEnumerable<TResult> DeferSourceException<TSource, TResult>(
    IEnumerable<TSource> source,
    Func<IEnumerable<TSource>, IEnumerable<TResult>> conversion)
{
    ExceptionDispatchInfo edi = null;
    IEnumerable<TSource> SourceIterator()
    {
        using var enumerator = source.GetEnumerator();
        while (true)
        {
            TSource item;
            try
            {
                if (!enumerator.MoveNext()) break;
                item = enumerator.Current;
            }
            catch (Exception ex)
            {
                edi = ExceptionDispatchInfo.Capture(ex);
                break;
            }
            yield return item;
        }
    }
    IEnumerable<TResult> output = conversion(SourceIterator());
    foreach (TResult result in output) yield return result;
    edi?.Throw();
}

Then I used the DeferSourceException method to implement the ChunkNonDestructive operator like this:

/// <summary>
/// Splits the elements of a sequence into chunks of the specified size.
/// In case the sequence fails and there are buffered elements, a last chunk
/// that contains these elements is emitted before propagating the error.
/// </summary>
public static IEnumerable<TSource[]> ChunkNonDestructive<TSource>(
    this IEnumerable<TSource> source, int size)
{
    ArgumentNullException.ThrowIfNull(source);
    if (size < 1) throw new ArgumentOutOfRangeException(nameof(size));
    return DeferSourceException(source, s => s.Chunk(size));
}

Online example.

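It might not be obvious from reading the code, but every enumerator of the DeferSourceException sequence gets its own edi state. There is no danger that two enumerators created side by side will interfere with each other.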
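The claim about per-enumerator state can be checked by interleaving two enumerators of the same sequence. The sketch below is an illustration under stated assumptions: it repeats the DeferSourceException method so that it compiles on its own, while IndependentStateDemo, Produce, Step, and Run are hypothetical names introduced here. It assumes .NET 6 or later for the Chunk operator.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.ExceptionServices;

public static class IndependentStateDemo
{
    // Hypothetical producer: yields 1..15, then fails.
    static IEnumerable<int> Produce()
    {
        for (int i = 1; i <= 15; i++) yield return i;
        throw new Exception("Oops!");
    }

    // Same logic as the DeferSourceException method shown earlier.
    static IEnumerable<TResult> DeferSourceException<TSource, TResult>(
        IEnumerable<TSource> source,
        Func<IEnumerable<TSource>, IEnumerable<TResult>> conversion)
    {
        ExceptionDispatchInfo? edi = null; // local to each enumeration
        IEnumerable<TSource> SourceIterator()
        {
            using var enumerator = source.GetEnumerator();
            while (true)
            {
                TSource item;
                try
                {
                    if (!enumerator.MoveNext()) break;
                    item = enumerator.Current;
                }
                catch (Exception ex)
                {
                    edi = ExceptionDispatchInfo.Capture(ex);
                    break;
                }
                yield return item;
            }
        }
        foreach (TResult result in conversion(SourceIterator())) yield return result;
        edi?.Throw();
    }

    // Advances one enumerator by a single step and describes what happened.
    static string Step(IEnumerator<int[]> e, string name)
    {
        try
        {
            return e.MoveNext()
                ? $"{name}: [{string.Join(", ", e.Current)}]"
                : $"{name}: end";
        }
        catch (Exception ex)
        {
            return $"{name}: threw {ex.Message}";
        }
    }

    public static List<string> Run()
    {
        IEnumerable<int[]> chunks = DeferSourceException(Produce(), s => s.Chunk(10));
        using IEnumerator<int[]> a = chunks.GetEnumerator();
        using IEnumerator<int[]> b = chunks.GetEnumerator();
        return new List<string>
        {
            Step(a, "a"), // first chunk for a
            Step(b, "b"), // first chunk for b
            Step(a, "a"), // a receives the partial chunk 11..15
            Step(a, "a"), // a observes its own deferred exception
            Step(b, "b"), // b still receives its partial chunk
            Step(b, "b"), // and only then its own deferred exception
        };
    }

    public static void Main()
    {
        foreach (string line in Run()) Console.WriteLine(line);
    }
}
```

Because DeferSourceException is itself an iterator method, its body, including the edi local captured by SourceIterator, starts afresh for each enumerator, which is why a failing does not affect b.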

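The idea that LINQ operators need robustifying might sound strange, or even arrogant. Please note that the context of this answer is very specific: producer-consumer scenarios. In these scenarios multiple producers and consumers might be running in parallel, occasional exceptions are to be expected, resilience mechanisms are in place, and losing messages here and there because of an error is something to be avoided.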
