简体   繁体   中英

Grouping consecutive identical items: IEnumerable<T> to IEnumerable<IEnumerable<T>>

I've got an interresting problem: Given an IEnumerable<string> , is it possible to yield a sequence of IEnumerable<IEnumerable<string>> that groups identical adjacent strings in one pass?

Let me explain.

1. Basic illustrative sample :

Considering the following IEnumerable<string> (pseudo representation):

{"a","b","b","b","c","c","d"}

How to get an IEnumerable<IEnumerable<string>> that would yield something of the form:

{ // IEnumerable<IEnumerable<string>>
    {"a"},         // IEnumerable<string>
    {"b","b","b"}, // IEnumerable<string>
    {"c","c"},     // IEnumerable<string>
    {"d"}          // IEnumerable<string>
}

The method prototype would be:

public IEnumerable<IEnumerable<string>> Group(IEnumerable<string> items)
{
    // todo
}

But it could also be :

public void Group(IEnumerable<string> items, Action<IEnumerable<string>> action)
{
    // todo
}

...where action would be called for each subsequence.

2. More complicated sample

Ok, the first sample is very simple, and only aims to make the high level intent clear.

Now imagine we are dealing with IEnumerable<Anything> , where Anything is a type defined like this:

public class Anything
{
    public string Key {get;set;}
    public double Value {get;set;}
}

We now want to generate the subsequences based on the Key, (group every consecutive Anything that have the same key) to later use them in order to calculate the total value by group:

public void Compute(IEnumerable<Anything> items)
{
    Console.WriteLine(items.Sum(i=>i.Value));
}

// then somewhere, assuming the Group method 
// that returns an IEnumerable<IEnumerable<Anything>> actually exists:
foreach(var subsequence in Group(allItems))
{
    Compute(subsequence);
}

3. Important notes

  • Only one iteration over the original sequence
  • No intermediary collections allocations (we can assume millions of items in the original sequence, and millions consecutives items in each group)
  • Keeping enumerators and defered execution behavior
  • We can assume that resulting subsequences will be iterated only once, and will be iterated in order.

Is it possible, and how would you write it?

Is this what you are looking for?

  • Iterate list only once.
  • Defer execution.
  • No intermediate collections (my other post failed on this criterion).

This solution relies on object state because it's difficult to share state between two IEnumerable methods that use yield (no ref or out params).

internal class Program
{
    static void Main(string[] args)
    {
        var result = new[] { "a", "b", "b", "b", "c", "c", "d" }.Partition();
        foreach (var r in result)
        {
            Console.WriteLine("Group".PadRight(16, '='));
            foreach (var s in r)
                Console.WriteLine(s);
        }
    }
}

internal static class PartitionExtension
{
    public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> src)
    {
        var grouper = new DuplicateGrouper<T>();
        return grouper.GroupByDuplicate(src);
    }
}

internal class DuplicateGrouper<T>
{
    T CurrentKey;
    IEnumerator<T> Itr;
    bool More;

    public IEnumerable<IEnumerable<T>> GroupByDuplicate(IEnumerable<T> src)
    {
        using(Itr = src.GetEnumerator())
        {
            More = Itr.MoveNext();

            while (More)
                yield return GetDuplicates();
        }
    }

    IEnumerable<T> GetDuplicates()
    {
        CurrentKey = Itr.Current;
        while (More && CurrentKey.Equals(Itr.Current))
        {
            yield return Itr.Current;
            More = Itr.MoveNext();
        }
    }
}

Edit: Added extension method for cleaner usage. Fixed loop test logic so that "More" is evaluated first.

Edit: Dispose the enumerator when finished

Way Better Solution That Meets All Requirements

OK, scrap my previous solution (I'll leave it below, just for reference). Here's a much better approach that occurred to me after making my initial post.

Write a new class that implements IEnumerator<T> and provides a few additional properties: IsValid and Previous . This is all you really need to resolve the whole mess with having to maintain state inside an iterator block using yield .

Here's how I did it (pretty trivial, as you can see):

internal class ChipmunkEnumerator<T> : IEnumerator<T> {

    private readonly IEnumerator<T> _internal;
    private T _previous;
    private bool _isValid;

    public ChipmunkEnumerator(IEnumerator<T> e) {
        _internal = e;
        _isValid = false;
    }

    public bool IsValid {
        get { return _isValid; }
    }

    public T Previous {
        get { return _previous; }
    }

    public T Current {
        get { return _internal.Current; }
    }

    public bool MoveNext() {
        if (_isValid)
            _previous = _internal.Current;

        return (_isValid = _internal.MoveNext());
    }

    public void Dispose() {
        _internal.Dispose();
    }

    #region Explicit Interface Members

    object System.Collections.IEnumerator.Current {
        get { return Current; }
    }

    void System.Collections.IEnumerator.Reset() {
        _internal.Reset();
        _previous = default(T);
        _isValid = false;
    }

    #endregion

}

(I called this a ChipmunkEnumerator because maintaining the previous value reminded me of how chipmunks have pouches in their cheeks where they keep nuts. Does it really matter? Stop making fun of me.)

Now, utilizing this class in an extension method to provide exactly the behavior you want isn't so tough!

Notice that below I've defined GroupConsecutive to actually return an IEnumerable<IGrouping<TKey, T>> for the simple reason that, if these are grouped by key anyway, it makes sense to return an IGrouping<TKey, T> rather than just an IEnumerable<T> . As it turns out, this will help us out later anyway...

public static IEnumerable<IGrouping<TKey, T>> GroupConsecutive<T, TKey>(this IEnumerable<T> source, Func<T, TKey> keySelector)
    where TKey : IEquatable<TKey> {

    using (var e = new ChipmunkEnumerator<T>(source.GetEnumerator())) {
        if (!e.MoveNext())
            yield break;

        while (e.IsValid) {
            yield return e.GetNextDuplicateGroup(keySelector);
        }
    }
}

public static IEnumerable<IGrouping<T, T>> GroupConsecutive<T>(this IEnumerable<T> source)
    where T : IEquatable<T> {

    return source.GroupConsecutive(x => x);
}

private static IGrouping<TKey, T> GetNextDuplicateGroup<T, TKey>(this ChipmunkEnumerator<T> e, Func<T, TKey> keySelector)
    where TKey : IEquatable<TKey> {

    return new Grouping<TKey, T>(keySelector(e.Current), e.EnumerateNextDuplicateGroup(keySelector));
}

private static IEnumerable<T> EnumerateNextDuplicateGroup<T, TKey>(this ChipmunkEnumerator<T> e, Func<T, TKey> keySelector)
    where TKey : IEquatable<TKey> {

    do {
        yield return e.Current;

    } while (e.MoveNext() && keySelector(e.Previous).Equals(keySelector(e.Current)));
}

(To implement these methods, I wrote a simple Grouping<TKey, T> class that implements IGrouping<TKey, T> in the most straightforward way possible. I've omitted the code just so as to keep moving along...)

OK, check it out. I think the code example below pretty well captures something resembling the more realistic scenario you described in your updated question.

var entries = new List<KeyValuePair<string, int>> {
    new KeyValuePair<string, int>( "Dan", 10 ),
    new KeyValuePair<string, int>( "Bill", 12 ),
    new KeyValuePair<string, int>( "Dan", 14 ),
    new KeyValuePair<string, int>( "Dan", 20 ),
    new KeyValuePair<string, int>( "John", 1 ),
    new KeyValuePair<string, int>( "John", 2 ),
    new KeyValuePair<string, int>( "Bill", 5 )
};

var dupeGroups = entries
    .GroupConsecutive(entry => entry.Key);

foreach (var dupeGroup in dupeGroups) {
    Console.WriteLine(
        "Key: {0} Sum: {1}",
        dupeGroup.Key.PadRight(5),
        dupeGroup.Select(entry => entry.Value).Sum()
    );
}

Output:

Key: Dan   Sum: 10
Key: Bill  Sum: 12
Key: Dan   Sum: 34
Key: John  Sum: 3
Key: Bill  Sum: 5

Notice this also fixes the problem with my original answer of dealing with IEnumerator<T> objects that were value types. (With this approach, it doesn't matter.)

There's still going to be a problem if you try calling ToList here, as you will find out if you try it. But considering you included deferred execution as a requirement , I doubt you would be doing that anyway. For a foreach , it works.


Original, Messy, and Somewhat Stupid Solution

Something tells me I'm going to get totally refuted for saying this, but...

Yes , it is possible (I think). See below for a damn messy solution I threw together. (Catches an exception to know when it's finished, so you know it's a great design!)

Now, Jon's point about there being a very real problem in the event that you try to do, for instance, ToList , and then access the values in the resulting list by index, is totally valid. But if your only intention here is to be able to loop over an IEnumerable<T> using a foreach -- and you're only doing this in your own code -- then, well, I think this could work for you.

Anyway, here's a quick example of how it works:

var ints = new int[] { 1, 3, 3, 4, 4, 4, 5, 2, 3, 1, 6, 6, 6, 5, 7, 7, 8 };

var dupeGroups = ints.GroupConsecutiveDuplicates(EqualityComparer<int>.Default);

foreach (var dupeGroup in dupeGroups) {
    Console.WriteLine(
        "New dupe group: " +
        string.Join(", ", dupeGroup.Select(i => i.ToString()).ToArray())
    );
}

Output:

New dupe group: 1
New dupe group: 3, 3
New dupe group: 4, 4, 4
New dupe group: 5
New dupe group: 2
New dupe group: 3
New dupe group: 1
New dupe group: 6, 6, 6
New dupe group: 5
New dupe group: 7, 7
New dupe group: 8

And now for the (messy as crap) code:

Note that since this approach requires passing the actual enumerator around between a few different methods, it will not work if that enumerator is a value type, as calls to MoveNext in one method are only affecting a local copy.

public static IEnumerable<IEnumerable<T>> GroupConsecutiveDuplicates<T>(this IEnumerable<T> source, IEqualityComparer<T> comparer) {
    using (var e = source.GetEnumerator()) {
        if (e.GetType().IsValueType)
            throw new ArgumentException(
                "This method will not work on a value type enumerator."
            );

        // get the ball rolling
        if (!e.MoveNext()) {
            yield break;
        }

        IEnumerable<T> nextDuplicateGroup;

        while (e.FindMoreDuplicates(comparer, out nextDuplicateGroup)) {
            yield return nextDuplicateGroup;
        }
    }
}

private static bool FindMoreDuplicates<T>(this IEnumerator<T> enumerator, IEqualityComparer<T> comparer, out IEnumerable<T> duplicates) {
    duplicates = enumerator.GetMoreDuplicates(comparer);

    return duplicates != null;
}

private static IEnumerable<T> GetMoreDuplicates<T>(this IEnumerator<T> enumerator, IEqualityComparer<T> comparer) {
    try {
        if (enumerator.Current != null)
            return enumerator.GetMoreDuplicatesInner(comparer);
        else
            return null;

    } catch (InvalidOperationException) {
        return null;
    }
}

private static IEnumerable<T> GetMoreDuplicatesInner<T>(this IEnumerator<T> enumerator, IEqualityComparer<T> comparer) {
    while (enumerator.Current != null) {
        var current = enumerator.Current;
        yield return current;

        if (!enumerator.MoveNext())
            break;

        if (!comparer.Equals(current, enumerator.Current))
            break;
    }
}

Your second bullet is the problematic one. Here's why:

var groups = CallMagicGetGroupsMethod().ToList();
foreach (string x in groups[3])
{
    ...
}
foreach (string x in groups[0])
{
    ...
}

Here, it's trying to iterate over the fourth group and then the first group... that's clearly only going to work if all the groups are buffered or it can reread the sequence, neither of which is ideal.

I suspect you want a more "reactive" approach - I don't know offhand whether Reactive Extensions does what you want (the "consecutive" requirement is unusual) but you should basically provide some sort of action to be executed on each group... that way the method won't need to worry about having to return you something which could be used later on, after it's already finished reading.

Let me know if you'd like me to try to find a solution within Rx, or whether you would be happy with something like:

void GroupConsecutive(IEnumerable<string> items,
                      Action<IEnumerable<string>> action)

Here's a solution that I think satisfies your requirements, works with any type of data item, and is quite short and readable:

public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> list)
{
    var current = list.FirstOrDefault();

    while (!Equals(current, default(T))) {
        var cur = current;
        Func<T, bool> equalsCurrent = item => item.Equals(cur);
        yield return list.TakeWhile(equalsCurrent);
        list = list.SkipWhile(equalsCurrent);
        current = list.FirstOrDefault();
    }
}

Notes:

  1. Deferred execution is there (both TakeWhile and SkipWhile do it).
  2. I think this iterates over the entire collection only once (with SkipWhile ); it does iterate over the collection once more when you process the returned IEnumerables, but the partitioning itself iterates only once.
  3. If you don't care about value types, you can add a constraint and change the while condition to a test for null .

If I am somehow mistaken, I 'd be especially interested in comments pointing out the mistakes!

Very Important Aside:

This solution will not allow you to enumerate the produced enumerables in any order other than the one it provides them in. However, I think the original poster has been pretty clear in comments that this is not a problem.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM