
Analog of Python groupby in Haskell

In Python there is a groupby function (in itertools).

Its type can be expressed in Haskell like this: groupby :: (a -> b) -> [a] -> [(b, [a])]. Because it needs the data to be sorted by key, we can think of its overall running time as O(n*log(n)).
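To illustrate the sorting requirement, here is a minimal sketch with itertools.groupby. The words list and the first_letter helper are made up for the example; only groupby and sorted come from the standard library.

```python
from itertools import groupby

# Hypothetical sample data and key function, for illustration only.
words = ["apple", "bob", "avocado", "banana"]


def first_letter(word):
    return word[0]


# Without sorting, only adjacent items with equal keys are grouped,
# so the key "a" shows up in two separate groups.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, key=first_letter)]

# Sorting by the same key first brings equal keys together,
# which is where the O(n*log(n)) cost comes from.
sorted_groups = [
    (k, list(g))
    for k, g in groupby(sorted(words, key=first_letter), key=first_letter)
]
```

Here unsorted_groups contains four groups, while sorted_groups contains one group per distinct key.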

I was probably not the only one dissatisfied with this, so I found this library. This implementation of groupby needs two passes over the input sequence, so I think its running time is O(n). But, as the docs say, it isn't really lazy: if you don't pass keys to it, it has to make a pass over the sequence to collect all the unique keys from the items.

So I thought, citing Raymond Hettinger:

There must be a better way!

So I wrote this

from collections import defaultdict, deque


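def groupby(sequence, key=lambda x: x):
    buffers = defaultdict(deque)  # key -> items read but not yet consumed
    kvs = ((key(item), item) for item in sequence)
    seen_keys = set()  # keys whose group has already been yielded

    def subseq(k):
        # Lazily yield the items of group k, pulling from the shared
        # source only when the buffer for k is empty.
        while True:
            buffered = buffers[k]
            if buffered:
                yield buffered.popleft()
            else:
                try:
                    next_key, value = next(kvs)
                except StopIteration:
                    # Source exhausted. Under PEP 479 a StopIteration must
                    # not escape a generator, so end it explicitly.
                    return
                buffers[next_key].append(value)

    while True:
        try:
            k, value = next(kvs)
        except StopIteration:
            # Emit groups whose keys were first seen while a consumer
            # was draining another group.
            for bk, group in buffers.items():
                if group and bk not in seen_keys:
                    yield (bk, group)
            return  # not `raise StopIteration()`, which is an error since Python 3.7
        else:
            buffers[k].append(value)
        if k not in seen_keys:
            seen_keys.add(k)
            yield k, subseq(k)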

In case you aren't familiar with Python, the idea is very simple. Create a mutable dictionary mapping each key to a queue of elements. Try to take the next element of the sequence and compute its key. If the sequence isn't exhausted, add the value to the queue for its key. If we haven't seen this key before, yield a pair (key, iterable group); the latter takes values either from the buffer or from the sequence. If we have already seen this key, do nothing more and loop.

If the sequence has ended, all of its elements have already been put into the buffers (and possibly consumed). If any buffers are non-empty, we iterate over them and yield the remaining (key, iterable) pairs for keys that have not been yielded yet.

I've already unit tested it and it works. It's truly lazy (meaning it doesn't take any value from the sequence until the consumer asks for it), and its running time should be O(n).

I've tried to find a Haskell analog of this function and haven't found any.

Is it possible to write such a thing in Haskell? If so, please show the solution; if not, please explain why.

If I understand this correctly, the type you want is

(a -> k) -> [a] -> [(k, [a])]

That is, given a key function and a list of items, group the items by the key.

In Haskell there is a library function groupBy (in Data.List) which does something similar. It groups adjacent items that satisfy a Boolean equality predicate into sublists, so to collect all items with the same key the list has to be sorted first. We can use it to do what you want:

import Data.List
import Data.Ord

-- The Ord constraint is needed because the keys are sorted and compared.
groupByKey :: Ord k => (a -> k) -> [a] -> [(k, [a])]
groupByKey keyF xs = map getResult groups
   where
      keyPairs = map (\v -> (keyF v, v)) xs
      groups = groupBy (\v1 v2 -> fst v1 == fst v2)
                  $ sortBy (comparing fst) keyPairs
      getResult grp = (fst $ head grp, map snd grp)

keyPairs is the list of (key, value) pairs, one for each element of the argument. groups first sorts this list into key order using sortBy and then groups the result into sublists that share the same key. Because of the sort, the key type must have an Ord instance. getResult converts a sublist into a pair containing the key (taken from the head element) and a list of the original values. We are safe to use head because groupBy never gives an empty sublist.
