Clojure: Chaining group-by :key with select-keys on remaining keys

Question

I'm trying to understand a workflow with Clojure maps that is simple in other languages.

It basically comes down to this: how can I chain these operations?

group-by :key on a vector of maps
select-keys on remaining maps without the previous key
group-by again (0..n times) and select-keys
count unique key instances at the end.

See also my previous question: Aggregate and Count in Maps

Example:

Given a vector of maps

(def DATA [{:a "X", :b "M", :c "K", :d 10}
           {:a "Y", :b "M", :c "K", :d 20}
           {:a "Y", :b "M", :c "F", :d 30}
           {:a "Y", :b "P", :c "G", :d 40}])

performing group-by

(defn get-tree-level-1 [] (group-by :a DATA))

yields a map grouped by the value of that particular key.

{"X" [{:a "X", :b "M", :c "K", :d 10}],
 "Y" [{:a "Y", :b "M", :c "K", :d 20}
      {:a "Y", :b "M", :c "F", :d 30}
      {:a "Y", :b "P", :c "G", :d 40}]}

So far, so good. But what if I want to build a tree-like structure out of the data? That means selecting some of the remaining keys and ignoring others, e.g. selecting :b and :c and ignoring :d, which would yield the next level:

(def DATA2   [{ :X [{:b "M", :c "K"}],
                :Y [{:b "M", :c "K"}
                    {:b "M", :c "F"}
                    {:b "P", :c "G"}]}])

And finally, counting all instances of the remaining keys (e.g. count the occurrences of each unique value of the :b key under the Y root):

(def DATA3   [{ :X [{:M  1}],
                :Y [{:M  2}
                    {:P  1}]}])

I tried doing a select-keys after the group-by , but the result was empty after the first step:

(defn get-proc-sums []
  (into {}
    (map
      (fn [ [k vs] ]
        [k (select-keys vs [:b :c])])
      (group-by :a DATA))))

Answer 1

Repeated application of group-by is the wrong tool: it doesn't compose with itself very well. Rather, go over your input maps and transform each of them, once, into a format that's useful to you (using for or map), and then reduce over that to build your tree structure. Here is a simple implementation:

(defn hierarchy [keyseq xs]
  (reduce (fn [m [ks x]]
            (update-in m ks conj x))
          {}
          (for [x xs]
            [(map x keyseq) (apply dissoc x keyseq)])))

user> (hierarchy [:a :b :c] '[{:a "X", :b "M", :c "K", :d 10}
                              {:a "Y", :b "M", :c "K", :d 20}
                              {:a "Y", :b "M", :c "F", :d 30}
                              {:a "Y", :b "P", :c "G", :d 40}])
{"Y" {"P" {"G" ({:d 40})},
      "M" {"F" ({:d 30}),
           "K" ({:d 20})}},
 "X" {"M" {"K" ({:d 10})}}}

This gives you the hierarchical format that you want, with a list of all maps with only the "leftover" keys. From this, you can count them, distinct them, remove the :d key, or whatever else you want, either by writing another function that processes this map, or by adjusting what happens in the reduce function, or the for comprehension, above.
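As one illustration of "adjusting what happens in the reduce function", here is a minimal sketch that counts records per key path instead of collecting the leftover maps. The name hierarchy-counts is hypothetical and not part of the answer above; the sample data is copied from the question.

```clojure
;; Sample data copied from the question.
(def data [{:a "X", :b "M", :c "K", :d 10}
           {:a "Y", :b "M", :c "K", :d 20}
           {:a "Y", :b "M", :c "F", :d 30}
           {:a "Y", :b "P", :c "G", :d 40}])

;; hierarchy-counts is a hypothetical name used only for this sketch.
(defn hierarchy-counts
  "Counts how many maps in xs share each path of values under keyseq."
  [keyseq xs]
  (reduce (fn [m x]
            ;; (map x keyseq) is the path of values, e.g. ("Y" "M").
            ;; (fnil inc 0) treats a missing counter as 0 before incrementing.
            (update-in m (map x keyseq) (fnil inc 0)))
          {}
          xs))

(hierarchy-counts [:a :b] data)
;; => {"X" {"M" 1}, "Y" {"M" 2, "P" 1}}
```

This assumes keyseq is non-empty and that no value at one level is also used as a prefix of a longer path, which holds when every record is walked with the same keyseq.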

Answer 2

The mistake is that you are calling select-keys on the whole collection of values, while you should call it on every item in that collection, e.g. using map:

(defn get-proc-sums []
  (into {}
        (map
         (fn [ [k vs] ]
           [k (map #(select-keys % [:b :c]) vs)])
         (group-by :a DATA))))

user> (get-proc-sums)
{"X" ({:b "M", :c "K"}), 
 "Y" ({:b "M", :c "K"} {:b "M", :c "F"} {:b "P", :c "G"})}

Here is what your code is doing. First, the grouping:

user> (group-by :a DATA)
{"X" [{:a "X", :b "M", :c "K", :d 10}], 
 "Y" [{:a "Y", :b "M", :c "K", :d 20} 
      {:a "Y", :b "M", :c "F", :d 30} 
      {:a "Y", :b "P", :c "G", :d 40}]}

Then you are processing every key-value pair (let's take the "Y" pair as an example):

user> (let [[k vals] ["Y" ((group-by :a DATA) "Y")]]
         [k vals])
["Y" [{:a "Y", :b "M", :c "K", :d 20} 
      {:a "Y", :b "M", :c "F", :d 30} 
      {:a "Y", :b "P", :c "G", :d 40}]]

So you end up calling select-keys on a vector of maps:

user> (select-keys [{:a "Y", :b "M", :c "K", :d 20} 
                    {:a "Y", :b "M", :c "F", :d 30} 
                    {:a "Y", :b "P", :c "G", :d 40}]
                   [:a :b])
{}

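This is logical, since the vector itself doesn't contain these keys (a vector is keyed by its integer indices). Mapping select-keys over the items works as expected: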

user> (map #(select-keys % [:a :b]) [{:a "Y", :b "M", :c "K", :d 20} 
                                     {:a "Y", :b "M", :c "F", :d 30} 
                                     {:a "Y", :b "P", :c "G", :d 40}])
({:a "Y", :b "M"} {:a "Y", :b "M"} {:a "Y", :b "P"})
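To see why the vector case returns an empty map rather than throwing, here is a small sketch with made-up values. It relies only on the fact that a Clojure vector is associative on its integer indices.

```clojure
;; A vector is associative on its integer indices, so index keys are found:
(select-keys ["a" "b" "c"] [0 2])
;; => {0 "a", 2 "c"}

;; Keyword keys are not indices of the vector, so nothing is selected:
(select-keys [{:a "Y", :b "M"}] [:a :b])
;; => {}

;; Mapping over the items applies select-keys to each map instead:
(mapv #(select-keys % [:b]) [{:a "Y", :b "M"} {:a "Y", :b "P"}])
;; => [{:b "M"} {:b "P"}]
```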

Update: to fulfill the whole task, I would propose the following:

(defn process-data [data]
  (->> data
       (group-by :a)
       (map (fn [[k vals]] [k (frequencies (map :b vals))]))
       (into {})))

user> (process-data DATA)
{"X" {"M" 1}, "Y" {"M" 2, "P" 1}}
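If the grouping key and the counted key should not be hard-coded, the same idea can be parameterized. This is only a sketch: count-within and its argument names are hypothetical, and the sample data is copied from the question.

```clojure
;; Sample data copied from the question.
(def data [{:a "X", :b "M", :c "K", :d 10}
           {:a "Y", :b "M", :c "K", :d 20}
           {:a "Y", :b "M", :c "F", :d 30}
           {:a "Y", :b "P", :c "G", :d 40}])

;; count-within is a hypothetical helper, not part of the answer above.
(defn count-within
  "Groups data by group-key, then counts the values of count-key per group."
  [group-key count-key data]
  (reduce-kv (fn [m k group]
               ;; frequencies turns ("M" "M" "P") into {"M" 2, "P" 1}
               (assoc m k (frequencies (map count-key group))))
             {}
             (group-by group-key data)))

(count-within :a :b data)
;; => {"X" {"M" 1}, "Y" {"M" 2, "P" 1}}
```

reduce-kv is used here instead of map plus into so that the group name and its members arrive as separate arguments; either form should give the same result.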

Answer 3

Here I'll only be addressing the workflow aspect of your question, and one way of thinking through the function design. I present only one way out of many, but I think this way is sufficiently idiomatic. If you're looking for an implementation, amalloy provided a fine one.

The problem you pose is a perfect use-case for recursion. You want to build a nested structure where each level of nesting (except for the last) just follows the same grouping process on the previous grouping result. The last level of nesting instead performs a count. And you don't know in advance how many levels of nesting there will be.

You're throwing away the :c and the :d, so you might as well do that at the start -- it's logically a distinct processing step.

Let's assume you've written your function (call it foo -- I leave its writing as an exercise for the reader). It can construct the nested structure in terms of recursive calls to itself.

Let's take your example data set:

(def DATA [{:a "X", :b "M", :c "K", :d 10}
           {:a "Y", :b "M", :c "K", :d 20}
           {:a "Y", :b "M", :c "F", :d 30}
           {:a "Y", :b "P", :c "G", :d 40}])

Let's ignore :d, so our pre-processed set looks like:

(def filtered-data [{:a "X", :b "M", :c "K"}
                    {:a "Y", :b "M", :c "K"}
                    {:a "Y", :b "M", :c "F"}
                    {:a "Y", :b "P", :c "G"}])
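Rather than typing the pre-processed set by hand, it can be derived from the original data. A minimal sketch, assuming the only unwanted key is :d (the lowercase name data is just a copy of the question's sample):

```clojure
;; Sample data copied from the question.
(def data [{:a "X", :b "M", :c "K", :d 10}
           {:a "Y", :b "M", :c "K", :d 20}
           {:a "Y", :b "M", :c "F", :d 30}
           {:a "Y", :b "P", :c "G", :d 40}])

;; Drop the unwanted key from every map as a separate pre-processing step.
(def filtered-data (mapv #(dissoc % :d) data))

;; Equivalent here, but phrased as a whitelist of the keys to keep:
(def filtered-data-2 (mapv #(select-keys % [:a :b :c]) data))
```

dissoc names what to drop and select-keys names what to keep; which one reads better depends on whether the input maps may carry other keys.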

Example

And here's an example "query":

(foo filtered-data
     [:a :b :c])

We want it to spit out a nested structure that looks a bit like this:

[{ :X (foo [{:b "M", :c "K"}]
           [:b :c]),
   :Y (foo [{:b "M", :c "K"}
            {:b "M", :c "F"}
            {:b "P", :c "G"}]
           [:b :c])}]

This in turn is equivalent to:

[{ :X [{:M (foo [{:c "K"}]
                [:c])}],
   :Y [{:M (foo [{:c "K"}
                 {:c "F"}]
                [:c]),
        :P (foo [{:c "G"}]
                [:c])}]
}]

These foo calls can easily recognize the end of the recursion (only one key is left) and switch to a counting behavior:

[{ :X [{:M [{:K 1}]}],
   :Y [{:M [{:F 1}
            {:K 1}],
        :P [{:G 1}]
      }]
}]

Personally, if I were building up such a structure, I'd target one with less "superfluous" nesting, such as this trie:

{"X" {"M" {"K" 1}},
 "Y" {"M" {"F" 1, "K" 1},
      "P" {"G" 1}}}

But I don't know your use case and whether these are truly superfluous. And if you might want to use this data to produce more than one statistic, then see how amalloy made a condensed structure from which you could derive counts, or anything else.
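For readers who want to check their own attempt, here is one possible sketch of the recursive idea, targeting the flatter trie shape rather than the vector-wrapped one. The name nested-counts is hypothetical (it stands in for foo), and the sketch assumes the key vector is non-empty.

```clojure
(def filtered-data [{:a "X", :b "M", :c "K"}
                    {:a "Y", :b "M", :c "K"}
                    {:a "Y", :b "M", :c "F"}
                    {:a "Y", :b "P", :c "G"}])

;; nested-counts is a hypothetical name standing in for foo.
(defn nested-counts
  "Groups xs by the first key and recurses on the rest; at the last key
  it counts occurrences of each value instead of grouping."
  [xs [k & more]]
  (if (seq more)
    ;; More keys remain: group on k, then recurse into each group.
    (into {}
          (for [[v group] (group-by k xs)]
            [v (nested-counts group more)]))
    ;; Last key: switch to counting behavior.
    (frequencies (map k xs))))

(nested-counts filtered-data [:a :b :c])
;; => {"X" {"M" {"K" 1}}, "Y" {"M" {"K" 1, "F" 1}, "P" {"G" 1}}}
```

The recursion depth equals the number of keys, so the number of nesting levels does not need to be known in advance.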

3 answers

solution1: 3, ACCEPTED, 2016-03-23 18:57:23
solution2: 2, 2016-03-23 16:45:25
solution3: 1, 2016-03-23 18:56:43
