
Clojure: Chaining group-by :key with select-keys on remaining keys

I'm trying to understand a workflow with Clojure maps that is simple in other languages.

It basically comes down to this: how can I chain these operations?

  1. group-by :key on a vector of maps

  2. select-keys on remaining maps without the previous key

  3. group-by again (0..n times) and select-keys

  4. count unique key instances at the end.

See also my previous question: Aggregate and Count in Maps

Example:

Given a vector of maps

(def DATA [{:a "X", :b "M", :c "K", :d 10}
           {:a "Y", :b "M", :c "K", :d 20}
           {:a "Y", :b "M", :c "F", :d 30}
           {:a "Y", :b "P", :c "G", :d 40}])

performing group-by

(defn get-tree-level-1 [] (group-by :a DATA))

yields a map grouped by the value of that particular key.

{"X" [{:a "X", :b "M", :c "K", :d 10}],
 "Y" [{:a "Y", :b "M", :c "K", :d 20}
      {:a "Y", :b "M", :c "F", :d 30}
      {:a "Y", :b "P", :c "G", :d 40}]}

So far, so good. But what if I want to build a tree-like structure out of the data? That means selecting some of the remaining keys and ignoring others, e.g. selecting :b and :c and ignoring :d, which would yield the following at the next level:

(def DATA2   [{ :X [{:b "M", :c "K"}],
                :Y [{:b "M", :c "K"}
                    {:b "M", :c "F"}
                    {:b "P", :c "G"}]}])

And finally, counting all instances of the remaining keys (e.g. counting how often each value of the :b key occurs under the Y root):

(def DATA3   [{ :X [{:M  1}],
                :Y [{:M  2}
                    {:P  1}]}])

I tried doing a select-keys after the group-by , but the result was empty after the first step:

(defn get-proc-sums []
  (into {}
    (map
      (fn [ [k vs] ]
        [k (select-keys vs [:b :c])])
      (group-by :a DATA))))

Repeated application of group-by is the wrong tool: it doesn't compose with itself very well. Rather, go over your input maps and transform each of them, once, into a format that's useful to you (using for or map ), and then reduce over that to build your tree structure. Here is a simple implementation:

(defn hierarchy [keyseq xs]
  (reduce (fn [m [ks x]]
            (update-in m ks conj x))
          {}
          (for [x xs]
            [(map x keyseq) (apply dissoc x keyseq)])))

user> (hierarchy [:a :b :c] '[{:a "X", :b "M", :c "K", :d 10}
                              {:a "Y", :b "M", :c "K", :d 20}
                              {:a "Y", :b "M", :c "F", :d 30}
                              {:a "Y", :b "P", :c "G", :d 40}])
{"Y" {"P" {"G" ({:d 40})},
      "M" {"F" ({:d 30}),
           "K" ({:d 20})}},
 "X" {"M" {"K" ({:d 10})}}}

This gives you the hierarchical format that you want, with a list of all maps with only the "leftover" keys. From this, you can count them, distinct them, remove the :d key, or whatever else you want, either by writing another function that processes this map, or by adjusting what happens in the reduce function, or the for comprehension, above.
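As a minimal sketch of the "another function that processes this map" option, the leaf lists can be replaced by their sizes. The helper name count-leaves and the data var are assumptions made for illustration and are not part of the answer above; hierarchy is repeated only so the snippet runs on its own.

```clojure
(def data [{:a "X", :b "M", :c "K", :d 10}
           {:a "Y", :b "M", :c "K", :d 20}
           {:a "Y", :b "M", :c "F", :d 30}
           {:a "Y", :b "P", :c "G", :d 40}])

;; Same function as in the answer above, repeated for self-containment.
(defn hierarchy [keyseq xs]
  (reduce (fn [m [ks x]]
            (update-in m ks conj x))
          {}
          (for [x xs]
            [(map x keyseq) (apply dissoc x keyseq)])))

;; Hypothetical helper: walk the nested map and replace every leaf
;; (a list of leftover maps) with the number of items it holds.
(defn count-leaves [tree]
  (if (map? tree)
    (into {} (map (fn [[k v]] [k (count-leaves v)]) tree))
    (count tree)))

(count-leaves (hierarchy [:a :b] data))
;; => {"X" {"M" 1}, "Y" {"M" 2, "P" 1}}
```

Grouping by [:a :b] only keeps :c and :d in the leftover maps, so the count per :b value matches what the question asks for.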

The mistake is that you are calling select-keys on the whole collection of values, when you should call it on every item in that collection, e.g. using map:

(defn get-proc-sums []
  (into {}
        (map
         (fn [ [k vs] ]
           [k (map #(select-keys % [:b :c]) vs)])
         (group-by :a DATA))))

user> (get-proc-sums)
{"X" ({:b "M", :c "K"}), 
 "Y" ({:b "M", :c "K"} {:b "M", :c "F"} {:b "P", :c "G"})}

Here is what your original code does, step by step:

user> (group-by :a DATA)
{"X" [{:a "X", :b "M", :c "K", :d 10}], 
 "Y" [{:a "Y", :b "M", :c "K", :d 20} 
      {:a "Y", :b "M", :c "F", :d 30} 
      {:a "Y", :b "P", :c "G", :d 40}]}

Then you process every key-value pair (let's take the "Y" pair as an example):

user> (let [[k vals] ["Y" ((group-by :a DATA) "Y")]]
         [k vals])
["Y" [{:a "Y", :b "M", :c "K", :d 20} 
      {:a "Y", :b "M", :c "F", :d 30} 
      {:a "Y", :b "P", :c "G", :d 40}]]

So you end up calling select-keys on a vector of maps:

user> (select-keys [{:a "Y", :b "M", :c "K", :d 20} 
                    {:a "Y", :b "M", :c "F", :d 30} 
                    {:a "Y", :b "P", :c "G", :d 40}]
                   [:a :b])
{}

which is logical, since the vector itself has no such keys. Mapping select-keys over its items gives what you want:

user> (map #(select-keys % [:a :b]) [{:a "Y", :b "M", :c "K", :d 20} 
                                     {:a "Y", :b "M", :c "F", :d 30} 
                                     {:a "Y", :b "P", :c "G", :d 40}])
({:a "Y", :b "M"} {:a "Y", :b "M"} {:a "Y", :b "P"})
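To make the empty result less surprising, here is a small sketch of how select-keys treats a vector: a vector is associative on its integer indices, so only index keys can match. The var name rows is an assumption made for illustration.

```clojure
(def rows [{:a "Y", :b "M"} {:a "Y", :b "P"}])

;; Integer keys are looked up as vector indices.
(select-keys rows [0])
;; => {0 {:a "Y", :b "M"}}

;; Keywords are not valid indices, so nothing is selected.
(select-keys rows [:a :b])
;; => {}

;; Applying select-keys to each item is what was intended.
(mapv #(select-keys % [:b]) rows)
;; => [{:b "M"} {:b "P"}]
```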

Update: to fulfill the whole task, I would propose the following:

(defn process-data [data]
  (->> data
       (group-by :a)
       (map (fn [[k vals]] [k (frequencies (map :b vals))]))
       (into {})))

user> (process-data DATA)
{"X" {"M" 1}, "Y" {"M" 2, "P" 1}}

Here I'll only be addressing the workflow aspect of your question, and one way of thinking through the function design. I present only one way out of many, but I think this way is sufficiently idiomatic. If you're looking for an implementation, amalloy provided a fine one.

The problem you pose is a perfect use-case for recursion. You want to build a nested structure where each level of nesting (except for the last) just follows the same grouping process on the previous grouping result. The last level of nesting instead performs a count. And you don't know in advance how many levels of nesting there will be.

You're throwing away some of the keys anyway (such as :d), so you might as well do that at the start -- it's logically a distinct processing step.

Let's assume you've written your function (call it foo -- I leave writing it as an exercise for the reader). It can construct the nested structure in terms of recursive calls to itself.

Let's take your example data set:

(def DATA [{:a "X", :b "M", :c "K", :d 10}
           {:a "Y", :b "M", :c "K", :d 20}
           {:a "Y", :b "M", :c "F", :d 30}
           {:a "Y", :b "P", :c "G", :d 40}])

Let's ignore :d , so our pre-processed set looks like:

(def filtered-data [{:a "X", :b "M", :c "K"}
                    {:a "Y", :b "M", :c "K"}
                    {:a "Y", :b "M", :c "F"}
                    {:a "Y", :b "P", :c "G"}])

Example

And here's an example "query":

(foo filtered-data
     [:a :b :c])

We want it to spit out a nested structure that looks a bit like this:

[{ :X (foo [{:b "M", :c "K"}]
           [:b :c]),
   :Y (foo [{:b "M", :c "K"}
            {:b "M", :c "F"}
            {:b "P", :c "G"}]
           [:b :c])}]

This in turn is equivalent to:

[{ :X [{:M (foo [{:c "K"}]
                [:c])}],
   :Y [{:M (foo [{:c "K"}
                 {:c "F"}]
                [:c]),
        :P (foo [{:c "G"}]
                [:c])}]
}]

These calls to foo can easily recognize the end of the recursion and switch to a counting behavior:

[{ :X [{:M [{:K 1}]}],
   :Y [{:M [{:F 1}
            {:K 1}],
        :P [{:G 1}]
      }]
}]

Personally, if I were building up such a structure, I'd target one with less "superfluous" nesting, such as this trie:

{"X" {"M" {"K" 1}},
 "Y" {"M" {"F" 1, "K" 1},
      "P" {"G" 1}}}

But I don't know your use case and whether these are truly superfluous. And if you might want to use this data to produce more than one statistic, then see how amalloy made a condensed structure from which you could derive counts, or anything else.
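For readers who want to compare notes on the exercise, here is one possible recursive sketch that produces the trie shape rather than the vector-of-maps shape. The name nested-counts and the argument order are assumptions made for illustration, not the foo described above; keys that are not listed are simply never looked at, so no separate filtering step is needed for the counts.

```clojure
(def data [{:a "X", :b "M", :c "K", :d 10}
           {:a "Y", :b "M", :c "K", :d 20}
           {:a "Y", :b "M", :c "F", :d 30}
           {:a "Y", :b "P", :c "G", :d 40}])

;; Hypothetical helper: group by the first key, recurse into each group
;; with the remaining keys, and count once no keys are left.
(defn nested-counts [ks maps]
  (if (empty? ks)
    (count maps)
    (let [[k & more] ks]
      (into {}
            (map (fn [[v group]] [v (nested-counts more group)])
                 (group-by k maps))))))

(nested-counts [:a :b :c] data)
;; => {"X" {"M" {"K" 1}}, "Y" {"M" {"K" 1, "F" 1}, "P" {"G" 1}}}
```

Calling it with [:a :b] instead stops one level earlier and counts per :b value under each :a value.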
