
Ruby: group_by operation on an array of hashes

I have an array of hashes that represent compounds stored in boxes.

database = [{"Name"=>"Compound1", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound3", "Box"=>1},
            {"Name"=>"Compound4", "Box"=>1},
            {"Name"=>"Compound5", "Box"=>2},
            {"Name"=>"Compound6", "Box"=>2},
            {"Name"=>"Compound1", "Box"=>3},
            {"Name"=>"Compound2", "Box"=>3},
            {"Name"=>"Compound3", "Box"=>3},
            {"Name"=>"Compound7", "Box"=>4}]

I would like to select a subset of the array that covers the full inventory of compounds (i.e., Compound1 to Compound7) while using the minimum number of boxes. The result would be:

database = [{"Name"=>"Compound1", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound3", "Box"=>1},
            {"Name"=>"Compound4", "Box"=>1},
            {"Name"=>"Compound5", "Box"=>2},
            {"Name"=>"Compound6", "Box"=>2},
            {"Name"=>"Compound7", "Box"=>4}]

I can use the following to group compounds per box:

database.group_by{|x| x['Box']}

However, I have trouble reducing the grouped result so that duplicate compound names are removed.

With Ruby >= 2.4 we can use transform_values :

database.group_by { |hash| hash["Name"] }
        .transform_values { |v| v.min_by { |h| h["Box"] } }
        .values

Or if you have Ruby < 2.4 you can do:

database.group_by {|hash| hash["Name"] }.map { |_,v| v.min_by {|h| h["Box"]} }

Key methods: group_by, transform_values (Ruby >= 2.4) and min_by. See the Ruby docs for more info.
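One caveat worth checking: picking the lowest-numbered box for each compound happens to give the desired result for the data in the question, but it does not in general minimise the number of boxes. A minimal sketch with hypothetical data (`small_db` is not from the question) in which box 2 alone would cover everything:

```ruby
# Hypothetical data: box 2 holds both compounds, box 1 holds only "A".
small_db = [{"Name"=>"A", "Box"=>1},
            {"Name"=>"A", "Box"=>2},
            {"Name"=>"B", "Box"=>2}]

# Same technique as above: lowest box number per compound name.
picked = small_db.group_by { |h| h["Name"] }
                 .map { |_name, hashes| hashes.min_by { |h| h["Box"] } }

# Boxes actually used by that selection.
picked_boxes = picked.map { |h| h["Box"] }.uniq
  #=> [1, 2]  (two boxes, although box 2 alone covers "A" and "B")
```

If the minimum number of boxes really matters, a search over combinations of boxes is needed instead.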

You could try with Array#uniq :

database = [{name: "Compound1", box: 1}, {name: "Compound2", box: 1}, {name: "Compound2", box: 1}, {name: "Compound3", box: 1}, {name: "Compound4", box: 1}, {name: "Compound5", box: 2}, {name: "Compound6", box: 2}, {name: "Compound1", box: 3}, {name: "Compound2", box: 3}, {name: "Compound3", box: 3}, {name: "Compound7", box: 4}]

p database.uniq { |h| h[:name] }
# =>  [
#   {:name=>"Compound1", :box=>1}, 
#   {:name=>"Compound2", :box=>1}, 
#   {:name=>"Compound3", :box=>1}, 
#   {:name=>"Compound4", :box=>1}, 
#   {:name=>"Compound5", :box=>2}, 
#   {:name=>"Compound6", :box=>2}, 
#   {:name=>"Compound7", :box=>4}
# ]

Or:

p database.group_by { |h| h[:box] }.each { |_box, hashes| hashes.uniq! { |h| h[:name] } }

# => {
#   1=>[
#     {:name=>"Compound1", :box=>1},
#     {:name=>"Compound2", :box=>1},
#     {:name=>"Compound3", :box=>1},
#     {:name=>"Compound4", :box=>1}
#   ], 
#   2=>[
#     {:name=>"Compound5", :box=>2}, 
#     {:name=>"Compound6", :box=>2}
#   ],
#   3=>[
#     {:name=>"Compound1", :box=>3},
#     {:name=>"Compound2", :box=>3},
#     {:name=>"Compound3", :box=>3}
#   ],
#   4=>[
#     {:name=>"Compound7", :box=>4}
#   ]
# }
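The same idea applies to the string-keyed hashes from the question. A minimal sketch: Array#uniq keeps the first element seen for each block value, so it returns the lowest box per compound only because the sample data is already ordered by box. The `by_box_first` variant (a hypothetical name) sorts first so that it does not rely on the input order:

```ruby
database = [{"Name"=>"Compound1", "Box"=>1}, {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1}, {"Name"=>"Compound3", "Box"=>1},
            {"Name"=>"Compound4", "Box"=>1}, {"Name"=>"Compound5", "Box"=>2},
            {"Name"=>"Compound6", "Box"=>2}, {"Name"=>"Compound1", "Box"=>3},
            {"Name"=>"Compound2", "Box"=>3}, {"Name"=>"Compound3", "Box"=>3},
            {"Name"=>"Compound7", "Box"=>4}]

# Keeps the first hash encountered for each compound name.
deduped = database.uniq { |h| h["Name"] }

# Sorting by box first makes "first encountered" mean "lowest box",
# even if the input array arrives in a different order.
by_box_first = database.shuffle.sort_by { |h| h["Box"] }.uniq { |h| h["Name"] }
```

Like the group_by approach, this removes duplicate names but does not by itself guarantee the smallest possible set of boxes.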

The essence of the problem is to find a minimal-size combination of boxes that includes ("covers") all of a set of specified compounds. That combination of boxes is then used to compute the objects of interest, as shown below.

Code

def min_box(database, coverage)
  # Map each box to the names of the compounds it holds.
  boxes_to_compounds = database.each_with_object(Hash.new {|h,k| h[k]=[]}) { |g,h|
    h[g["Box"]] << g["Name"] }
  boxes = boxes_to_compounds.keys
  # Try combinations of 1 box, then 2, and so on, up to and including all boxes.
  (1..boxes.size).each do |n|
    boxes.combination(n).each do |combo|
      return combo if (coverage - combo.flat_map { |box| boxes_to_compounds[box] }).empty?
    end
  end
  nil
end

coverage is a given array of the names of the required compounds (e.g., ["Compound1", "Compound3"]).

Example

Suppose we are given database as given in the question and

coverage = ["Compound1", "Compound2", "Compound3", "Compound4",
            "Compound5", "Compound6", "Compound7"] 

An optimal combination of boxes is then found to be

combo = min_box(database, coverage)
  #=> [1, 2, 4]

We may now compute the desired array of elements of database :

database.select { |h| combo.include?(h["Box"]) }.uniq
  #=> [{"Name"=>"Compound1", "Box"=>1}, {"Name"=>"Compound2", "Box"=>1},
  #    {"Name"=>"Compound3", "Box"=>1}, {"Name"=>"Compound4", "Box"=>1},
  #    {"Name"=>"Compound5", "Box"=>2}, {"Name"=>"Compound6", "Box"=>2},
  #    {"Name"=>"Compound7", "Box"=>4}] 

min_box explanation

Finding an optimal combination of boxes is a hard problem: it is an instance of minimum set cover, which is NP-hard. Some form of enumeration of combinations of boxes is therefore required. I begin by determining whether a single box provides the required coverage of compounds. If one of the boxes does, an optimal solution has been found and the method returns an array containing that box. If no single box covers all the required compounds, I look at all combinations of two boxes. If one of those combinations provides the required coverage, it is an optimal solution and an array of those boxes is returned; otherwise combinations of three boxes are considered, and so on. Eventually either an optimal combination is found or it is concluded that all boxes together do not provide the required coverage, in which case the method returns nil.

For the example above, the calculations are as follows.

boxes_to_compounds = database.each_with_object(Hash.new {|h,k| h[k]=[]}) { |g,h|
  h[g["Box"]] << g["Name"] }
  #=> {1=>["Compound1", "Compound2", "Compound2", "Compound3", "Compound4"],
  #    2=>["Compound5", "Compound6"],
  #    3=>["Compound1", "Compound2", "Compound3"],
  #    4=>["Compound7"]}
boxes = boxes_to_compounds.keys
  #=> [1, 2, 3, 4]
boxes.size
  #=> 4

Each candidate combination size n is passed in turn to the outer each block. Consider n = 3, that is, combinations of three boxes.

n = 3
e = boxes.combination(n)
  #=> #<Enumerator: [1, 2, 3, 4]:combination(3)> 

We may see the objects that will be generated by this enumerator and passed to the inner each block by converting it to an array.

e.to_a
  #=> [[1, 2, 3], [1, 2, 4], [1, 3, 4], [2, 3, 4]] 

The first element generated by e is passed to the block and the following is computed.

combo = e.next
  #=> [1, 2, 3]
a = combo.flat_map { |box| boxes_to_compounds[box] }
  #=> ["Compound1", "Compound2", "Compound2", "Compound3", "Compound4",
  #    "Compound5", "Compound6", "Compound1", "Compound2", "Compound3"] 
b = coverage - a  
  #=> ["Compound7"] 
b.empty?
  #=> false 

As that combination of boxes does not include "Compound7" we press on and pass the next element generated by e to the block.

combo = e.next
  #=> [1, 2, 4] 
a = combo.flat_map { |box| boxes_to_compounds[box] }
  #=> ["Compound1", "Compound2", "Compound2", "Compound3", "Compound4",
  #    "Compound5", "Compound6", "Compound7"] 
b = coverage - a  
  #=> [] 
b.empty?
  #=> true 

We therefore have found an optimal combination of boxes, [1, 2, 4] , which is returned by the method.
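Because the exhaustive search grows exponentially with the number of boxes, a greedy heuristic is a common fallback for larger inventories. The sketch below is not part of the method above and is not guaranteed to return a minimum-size combination; `greedy_boxes` is a hypothetical name. It repeatedly takes the box that covers the most compounds still missing:

```ruby
database = [{"Name"=>"Compound1", "Box"=>1}, {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1}, {"Name"=>"Compound3", "Box"=>1},
            {"Name"=>"Compound4", "Box"=>1}, {"Name"=>"Compound5", "Box"=>2},
            {"Name"=>"Compound6", "Box"=>2}, {"Name"=>"Compound1", "Box"=>3},
            {"Name"=>"Compound2", "Box"=>3}, {"Name"=>"Compound3", "Box"=>3},
            {"Name"=>"Compound7", "Box"=>4}]
coverage = (1..7).map { |i| "Compound#{i}" }

# Greedy heuristic (hypothetical helper): fast, but only approximate in general.
def greedy_boxes(database, coverage)
  boxes_to_compounds = Hash.new { |h, k| h[k] = [] }
  database.each { |g| boxes_to_compounds[g["Box"]] << g["Name"] }

  remaining = coverage.uniq
  chosen = []
  until remaining.empty?
    # Pick the box that covers the most still-missing compounds.
    best = boxes_to_compounds.max_by { |_b, ns| (ns & remaining).size }
    # Give up if there are no boxes or no box adds anything new.
    return nil if best.nil? || (best[1] & remaining).empty?
    chosen << best[0]
    remaining -= best[1]
  end
  chosen
end

greedy_combo = greedy_boxes(database, coverage)
  #=> [1, 2, 4]
```

For this data the greedy choice coincides with the optimal combination found by min_box; on other inputs it may use more boxes than necessary.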

I don't like that original data structure. Why not just start with a hash of {CompoundX => BoxY}, since the "Name" and "Box" keys are not really useful? But if you're married to that structure, here's how I would do it:

database = [{"Name"=>"Compound1", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound3", "Box"=>1},
            {"Name"=>"Compound4", "Box"=>1},
            {"Name"=>"Compound5", "Box"=>2},
            {"Name"=>"Compound6", "Box"=>2},
            {"Name"=>"Compound1", "Box"=>3},
            {"Name"=>"Compound2", "Box"=>3},
            {"Name"=>"Compound3", "Box"=>3},
            {"Name"=>"Compound7", "Box"=>4}]

# Build a hash of compound name => array of box numbers that hold it.
new_db_hash = Hash.new { |hash, name| hash[name] = [] }
database.each { |h| new_db_hash[h["Name"]] << h["Box"] }

new_db_hash
boxes = new_db_hash.values
combos = boxes[0].product(*boxes[1..-1])
combos = combos.sort_by{|a| a.uniq.length }
winning_combo = combos[0].uniq

The bulk of the work is just transforming the data structure into a hash of the form Compound => [box numbers]. Then you generate every way of choosing one box per compound, sort those choices by the number of unique boxes they use, and take the one with the fewest unique boxes as the answer. This will not scale well to large datasets: the number of combinations is the product of the number of boxes available for each compound, so it grows exponentially with the number of compounds.


 