Ruby: counting records in an array

I have two files: (1) the first contains the users of the system, which I read into an array; (2) the second contains statistics about those users.

My task is to have a user count, something like

{"user1" => 1, "user2" => 0, "user3" => 4}

This is how I solved the problem:

# Result wanted
# Given the names and stats array generate the results array
# result = {'user1' => 3, 'user2' => 2, 'user3' => 0, 'user4' => 1}

names = ['user1', 'user2', 'user3', 'user4']
stats = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2', 'xxx']

hash = Hash[names.map {|v| [v, 0]}] # to make sure every name gets a value

stats.each do |item| # basic loop to count the records
   hash[item] += 1 if hash.has_key?(item)
end

puts hash

# terminal outcome
# $ ruby example.rb 
# {"user1"=>3, "user2"=>2, "user3"=>0, "user4"=>1}

I'm only curious whether there is a better way than counting in a loop, especially since Ruby comes with magical powers and I come from a C background.

Basically, your code is about as fast as this can get, apart from some minor issues.

If you have an unneeded entry marking the end of the array

stats = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2', 'xxx']

I think you should pop it off before counting: it can produce a spurious entry, and its presence forces you to use the conditional test in your loop, which slows your code.

stats = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2', 'xxx']
stats.pop # => "xxx"
stats # => ["user1", "user1", "user1", "user2", "user4", "user2"]

Built-in methods exist that reduce the code to a single call, but they're slower than a loop:

stats.group_by{ |e| e } # => {"user1"=>["user1", "user1", "user1"], "user2"=>["user2", "user2"], "user4"=>["user4"]}

From there it's easy to map the resulting hash into summaries:

stats.group_by{ |e| e }.map{ |k, v| [k, v.size] } # => [["user1", 3], ["user2", 2], ["user4", 1]]

And then into a hash again:

stats.group_by{ |e| e }.map{ |k, v| [k, v.size] }.to_h # => {"user1"=>3, "user2"=>2, "user4"=>1}

or:

Hash[stats.group_by{ |e| e }.map{ |k, v| [k, v.size] }] # => {"user1"=>3, "user2"=>2, "user4"=>1}

Using the built-in methods is efficient and very useful when you're dealing with very large lists, because there's very little redundant looping going on.
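As a further built-in option not covered above, newer Rubies provide Enumerable#tally (Ruby 2.7 and later), which counts every element in one pass. The sketch below assumes such a Ruby version and reuses the names and stats arrays from the question; the variable names counts and result are only illustrative. It has not been benchmarked against the loop.

```ruby
names = ['user1', 'user2', 'user3', 'user4']
stats = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2', 'xxx']

# Count every entry in a single pass (Enumerable#tally, Ruby 2.7+).
# The stray 'xxx' marker is counted too at this point.
counts = stats.tally

# Look up each known name, defaulting to 0 for users with no stats.
# Entries that are not in names (such as 'xxx') are never looked up,
# so they drop out without a conditional in the counting pass.
result = names.to_h { |name| [name, counts.fetch(name, 0)] }

puts result.inspect
# result == {"user1" => 3, "user2" => 2, "user3" => 0, "user4" => 1}
```

Because the final hash is built from names, users without any stats still appear with a count of 0, which the group_by variants above do not give you.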

Looping over the data as you did is also very fast, and usually faster than the built-in methods when written correctly. Here are some benchmarks comparing alternate ways of building the initial hash:

require 'fruity'  # => true

names = ['user1', 'user2', 'user3', 'user4']
stats = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2']

Hash[names.map {|v| [v, 0]}]                    # => {"user1"=>0, "user2"=>0, "user3"=>0, "user4"=>0}
Hash[names.zip([0] * names.size )]              # => {"user1"=>0, "user2"=>0, "user3"=>0, "user4"=>0}
names.zip([0] * names.size ).to_h               # => {"user1"=>0, "user2"=>0, "user3"=>0, "user4"=>0}
hash = {}; names.each{ |k| hash[k] = 0 }; hash  # => {"user1"=>0, "user2"=>0, "user3"=>0, "user4"=>0}

compare do 
  map_hash { Hash[names.map {|v| [v, 0]}] }
  zip_hash { Hash[names.zip([0] * names.size )] }
  to_h_hash { names.zip([0] * names.size ).to_h }
  hash_braces { hash = {}; names.each{ |k| hash[k] = 0 }; hash }
end

# >> Running each test 2048 times. Test will take about 1 second.
# >> hash_braces is faster than map_hash by 50.0% ± 10.0%
# >> map_hash is faster than to_h_hash by 19.999999999999996% ± 10.0%
# >> to_h_hash is faster than zip_hash by 10.000000000000009% ± 10.0%

Looking at the conditional in the loop to see how it affects the code:

require 'fruity'  # => true

NAMES = ['user1', 'user2', 'user3', 'user4']
STATS = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2', 'xxx']
STATS2 = STATS[0 .. -2]

def build_hash
  h = {}
  NAMES.each{ |k| h[k] = 0 }
  h
end

compare do 

  your_way {
    hash = build_hash()
    STATS.each do |item| # basic loop to count the records
      hash[item] += 1 if hash.has_key?(item)
    end
    hash
  }

  my_way {
    hash = build_hash()
    STATS2.each { |e| hash[e] += 1 }
    hash
  }
end

# >> Running each test 512 times. Test will take about 1 second.
# >> my_way is faster than your_way by 27.0% ± 1.0%
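If the fruity gem is not available, a similar comparison can be sketched with the Benchmark module from Ruby's standard library. This is only an approximation of the test above: the lambda names and the iteration count are arbitrary choices, and the timings it prints will differ from machine to machine.

```ruby
require 'benchmark'

names   = ['user1', 'user2', 'user3', 'user4']
stats   = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2', 'xxx']
trimmed = stats[0..-2] # same data without the trailing 'xxx'

# Build a fresh zeroed hash for every run.
build_hash = -> { names.each_with_object({}) { |k, h| h[k] = 0 } }

# Counting loop that guards against unknown entries.
with_check = lambda do
  hash = build_hash.call
  stats.each { |item| hash[item] += 1 if hash.key?(item) }
  hash
end

# Counting loop without the guard, run on the trimmed data.
without_check = lambda do
  hash = build_hash.call
  trimmed.each { |item| hash[item] += 1 }
  hash
end

iterations = 100_000 # arbitrary; raise or lower to taste
Benchmark.bm(14) do |x|
  x.report('with check')    { iterations.times { with_check.call } }
  x.report('without check') { iterations.times { without_check.call } }
end
```

Both lambdas return the same hash; only the time spent on the has_key? style check differs.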

While several answers suggested using count, that code will slow down a lot as your lists grow, because count rescans the whole stats array once for every name. Walking the stats array once, as you are doing, will always be linear, so stick to one of these iterative solutions.
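To make the difference concrete, here is a small sketch that counts how many elements each approach inspects rather than timing it. The counters count_visits and loop_visits are added purely for illustration, and the stats array is assumed to have had its trailing 'xxx' removed.

```ruby
names = ['user1', 'user2', 'user3', 'user4']
stats = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2']

# count-based: Array#count walks all of stats once per name.
count_visits = 0
by_count = names.to_h do |name|
  n = stats.count { |s| count_visits += 1; s == name }
  [name, n]
end

# single pass: every element of stats is inspected exactly once.
loop_visits = 0
by_loop = names.to_h { |name| [name, 0] }
stats.each do |s|
  loop_visits += 1
  by_loop[s] += 1
end

puts count_visits # 24 (names.size * stats.size)
puts loop_visits  # 6  (stats.size)
```

Both approaches produce the same hash, but the count-based work grows with names.size * stats.size, while the single pass grows only with stats.size.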

names = ['user1', 'user2', 'user3', 'user4']
stats = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2', 'xxx']

stats.each_with_object(Hash.new(0)) { |user, hash|
  hash[user] += 1 if names.include?(user)
}
#=> {"user1"=>3, "user2"=>2, "user4"=>1}

names = ['user1', 'user2', 'user3', 'user4']
stats = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2', 'xxx']

hash = Hash.new

names.each { |name| hash[name] = stats.count(name) }

puts hash

You could use map with Hash[].

names = ['user1', 'user2', 'user3', 'user4']
stats = ['user1', 'user1', 'user1', 'user2', 'user4', 'user2', 'xxx']
hash = Hash[names.map { |name| [name, stats.count(name)] }]
puts hash
