Performance of Arrays and Hashes in Ruby

Question

I have a program that will store many instances of one class, let's say 10,000 or more. The class instances have several properties that I need from time to time, but the most important one is the ID.

class Document
  attr_accessor :id
  def ==(document)
    document.id == self.id
  end
end

Now, what is the fastest way of storing thousands of these objects?

I used to put them all into an array of Documents:

documents = Array.new
documents << Document.new
# etc

Now an alternative would be to store them in a Hash:

documents = Hash.new
doc = Document.new
documents[doc.id] = doc
# etc

In my application, I mostly need to find out whether a document exists at all. Is the Hash's has_key? method significantly faster than a linear search of the Array with a comparison of Document objects? Are both O(n), or is has_key? even O(1)? Will I see the difference?

Also, sometimes I need to add a Document only if it does not already exist. With an Array, I would have to check with include? first; with a Hash, I'd just use has_key? again. Same question as above.

What are your thoughts? What is the fastest method of storing large amounts of data when 90% of the time I only need to know whether the ID exists (not the object itself)?

Answer 1

Hashes are much faster for lookups:

require 'benchmark'
Document = Struct.new(:id,:a,:b,:c)
documents_a = []
documents_h = {}
1.upto(10_000) do |n|
  d = Document.new(n)
  documents_a << d
  documents_h[d.id] = d
end
searchlist = Array.new(1000){ rand(10_000)+1 }

Benchmark.bm(10) do |x|
  x.report('array'){searchlist.each{|el| documents_a.any?{|d| d.id == el}} }
  x.report('hash'){searchlist.each{|el| documents_h.has_key?(el)} }
end

#                user     system      total        real
#array       2.240000   0.020000   2.260000 (  2.370452)
#hash        0.000000   0.000000   0.000000 (  0.000695)

Answer 2

Ruby has a Set class in its standard library. Have you considered keeping an (additional) set of IDs only?

http://stdlib.rubyonrails.org/libdoc/set/rdoc/index.html

To quote the docs: "This is a hybrid of Array's intuitive inter-operation facilities and Hash's fast lookup".
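A minimal sketch of that idea, assuming the documents themselves stay in an array and a separate Set tracks only their IDs. The Struct-based Document and the variable names are stand-ins for illustration, not code from the question:

```ruby
require 'set'

# Hypothetical stand-in for the question's Document class.
Document = Struct.new(:id)

documents = []        # the documents themselves
known_ids = Set.new   # additional index holding only the IDs

[3, 7, 3].each do |id|
  # Set#add? returns nil when the element is already present,
  # so it serves as both the existence check and the insert.
  documents << Document.new(id) if known_ids.add?(id)
end

puts known_ids.include?(7)
puts documents.size
```

The cost is keeping two collections in sync: any code that removes a document also has to remove its ID from the set.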

Answer 3

When using unique values, you can use the Ruby Set which has been previously mentioned. Here are benchmark results. It's slightly slower than the hash though.

                 user     system      total        real
array        0.460000   0.000000   0.460000 (  0.460666)
hash         0.000000   0.000000   0.000000 (  0.000219)
set          0.000000   0.000000   0.000000 (  0.000273)

I simply added to @steenslag's code which can be found here https://gist.github.com/rsiddle/a87df54191b6b9dfe7c9 .

I used ruby 2.1.1p76 for this test.
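For readers who cannot open the gist, the extension might look roughly like the following. This is a reconstruction based on the benchmark in the accepted answer with one extra Set case, not necessarily the gist's actual code, and timings will differ by machine and Ruby version:

```ruby
require 'benchmark'
require 'set'

Document = Struct.new(:id, :a, :b, :c)
documents_a = []
documents_h = {}
ids = Set.new   # assumption: the set holds IDs only
1.upto(10_000) do |n|
  d = Document.new(n)
  documents_a << d
  documents_h[d.id] = d
  ids << d.id
end
searchlist = Array.new(1000) { rand(10_000) + 1 }

Benchmark.bm(10) do |x|
  x.report('array') { searchlist.each { |el| documents_a.any? { |d| d.id == el } } }
  x.report('hash')  { searchlist.each { |el| documents_h.has_key?(el) } }
  x.report('set')   { searchlist.each { |el| ids.include?(el) } }
end
```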

Answer 4

Use a Set of Documents. It has most of the properties you want (constant-time lookup and no duplicates). Smalltalkers would tell you that using a collection that already has the properties you want is most of the battle.
Alternatively, use a Hash of Documents keyed by document id, with ||= for conditional insertion (rather than has_key?).

Hashes are designed for constant-time insertion and lookup. Ruby's Set uses a Hash internally.
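The Hash variant with ||= could be sketched like this. The :title field is a made-up attribute, added only so the two documents can be told apart:

```ruby
# Hypothetical stand-in for the question's Document class.
Document = Struct.new(:id, :title)

documents = {}

first  = Document.new(42, 'original')
second = Document.new(42, 'duplicate')

# ||= assigns only when the key is missing (or maps to nil/false),
# so the second document with the same id is ignored.
documents[first.id]  ||= first
documents[second.id] ||= second

puts documents.key?(42)    # existence check by id
puts documents[42].title
```

Note that ||= relies on stored values never being nil or false, which holds as long as the hash only ever contains Document objects.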

Be aware that your Document objects will need to implement #hash and #eql? properly in order for them to behave as you would expect as Hash keys or members of a set, as these are used to define hash equality.
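A minimal sketch of what that might look like, assuming identity is defined by id alone. The initialize method is an addition for the example; the question's class uses attr_accessor instead:

```ruby
require 'set'

class Document
  attr_reader :id

  def initialize(id)
    @id = id
  end

  # Two documents are considered equal when their ids match.
  def ==(other)
    other.is_a?(Document) && id == other.id
  end
  alias_method :eql?, :==

  # Objects that are eql? must return the same hash value,
  # so delegate to the id's own hash.
  def hash
    id.hash
  end
end

documents = Set.new
documents << Document.new(1)
documents << Document.new(1)   # same id, so the Set ignores it
documents << Document.new(2)

puts documents.size
puts documents.include?(Document.new(2))
```

If id stays mutable, as it is with attr_accessor in the question, changing it after insertion would leave the object filed under its old hash value, so it is safer to treat the id as fixed once the document is in a Set or used as a Hash key.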
