Can the performance of this type-selection be improved?

Question

Assume I get some data like { :type => 'X', :some_other_key => 'foo' } at runtime and, depending on some conditions, want to instantiate the corresponding class for it. We currently do it like this:

TYPE_CLASSES = [
  TypeA,
  TypeB,
  TypeC,
  # ...
  TypeUnknown
]

TYPE_CLASSES.detect {|type| type.responsible_for?(data)}.new

We iterate over a list of classes, ask each one whether it is responsible for the given data, and instantiate the first one that says yes.

The order of TYPE_CLASSES is important, because some responsible_for? methods check not only the type but also other keys inside data. So a specialized class checking for type == 'B' && some_other_key == 'foo' has to come before a generalized class checking only for type == 'B'.

This works fine and is easily extensible, but the TYPE_CLASSES list is already quite long, so in the worst case finding the right type means iterating to the last element and calling responsible_for? on every class along the way.
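To make the ordering constraint concrete, here is a minimal sketch. The class names (TypeBFoo, TypeB), the ORDERED_TYPES constant and the class_for helper are made up for illustration; only responsible_for? and the detect-based lookup come from the setup described above.

```ruby
# Hypothetical classes; the names and the exact checks are assumptions.
class TypeBFoo
  # Specialized: matches on the type and one additional key.
  def self.responsible_for?(data)
    data[:type] == 'B' && data[:some_other_key] == 'foo'
  end
end

class TypeB
  # Generalized: matches on the type alone.
  def self.responsible_for?(data)
    data[:type] == 'B'
  end
end

class TypeUnknown
  # Catch-all fallback, so it has to stay last.
  def self.responsible_for?(_data)
    true
  end
end

# The specialized class is listed before the generalized one.
ORDERED_TYPES = [TypeBFoo, TypeB, TypeUnknown]

# Returns the first class that claims responsibility for the data.
def class_for(data)
  ORDERED_TYPES.detect { |type| type.responsible_for?(data) }
end
```

If TypeB were listed before TypeBFoo, the specialized class would never be selected, because TypeB already accepts every hash with :type => 'B'.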

Is there any way to improve the performance and avoid iterating over each element while still preserving the order of the checks?

Answer 1

If matching the data to classes is as complex as you describe, it might make sense to use a decision-tree-building algorithm (example).

You can use the AI4R library to do that in Ruby.

You probably don't need to build the tree dynamically, so you can use the library once to generate an optimized detection strategy for you. Here is an example from the documentation:

DATA_LABELS = ['city', 'age_range', 'gender', 'marketing_target']
# One row per record, in DATA_LABELS order.
DATA_SET = [
  ['New York',  '<30',      'M',  'Y'],
  ['Chicago',   '<30',      'M',  'Y'],
  ['Chicago',   '<30',      'F',  'Y'],
  ['New York',  '<30',      'M',  'Y'],
  ['New York',  '<30',      'M',  'Y'],
  ['Chicago',   '[30-50)',  'M',  'Y'],
  ['New York',  '[30-50)',  'F',  'N'],
  ['Chicago',   '[30-50)',  'F',  'Y'],
  ['New York',  '[30-50)',  'F',  'N'],
  ['Chicago',   '[50-80]',  'M',  'N'],
  ['New York',  '[50-80]',  'F',  'N'],
  ['New York',  '[50-80]',  'M',  'N'],
  ['Chicago',   '[50-80]',  'M',  'N'],
  ['New York',  '[50-80]',  'F',  'N'],
  ['Chicago',   '>80',      'F',  'Y']
]
id3 = ID3.new(DATA_SET, DATA_LABELS)
id3.get_rules
# => if age_range=='<30' then marketing_target='Y'
#    elsif age_range=='[30-50)' and city=='Chicago' then marketing_target='Y'
#    elsif age_range=='[30-50)' and city=='New York' then marketing_target='N'
#    elsif age_range=='[50-80]' then marketing_target='N'
#    elsif age_range=='>80' then marketing_target='Y'
#    else raise 'There was not enough information during training to do a proper induction for this data element' end

(So you can basically take the generated rules and paste them into your code.)
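As a rough sketch of what "pasting the rules into your code" could look like, the generated rules from the documentation example can be wrapped in an ordinary method. The method name marketing_target and its parameters are my own choice; the conditions are copied from the rules shown above, and the error message is shortened.

```ruby
# Hand-written wrapper around the generated rules (the method name is an assumption).
# Returns 'Y' or 'N', or raises if no rule covers the input.
def marketing_target(city, age_range)
  if age_range == '<30'
    'Y'
  elsif age_range == '[30-50)' && city == 'Chicago'
    'Y'
  elsif age_range == '[30-50)' && city == 'New York'
    'N'
  elsif age_range == '[50-80]'
    'N'
  elsif age_range == '>80'
    'Y'
  else
    raise 'There was not enough information during training for this data element'
  end
end
```

Note that gender does not appear in the rules at all: the tree only tests the attributes it needs, which is where the saving over a linear scan would come from.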

You need to choose enough already-classified records to build DATA_SET and DATA_LABELS, and you need to convert your hashes into arrays. That isn't difficult: your hashes' keys become DATA_LABELS, and each hash's values become one row of the DATA_SET array, with the expected class as the last column.

When you add a new class to TYPE_CLASSES, just rerun the training and update your detection code.
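The hash-to-array conversion might look roughly like this. Everything here is an assumption made for illustration: the label list, the sample records, the 'none' placeholder for missing keys, and the to_row helper. The last label is the class the tree should learn to predict.

```ruby
# Hypothetical labels; the final entry is the target column.
LABELS = ['type', 'some_other_key', 'type_class']

# Hypothetical, already-classified records: [data hash, expected class name].
CLASSIFIED = [
  [{ :type => 'B', :some_other_key => 'foo' }, 'TypeBFoo'],
  [{ :type => 'B', :some_other_key => 'bar' }, 'TypeB'],
  [{ :type => 'A' }, 'TypeA']
]

# Builds one training row: attribute values in label order, then the class name.
# Keys missing from the hash are filled with the placeholder 'none'.
def to_row(data, class_name, labels)
  attribute_keys = labels[0...-1].map(&:to_sym)
  attribute_keys.map { |key| data.fetch(key, 'none').to_s } + [class_name]
end

ROWS = CLASSIFIED.map { |data, class_name| to_row(data, class_name, LABELS) }
```

ROWS and LABELS would then play the roles of DATA_SET and DATA_LABELS in the example above.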
