
How do I improve process performance converting a file to array of hashes?

I am using this method to process a single text file that has about 220,000 lines. It takes a few minutes to process one, but I have lots of them. Are there any recommendations to make this process faster?

def parse_list(file_path, import=false)
  # Parse the fixed-length fields
  if File.exist?(file_path)
    result = []
    File.readlines(file_path)[5..-1].each do |rs|
      if rs.length > 140
        r = rs.strip
        unless r == ''
          filing = {
            'name'     => r[0..50].strip,
            'form'     => r[51..70].strip,
            'type'     => r[71..80].strip,
            'date'     => r[81..90].strip,
            'location' => r[91..-1].strip
          }
          result.push(filing)
        end
      end
    end
    return result
  else
    return false
  end
end

Update:

Originally I thought there were massive time savings from using Nex's and thetinman's methods, so I went on to test them while keeping the parsing method consistent.

Using my original r[].strip parsing method, but with Nex's each_line block method and thetinman's foreach method:

Rehearsal ---------------------------------------------
Nex         8.260000   0.130000   8.390000 (  8.394067)
Thetinman   9.740000   0.120000   9.860000 (  9.862880)
----------------------------------- total: 18.250000sec

                user     system      total        real
Nex        14.270000   0.140000  14.410000 ( 14.397286)
Thetinman  19.030000   0.080000  19.110000 ( 19.118621)
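For reference, timing output in this Rehearsal / user / system / total / real layout is what Benchmark.bmbm prints. A minimal sketch of such a harness, using nex_method and thetinmans_method as stand-ins for the two parsers and a made-up file path:

require 'benchmark'

file_path = '/tmp/foobar'   # made-up sample file

Benchmark.bmbm do |x|
  x.report('Nex')       { nex_method(file_path) }        # wrapper around Nex's parser (assumed)
  x.report('Thetinman') { thetinmans_method(file_path) } # wrapper around thetinman's parser (assumed)
end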

Running again using thetinman's unpack.map parsing method:

Rehearsal ---------------------------------------------
Nex         9.580000   0.120000   9.700000 (  9.694327)
Thetinman  11.470000   0.090000  11.560000 ( 11.567294)
----------------------------------- total: 21.260000sec

                user     system      total        real
Nex        15.480000   0.120000  15.600000 ( 15.599319)
Thetinman  18.150000   0.070000  18.220000 ( 18.217744)

unpack.map(&:strip) vs r[].strip: unpack with map does not seem to increase speed, but it is an interesting method to use in the future.
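For reference, here are the two parsing styles side by side on a single fixed-width record. The field widths are taken from the original code; the sample values are made up:

line = format('%-51s%-20s%-10s%-10s%s',
              'ACME CORP', '10-K', 'annual', '2013-05-01', 'New York NY')

# Slice-and-strip, as in the original method
by_slice  = [line[0..50], line[51..70], line[71..80], line[81..90], line[91..-1]].map(&:strip)

# Fixed-width unpack, as in thetinman's method
by_unpack = line.unpack('A51 A20 A10 A10 A*').map(&:strip)

by_slice == by_unpack   # => true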

I found a different issue: given the substantial time savings I thought I had found, I went on to run Nex's and thetinman's methods manually using pry. This is where I found my computer hanging, just as with my original code. So I went on to test again, but this time including my original code.

Rehearsal ---------------------------------------------
Original    7.980000   0.140000   8.120000 (  8.118340)
Nex         9.460000   0.080000   9.540000 (  9.546889)
Thetinman  10.980000   0.070000  11.050000 ( 11.042459)
----------------------------------- total: 28.710000sec

                user     system      total        real
Original   16.280000   0.140000  16.420000 ( 16.414070)
Nex        15.370000   0.080000  15.450000 ( 15.454174)
Thetinman  20.100000   0.090000  20.190000 ( 20.195533)

My original code and Nex's and thetinman's methods seem comparable, with Nex's being the fastest according to Benchmark. However, Benchmark does not seem to tell the whole story, because running the code manually in pry makes every method take substantially longer, so long that I cancel out before getting the result back.

I have some remaining questions:

  1. Is there something specific about running something like this in IRB/Pry that would produce these strange results, making the code run massively slower?
  2. If I run original_method.count, nex_method.count, or thetinmans_method.count, they all seem to return quickly.
  3. Due to memory issues and scalability, thetinman and nex recommend that the original method not be used. But are there ways to test memory usage in the future with something like Benchmark? (See the sketch after this list.)
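On the memory question, Benchmark itself only measures time. One option is the memory_profiler gem, which reports allocated and retained objects; a minimal sketch, assuming a made-up file path and any one of the three parsing methods:

require 'memory_profiler'

report = MemoryProfiler.report do
  parse_list('/tmp/foobar')   # made-up path; nex_method or thetinmans_method would work the same way
end

report.pretty_print   # totals for allocated/retained memory, broken down by location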

Update for nex, using activerecord-import:

@nex, is this what you mean? This still seems to run slowly for me, but I'm not sure what you mean when you say:

import one set of data inside that block.

How do you recommend modifying it?

def parse_line(line)
  filing = {
    'name'     => line[0..50].strip,
    'form'     => line[51..70].strip,
    'type'     => line[71..80].strip,
    'date'     => line[81..90].strip,
    'location' => line[91..-1].strip
  }
end

def import_files(file_path)
  result = []
  parse_list_nix(file_path) { |line|
    filing = parse_line(line)
    result.push(Filing.new(filing))
  }
  Filing.import result   # result is an array of new records that are all imported at once
end

Results from the activerecord-import method are, as you can see, substantially slower:

Rehearsal ------------------------------------------
import 534.840000   1.860000 536.700000 (553.507644)
------------------------------- total: 536.700000sec

             user     system      total        real
import 263.220000   1.320000 264.540000 (282.751891)

Does this slow import process seem normal?

It just seems super slow to me. I'm trying to figure out how to speed this up, but I am out of ideas.
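One possible reading of "import one set of data inside that block" is to flush records to the database in batches inside the block, instead of building the whole result array first. A rough sketch, reusing parse_list_nix and parse_line from above (the 1,000-row batch size is arbitrary):

def import_files(file_path)
  batch = []
  parse_list_nix(file_path) do |line|
    batch << Filing.new(parse_line(line))
    if batch.size >= 1_000        # flush periodically so memory stays bounded
      Filing.import batch
      batch.clear
    end
  end
  Filing.import batch unless batch.empty?   # import the final partial batch
end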

Without sample data it's hard to confirm this, but, based on the original code, I'd probably write something like this:

require 'English'

# Parse the fixed-length fields
def parse_list(file_path,import=false)

  return false unless File.exist?(file_path)

  result=[]
  File.foreach(file_path) do |rs|
    next unless $INPUT_LINE_NUMBER > 5
    next unless rs.length > 140

    r = rs.strip
    if r > '' 
      name, form, type, date, location = r.unpack('A51 A20 A10 A10 A*').map(&:strip)
      result << {
        'name'     => name,
        'form'     => form,
        'type'     => type,
        'date'     => date,
        'location' => location
      }
    end
  end

  result
end

220,000 lines isn't a big file where I come from. We get log files three times that size by mid-morning, so using any file I/O that slurps the file is out. Ruby's IO class has two methods for line-by-line I/O and a number that return arrays. You want the former because they are scalable; unless you can guarantee that the file being read will fit comfortably in Ruby's memory, avoid the latter.
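To make that distinction concrete, these are the two families of calls (the file name and the process method are placeholders):

# Line by line: only one line is held in memory at a time, so it scales to huge files
File.foreach('huge.log') { |line| process(line) }
File.open('huge.log') { |f| f.each_line { |line| process(line) } }

# Slurping: the whole file is loaded into memory before any work starts
lines = File.readlines('huge.log')   # array of every line
text  = File.read('huge.log')        # one giant string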

The problem is that you're filling up your memory. What are you going to do with that result? Does it have to linger in your memory as a whole or would it be an option to just process it line by line with a block?

Also, you should not use readlines here. Do something like this instead, since it streams the file line by line instead of loading everything into memory at once:

def parse_list(file_path, import=false)
  i = 0
  File.open(file_path, 'r').each_line do |line|
    line.strip!
    next if (i += 1) <= 5 || line.length < 141   # skip the 5 header lines and short lines
    filing = { 'name'     => line[0..50].strip,
               'form'     => line[51..70].strip,
               'type'     => line[71..80].strip,
               'date'     => line[81..90].strip,
               'location' => line[91..-1].strip }
    yield(filing) if block_given?
  end
end

# and calling it like this:
parse_list('/tmp/foobar') { |filing|
  Filing.new(filing).import
}
