Ruby regex into array of hashes but need to drop a key/val pair

I'm trying to parse a file containing a name followed by a hierarchy path. I want to take the named regex matches, turn the group names into Hash keys, and store each match as a hash. Each hash gets pushed to an array (so I'll end up with an array of hashes after parsing the entire file). This part of the code is working, except that I now need to handle bad paths with duplicated hierarchy (top_* is always the top level). It appears that if I use named backreferences in Ruby, I need to name all of the backreferences. I have gotten the match working in Rubular, but now I have the p1 backreference in my resultant hash.

Question: What's the easiest way to not include the p1 key/value pair in the hash? My method is used in other places so we can't assume that p1 always exists. Am I stuck with dropping each key/value pair in the array after calling the s_ary_to_hash method?

NOTE: I'm keeping this question to try and solve the specific issue of ignoring certain hash keys in my method. The regex issue is now in this ticket: Ruby regex - using optional named backreferences

UPDATE: Regex issue is solved, the hier is now always stored in the named 'hier' group. The only item remaining is to figure out how to drop the 'p1' key/value if it exists prior to creating the Hash.

Example file:

name1 top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
new12 top_ab12/hat[1]/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
tops  top_bat/car[0]
ab123 top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog

Expected output:

[{:name => "name1", :hier => "top_cat/mouse/dog/elephant/horse"},
 {:name => "new12", :hier => "top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool"},
 {:name => "tops",  :hier => "top_bat/car[0]"},
 {:name => "ab123", :hier => "top_2/top_1/top_3/top_4/dog"}]

Code snippet:

def s_ary_to_hash(ary, regex)
  retary = Array.new
  ary.each {|x| (retary << Hash[regex.match(x).names.map{|key| key.to_sym}.zip(regex.match(x).captures)]) if regex.match(x)}
  return retary
end

regex = %r{(?<name>\w+) (?<p1>[\w\/\[\]]+)?(?<hier>(\k<p1>.*)|((?<= ).*$))}
h_ary = s_ary_to_hash(File.readlines(filename), regex)

What about this regex?

^(?<name>\S+)\s+(?<p1>top_.+?)(?:\/(?<hier>\k<p1>(?:\[.+?\])?.+))?$

Demo

http://rubular.com/r/awEP9Mz1kB

Sample code

def s_ary_to_hash(ary, regex, mappings)
   retary = Array.new

   for item in ary
      tmp = regex.match(item)
      if tmp then
         hash = Hash.new
         retary.push(hash)
         mappings.each { |mapping|
            mapping.map { |key, groups|
              for group in groups
                 if tmp[group] then
                     hash[key] = tmp[group]
                     break
                 end
              end 
            }
         }
      end
   end

  return retary
end

regex = %r{^(?<name>\S+)\s+(?<p1>top_.+?)(?:\/(?<hier>\k<p1>(?:\[.+?\])?.+))?$}
h_ary = s_ary_to_hash(
   File.readlines(filename), 
   regex,
   [ 
      {:name => ['name']},
      {:hier => ['hier','p1']}
   ]
)

puts h_ary

Output

{:name=>"name1", :hier=>"top_cat/mouse/dog/elephant/horse\r"}
{:name=>"new12", :hier=>"top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool\r"}
{:name=>"tops", :hier=>"top_bat/car[0]"}
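The trailing \r in the first two lines suggests the input file uses CRLF line endings (an assumption based only on this output). A minimal sketch of stripping line endings before matching; raw_lines and clean_lines are illustrative names, not part of the answer:

```ruby
# String#chomp removes a trailing "\n", "\r", or "\r\n".
# raw_lines stands in for File.readlines(filename) on a CRLF file.
raw_lines = ["name1 top_cat/mouse\r\n", "tops  top_bat/car[0]\n"]
clean_lines = raw_lines.map(&:chomp)
```

Passing clean_lines instead of the raw lines to s_ary_to_hash should keep the \r out of the captured hier values.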

Discussion

Since Ruby 2.0.0 doesn't support branch reset, I have built a solution that adds some more power to the s_ary_to_hash function. It now accepts a third parameter indicating how to build the final array of hashes.

This third parameter is an array of hashes. Each hash in this array has one key (K) corresponding to a key in the final hashes. K is associated with an array of named groups to try, in order, from the passed regex (the second parameter of s_ary_to_hash).

If a group is nil, s_ary_to_hash skips it and tries the next group.

If all groups are nil, K is not added to the resulting hash. Feel free to modify s_ary_to_hash if this isn't the desired behavior.
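If the goal is only to leave out a helper group such as p1, a simpler alternative is to filter the name/capture pairs before building the hash. The following is a minimal sketch, not the method above: the name s_ary_to_hash_without and the drop parameter are hypothetical, and the regex is a simplified illustration rather than the one from the question.

```ruby
# Hypothetical variant of s_ary_to_hash: `drop` lists named groups to omit.
# Names in `drop` that the regex does not define are simply ignored.
def s_ary_to_hash_without(lines, regex, drop = [])
  drop = drop.map(&:to_s)
  lines.each_with_object([]) do |line, result|
    m = regex.match(line)
    next unless m # skip lines that do not match
    pairs = m.names.zip(m.captures).reject { |name, _| drop.include?(name) }
    result << Hash[pairs.map { |name, value| [name.to_sym, value] }]
  end
end

# Illustrative regex (an assumption): p1 is a helper group we do not want.
regex = /(?<name>\w+)\s+(?<p1>top_\w+)\/(?<hier>.*)/
lines = ["name1 top_cat/mouse/dog", "no match here"]
result = s_ary_to_hash_without(lines, regex, [:p1])
```

Because drop defaults to an empty array, existing callers that pass only the lines and the regex would keep getting every named group.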

Edit: I've changed the method s_ary_to_hash to conform with what I now understand to be the criterion for excluding directories: directory d is excluded if there is a downstream directory with the same name, or with the same name followed by a non-negative integer in brackets. I've applied that to all directories, though I may have misunderstood the question; perhaps it should apply only to the first.

data =<<THE_END
name1 top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
new12 top_ab12/hat/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
tops  top_bat/car[0]
ab123 top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
THE_END

text = data.split("\n")

def s_ary_to_hash(ary)
  ary.map do |s| 
    name, _, downstream_path = s.partition(' ').map(&:strip)
    arr = []
    downstream_dirs = downstream_path.split('/')
    while downstream_dirs.any? do
      dir = downstream_dirs.shift
      arr << dir unless downstream_dirs.any? { |d|
        d == dir || d =~ /#{dir}\[\d+\]/ }
    end     
    { name: name, hier: arr.join('/') }
  end   
end

s_ary_to_hash(text)
  # => [{:name=>"name1", :hier=>"top_cat/mouse/dog/elephant/horse"},
  #     {:name=>"new12", :hier=>"top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool"},
  #     {:name=>"tops", :hier=>"top_bat/car[0]"},
  #     {:name=>"ab123", :hier=>"top_2/top_1/top_3/top_4/dog"}] 

The exclusion criterion is implemented in downstream_dirs.any? { |d| d == dir || d =~ /#{dir}\[\d+\]/ }, where dir is the directory being tested and downstream_dirs is an array of all the downstream directories. (When dir is the last directory, downstream_dirs is empty.) Localizing it in this way makes it easy to test and change the exclusion criterion. You could shorten this to a single regex and/or make it a method:

def exclude_dir?(dir, downstream_dirs)
  downstream_dirs.any? { |d| d == dir || d =~ /#{dir}\[\d+\]/ }
end
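Note that interpolating dir directly into a regex treats characters such as [ and ] as regex syntax, and the pattern is unanchored, so it can match inside a longer name. A hedged single-regex variant, assuming the criterion is an exact name optionally followed by a bracketed integer, escapes the name and anchors the pattern. The method name exclude_dir_strict? is hypothetical:

```ruby
# Hypothetical stricter variant: dir is matched literally and in full,
# optionally followed by an index such as [1].
def exclude_dir_strict?(dir, downstream_dirs)
  pattern = /\A#{Regexp.escape(dir)}(?:\[\d+\])?\z/
  downstream_dirs.any? { |d| d =~ pattern }
end

hat_has_duplicate = exclude_dir_strict?("hat", ["top_ab12", "hat[1]"])
last_dir_excluded = exclude_dir_strict?("cool", [])
```

With this version a downstream name like xhat[1] would not count as a duplicate of hat, which may or may not be what the original criterion intends.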

Here is a non regexp solution:

result = string.each_line.map do |line|
  name, path = line.split(' ')
  path = path.split('/')
  last_occur_of_root = path.rindex(path.first)
  path = path[last_occur_of_root..-1]
  {name: name, hier: path.join('/')}
end
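A self-contained sketch of the same idea, runnable against two of the sample lines. It assumes the duplicated prefix always restarts at the top-level directory, so the path is kept from the last occurrence of its first segment. The method name dedupe_paths and the blank-line guard are additions for illustration:

```ruby
# Keep each path from the last occurrence of its first segment onward.
def dedupe_paths(text)
  text.each_line.map do |line|
    name, path = line.split(' ')
    next if path.nil? # skip blank or name-only lines
    dirs = path.split('/')
    start = dirs.rindex(dirs.first)
    { name: name, hier: dirs[start..-1].join('/') }
  end.compact
end

sample = [
  "name1 top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse",
  "tops  top_bat/car[0]"
].join("\n")
result = dedupe_paths(sample)
# => [{:name=>"name1", :hier=>"top_cat/mouse/dog/elephant/horse"},
#     {:name=>"tops", :hier=>"top_bat/car[0]"}]
```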
