Getting all matches in Ruby with a regex that has named capture groups

Question

I have a string:

s="123--abc,123--abc,123--abc"

I tried to use the named groups feature introduced in Ruby 1.9 to get all the named group information:

/(?<number>\d*)--(?<chars>\s*)/

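Is there an API like Python's findall that returns a collection of MatchData objects? In this case I need two matches returned, since 123 and abc repeat twice. Each MatchData would contain the details of every named capture, so I could use m['number'] to get the matched value.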

Answer 1

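Named captures are only suitable for a single match result.
Ruby's analogue of findall is String#scan. You can either use the scan result as an array or pass a block to it: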

irb> s = "123--abc,123--abc,123--abc"
=> "123--abc,123--abc,123--abc"

irb> s.scan(/(\d*)--([a-z]*)/)
=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]

irb> s.scan(/(\d*)--([a-z]*)/) do |number, chars|
irb*     p [number,chars]
irb> end
["123", "abc"]
["123", "abc"]
["123", "abc"]
=> "123--abc,123--abc,123--abc"
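
To illustrate the first point, here is a minimal sketch. The input string, the pattern and the variable names are my own, adapted from the question. String#match returns a single MatchData, so named access only covers the first hit, while String#scan finds every hit but returns plain positional arrays even when the groups are named.

```ruby
s = "123--abc,456--def"
pattern = /(?<number>\d+)--(?<chars>[a-z]+)/

# String#match stops at the first hit, so named access covers one match only.
first = s.match(pattern)
first[:number]   #=> "123"
first[:chars]    #=> "abc"

# String#scan finds every hit but returns positional arrays, not MatchData.
rows = s.scan(pattern)
rows             #=> [["123", "abc"], ["456", "def"]]
```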

Answer 2

Super late, but here is a simple way to replicate String#scan while getting the MatchData for each match:

matches = []
foo.scan(regex){ matches << $~ }

matches now contains the MatchData objects that correspond to scanning the string.
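
Applied to the string from the question, this might look like the sketch below. The helper name all_matches is hypothetical, and the pattern uses [a-z]* for the chars group as in Answer 1 (the question's own pattern uses \s*). Regexp.last_match returns the same object as $~.

```ruby
# Hypothetical helper: collect one MatchData per match found by scan.
def all_matches(string, regex)
  matches = []
  # Inside the scan block, Regexp.last_match ($~) is the current match.
  string.scan(regex) { matches << Regexp.last_match }
  matches
end

found = all_matches("123--abc,123--abc,123--abc", /(?<number>\d*)--(?<chars>[a-z]*)/)
found.size          #=> 3
found[0][:number]   #=> "123"
found[0]["chars"]   #=> "abc"
```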

Answer 3

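You can extract the variable names used in the regexp with the names method. So what I did is use the regular scan method to get the matches, then zip the names with each match to create a Hash.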

class String
  def scan2(regexp)
    names = regexp.names
    scan(regexp).collect do |match|
      Hash[names.zip(match)]
    end
  end
end

Usage:

>> "aaa http://www.google.com.tr aaa https://www.yahoo.com.tr ddd".scan2 /(?<url>(?<protocol>https?):\/\/[\S]+)/
=> [{"url"=>"http://www.google.com.tr", "protocol"=>"http"}, {"url"=>"https://www.yahoo.com.tr", "protocol"=>"https"}]
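
If you would rather not reopen String, the same idea can be sketched as a standalone method. The name scan_to_hashes is hypothetical, and Array#to_h assumes Ruby 2.1 or later. The names and captures line up because, once a pattern contains named groups, plain unnamed parentheses no longer capture.

```ruby
# Hypothetical helper: one hash of name => capture per match.
def scan_to_hashes(string, regexp)
  names = regexp.names
  string.scan(regexp).map { |captures| names.zip(captures).to_h }
end

text = "aaa http://www.google.com.tr aaa https://www.yahoo.com.tr ddd"
scan_to_hashes(text, /(?<url>(?<protocol>https?):\/\/\S+)/)
#=> [{"url"=>"http://www.google.com.tr", "protocol"=>"http"},
#    {"url"=>"https://www.yahoo.com.tr", "protocol"=>"https"}]
```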

Answer 4

I needed something similar recently. This should work like String#scan, but return an array of MatchData objects instead.

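class String
  # This method will return an array of MatchData's rather than the
  # array of strings returned by the vanilla `scan`.
  def match_all(regex)
    match_datas = []
    pos = 0
    # Search from `pos` in the original string instead of re-matching on
    # post_match, so an empty match cannot loop forever.
    while pos <= length && (md = match(regex, pos))
      match_datas << md
      # Step one character past an empty match so the loop always advances.
      pos = md.end(0) > md.begin(0) ? md.end(0) : md.end(0) + 1
    end
    match_datas
  end
end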

Running the sample data in the REPL gives the following result:

> "123--abc,123--abc,123--abc".match_all(/(?<number>\d*)--(?<chars>[a-z]*)/)
=> [#<MatchData "123--abc" number:"123" chars:"abc">,
    #<MatchData "123--abc" number:"123" chars:"abc">,
    #<MatchData "123--abc" number:"123" chars:"abc">]

You may also find my test code useful:

describe String do
  describe :match_all do
    it "it works like scan, but uses MatchData objects instead of arrays and strings" do
      mds = "ABC-123, DEF-456, GHI-098".match_all(/(?<word>[A-Z]+)-(?<number>[0-9]+)/)
      mds[0][:word].should   == "ABC"
      mds[0][:number].should == "123"
      mds[1][:word].should   == "DEF"
      mds[1][:number].should == "456"
      mds[2][:word].should   == "GHI"
      mds[2][:number].should == "098"
    end
  end
end

Answer 5

@Nakilon is correct in showing scan with a regex, but you don't even need to venture into regex land if you don't want to:

s = "123--abc,123--abc,123--abc"
s.split(',')
#=> ["123--abc", "123--abc", "123--abc"]

s.split(',').inject([]) { |a,s| a << s.split('--'); a }
#=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]

This returns an array of arrays, which is convenient if you have multiple occurrences and need to see or process them all.

s.split(',').inject({}) { |h,s| n,v = s.split('--'); h[n] = v; h }
#=> {"123"=>"abc"}

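This returns a hash which, because the elements all share the same key, holds only one key/value pair. That is good when you have a bunch of duplicate keys and want only the unique ones. The downside shows up if you need every value associated with a key, since a later value overwrites an earlier one, but that seems to be a different question.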
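
If you do need every value kept per key, one possible sketch is to collect the values into arrays. The helper name group_pairs and the sample input are hypothetical; the ',' and '--' separators are the ones from the question.

```ruby
# Hypothetical helper: group the right-hand values under each left-hand key.
def group_pairs(string)
  # The default block gives every new key its own empty array.
  string.split(',').each_with_object(Hash.new { |h, k| h[k] = [] }) do |pair, grouped|
    number, chars = pair.split('--')
    grouped[number] << chars
  end
end

group_pairs("123--abc,123--def,456--ghi")
#=> {"123"=>["abc", "def"], "456"=>["ghi"]}
```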

Answer 6

If you are using Ruby >= 1.9 and named captures, you could:

class String 
  def scan2(regexp2_str, placeholders = {})
    return regexp2_str.to_re(placeholders).match(self)
  end

  def to_re(placeholders = {})
    re2 = self.dup
    separator = placeholders.delete(:SEPARATOR) || '' #Returns and removes separator if :SEPARATOR is set.
    #Search for the pattern placeholders and replace them with the regex
    placeholders.each do |placeholder, regex|
      re2.sub!(separator + placeholder.to_s + separator, "(?<#{placeholder}>#{regex})")
    end    
    return Regexp.new(re2, Regexp::MULTILINE)    #Returns regex using named captures.
  end
end

Usage (Ruby >= 1.9):

> "1234:Kalle".scan2("num4:name", num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">

or

> re="num4:name".to_re(num4:'\d{4}', name:'\w+')
=> /(?<num4>\d{4}):(?<name>\w+)/m

> m=re.match("1234:Kalle")
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
> m[:num4]
=> "1234"
> m[:name]
=> "Kalle"

Using the separator option:

> "1234:Kalle".scan2("#num4#:#name#", SEPARATOR:'#', num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
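
Note that scan2 above calls match, so it returns only the first match. To get every match from a placeholder template, a sketch along these lines might work. The name template_to_regexp and the second record in the sample string are hypothetical, and the sketch assumes no placeholder name is a prefix of another. The placeholders are replaced in a single gsub pass so that text inserted for one placeholder is never rewritten by a later one.

```ruby
# Hypothetical helper: turn "num4:name" plus a hash of patterns into a
# regexp with one named group per placeholder.
def template_to_regexp(template, placeholders)
  keys = Regexp.union(placeholders.keys.map(&:to_s))
  source = template.gsub(keys) { |key| "(?<#{key}>#{placeholders[key.to_sym]})" }
  Regexp.new(source)
end

re = template_to_regexp("num4:name", num4: '\d{4}', name: '\w+')

matches = []
"1234:Kalle,5678:Olle".scan(re) { matches << Regexp.last_match }
matches.map { |m| [m[:num4], m[:name]] }
#=> [["1234", "Kalle"], ["5678", "Olle"]]
```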

Answer 7

I really liked @Umut-Utkan's solution, but it didn't do quite what I wanted, so I rewrote it a bit (note that the code below might not be beautiful, but it seems to work):

class String
  def scan2(regexp)
    names = regexp.names
    # Hash has no #add method, so default every missing key to an empty array.
    captures = Hash.new { |hash, key| hash[key] = [] }
    scan(regexp).each do |match|
      names.zip(match).each do |name, value|
        captures[name.to_sym] << value
      end
    end
    captures
  end
end

Now, if you do

p '12f3g4g5h5h6j7j7j'.scan2(/(?<alpha>[a-zA-Z])(?<digit>[0-9])/)

you get

{:alpha=>["f", "g", "g", "h", "h", "j", "j"], :digit=>["3", "4", "5", "5", "6", "7", "7"]}

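(i.e. all the alpha characters found in one array, and all the digits found in another). Depending on your purpose for scanning, this might be useful. Anyway, I love seeing examples of how easy it is to rewrite or extend core Ruby functionality with just a few lines!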

Answer 8

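A year ago I wanted regular expressions that were easier to read and used named captures, so I made the following addition to String (it probably should not live there, but it was convenient at the time):

scan2.rb: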

class String  
  #Works as scan but stores the result in a hash indexed by variable/constant names (regexp PLACEHOLDERS) within parentheses.
  #Example: Given the (constant) strings BTF, RCVR and SNDR and the regexp /#BTF# (#RCVR#) (#SNDR#)/
  #the matches will be returned in a hash like: match[:RCVR] = <the match> and match[:SNDR] = <the match>
  #Note: The #STRING_VARIABLE_OR_CONST# syntax has to be used. All occurrences of #STRING# will work as #{STRING}
  #but is needed for the method to see the names to be used as indices.
  def scan2(regexp2_str, mark='#')
    regexp              = regexp2_str.to_re(mark)                       #Evaluates the strings. Note: Must be reachable from here!
    hash_indices_array  = regexp2_str.scan(/\(#{mark}(.*?)#{mark}\)/).flatten #Look for string variable names within (#VAR#) or # replaced by <mark>
    match_array         = self.scan(regexp)

    #Save matches in hash indexed by string variable names:
    match_hash = Hash.new
    match_array.flatten.each_with_index do |m, i|
      match_hash[hash_indices_array[i].to_sym] = m
    end
    return match_hash  
  end

  def to_re(mark='#')
    re = /#{mark}(.*?)#{mark}/
    return Regexp.new(self.gsub(re){eval $1}, Regexp::MULTILINE)    #Evaluates the strings, creates RE. Note: Variables must be reachable from here!
  end

end

Usage example (irb 1.9):

> load 'scan2.rb'
> AREA = '\d+'
> PHONE = '\d+'
> NAME = '\w+'
> "1234-567890 Glenn".scan2('(#AREA#)-(#PHONE#) (#NAME#)')
=> {:AREA=>"1234", :PHONE=>"567890", :NAME=>"Glenn"}

Notes:

Of course, it would be more elegant to put the patterns (e.g. AREA, PHONE, ...) in a hash and pass that hash of patterns to scan2 as an argument.
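
The hash-based variant mentioned in the note could be sketched roughly as follows. This is not the author's code: the name scan_with_patterns and its signature are hypothetical. It keeps the (#NAME#) template syntax, looks each name up in the hash instead of calling eval, and turns each placeholder into a named capture group.

```ruby
# Hypothetical eval-free variant: patterns come from a hash, not from constants.
def scan_with_patterns(text, template, patterns, mark = '#')
  quoted = Regexp.escape(mark)
  # Rewrite every (#NAME#) as a named group built from patterns[:NAME].
  source = template.gsub(/\(#{quoted}(.*?)#{quoted}\)/) do
    name = Regexp.last_match(1)
    "(?<#{name}>#{patterns.fetch(name.to_sym)})"
  end
  md = Regexp.new(source, Regexp::MULTILINE).match(text)
  md ? md.names.map(&:to_sym).zip(md.captures).to_h : {}
end

patterns = { AREA: '\d+', PHONE: '\d+', NAME: '\w+' }
scan_with_patterns("1234-567890 Glenn", '(#AREA#)-(#PHONE#) (#NAME#)', patterns)
#=> {:AREA=>"1234", :PHONE=>"567890", :NAME=>"Glenn"}
```

Using fetch means a placeholder that is missing from the hash raises a KeyError instead of silently producing a broken pattern.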

Answer 9

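I like the match_all given by John, but I think it has an error.

The line:

  match_datas << md

works if there are no captures () in the regex.

This code gives the whole matched text, including everything the regex pattern matched or captured (the [0] part of the MatchData). If the regex has captures (), this result is probably not what the user (me) wants in the final output.

I think that when the regex has captures (), the correct code should be:

  match_datas << md[1]

The final output of match_datas is then an array of the captured pattern matches, starting at match_datas[0]. This may not be what is expected if a normal MatchData is wanted, which includes a match_datas[0] value that is the whole matched substring, followed by match_datas[1], match_datas[2], ..., which are the captures (if any) in the regex pattern.

Things get complicated, which may be why match_all is not included natively alongside MatchData.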
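
For reference, here is a short sketch of how a single MatchData is laid out, using the pattern from John's test code. The named_captures call assumes Ruby 2.4 or later.

```ruby
md = "ABC-123".match(/(?<word>[A-Z]+)-(?<number>[0-9]+)/)

md[0]              #=> "ABC-123"  (the whole matched substring)
md[1]              #=> "ABC"      (first capture)
md[:number]        #=> "123"      (capture looked up by name)
md.captures        #=> ["ABC", "123"]
md.named_captures  #=> {"word"=>"ABC", "number"=>"123"}
```

Storing md keeps all of these views available, whereas storing md[1] keeps only the first capture.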

Answer 10

Building on Mark Hubbart's answer, I added the following monkey patch:

class ::Regexp
  def match_all(str)
    matches = []
    str.scan(self) { matches << $~ }

    matches
  end
end

It can be used as /(?<letter>\w)/.match_all('word'), and returns:

[#<MatchData "w" letter:"w">, #<MatchData "o" letter:"o">, #<MatchData "r" letter:"r">, #<MatchData "d" letter:"d">]

As others have said, this relies on using $~ inside the scan block to get the match data.
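
If you want to avoid patching Regexp, a commonly used alternative is to wrap scan in an Enumerator. This is only a sketch: it relies on the same $~ behaviour (read here through Regexp.last_match), and the variable name letters is my own.

```ruby
# to_enum(:scan, ...) yields once per match; the block records the MatchData.
letters = 'word'.to_enum(:scan, /(?<letter>\w)/).map { Regexp.last_match }

letters.map { |m| m[:letter] }   #=> ["w", "o", "r", "d"]
```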

10 answers

Answer 1: score 30, accepted, 2011-01-14 15:29:34
Answer 2: score 21, 2012-12-11 09:53:40
Answer 3: score 9, 2012-02-28 16:08:30
Answer 4: score 2, 2012-08-10 15:28:52
Answer 5: score 2, 2011-01-14 18:19:12
Answer 6: score 2, 2011-02-22 16:31:40
Answer 7: score 1, 2012-04-18 19:38:09
Answer 8: score 1, 2011-02-22 11:53:08
Answer 9: score 0, 2013-08-25 19:09:16
Answer 10: score 0, 2014-12-10 01:40:28
