How to extract repeating character sequences from a string with Ruby regex?

I have a string like "++++001------zx.......?????????xxxxxxx". I would like to extract every run of a repeated character that is longer than one character into a flat array using a Ruby regex:

["++++",
"00",
"------",
".......",
"?????????",
"xxxxxxx"]

I can achieve this with a nested loop:

s="++++001------zx.......?????????xxxxxxx"
t=s.split(//)
i=0
f=[]
while i<=t.length-1 do
  j=i
  part=""
  while t[i]==t[j] do
    part=part+t[j]
    j=j+1
  end
  i=j
  if part.length>=2 then f.push(part) end
end

But I am unable to find an appropriate regex to pass to the scan method. I tried s.scan(/(.)\1++/x), but it only captures the first character of each repeating sequence. Is it possible at all?

This is a bit tricky.

You want to capture any run of more than one of the same character, and a good way to do this is with backreferences. Your solution is close to being correct.

/((.)\2+)/ should do the trick.

Note that if you use scan, this returns two values for each match, one per capture group. The first is the whole sequence, and the second is the single repeated character.
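As a rough sketch of what that looks like in practice (the variable names `pairs` and `runs` are only illustrative, not part of the original answer), scan yields one two-element array per match, and keeping the first element of each gives the flat array asked for:

```ruby
s = "++++001------zx.......?????????xxxxxxx"

# With capture groups, scan returns one array of group values per match:
# group 1 is the whole run, group 2 is the single repeated character.
pairs = s.scan(/((.)\2+)/)
# pairs.first is ["++++", "+"]

# Keep only the outer group to get the flat list of runs.
runs = pairs.map(&:first)
p runs
```

Using `map(&:first)` here is one option among several; `s.scan(/((.)\2+)/).map { |run, _char| run }` should behave the same way.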

str =  "++++001------zx.......?????????xxxxxxx" 
str.chars.chunk{|e| e}.map{|e| e[1].join if e[1].size >1 }.compact
# => ["++++", "00", "------", ".......", "?????????", "xxxxxxx"]

If you only need the overall match values and want to ignore all capturing group values, similar to how String#match works in JavaScript, you can call String#gsub with a single regex argument (no replacement argument). That returns an Enumerator, and .to_a turns it into the array of matches:

text = "++++001------zx.......?????????xxxxxxx" 
p text.gsub(/(.)\1+/m).to_a
# => ["++++", "00", "------", ".......", "?????????", "xxxxxxx"]

See the Ruby demo online and the Rubular demo (note how the matches are highlighted in the Match result field).

I added the m modifier just for completeness, so that . also matches line break characters, which it does not match by default.
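To illustrate the difference, here is a small sketch on a made-up input containing a doubled newline (the sample string and variable names are assumptions for demonstration only):

```ruby
text = "aa\n\nbb"

# Without /m, "." does not match "\n", so the run of newlines is skipped.
without_m = text.gsub(/(.)\1+/).to_a

# With /m, "." also matches "\n", so the doubled newline counts as a run.
with_m = text.gsub(/(.)\1+/m).to_a

p without_m
p with_m
```

For the string in the question, which has no line breaks, both forms should give the same result.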

Also, see the related thread "Capturing groups don't work as expected with Ruby scan method".
