Ruby - Regex Multiple lines

I want to search through some files to check whether they have a comment block at the top of the file.

Here's what I'm searching for:

#++
#    app_name/dir/dir/filename
#    $Id$
#--

I tried this as a regex, but it came up short:

:doc => { :test => '^#--\s+[filename]\s+\$Id'
if @file_text =~ Regexp.new(@rules[rule][:test])
....

Any suggestions?

Check this example:

string = <<EOF
#++
##    app_name/dir/dir/filename
##    $Id$
##--

foo bar
EOF

puts /#\+\+.*\n##.*\n##.*\n##--/.match(string)

The pattern matches two lines starting with ## that sit between a line starting with #++ and a line starting with ##--, and it includes those boundary lines in the match. If I understood the question correctly, this should be what you want.

You can generalize the pattern to match everything between the first #++ and the first ##-- (including them) by making the quantifier non-greedy and adding the /m flag, so that . also matches newlines:

puts /#\+\+.*?##--/m.match(string)
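To see what the /m flag and the non-greedy quantifier each contribute, here is a small sketch. The sample text follows the example above, with an extra ##-- line added at the end purely for illustration; the variable names are made up for this demo.

```ruby
# Sample text shaped like the example above, plus an extra "##--" at the
# end (added for this demo) so the greedy and non-greedy results differ.
text = <<~'EOF'
  #++
  ##    app_name/dir/dir/filename
  ##    $Id$
  ##--

  foo bar
  ##--
EOF

single_line = /#\+\+.*?##--/    # without /m, "." stops at newlines
multi_line  = /#\+\+.*?##--/m   # with /m, "." also matches "\n"
greedy      = /#\+\+.*##--/m    # greedy ".*" runs on to the LAST "##--"

p single_line.match(text)                 # => nil
p multi_line.match(text)[0].lines.size    # => 4 (the header only)
p greedy.match(text)[0].lines.size        # => 7 (header plus trailing text)
```

Note that in Ruby /m means "dot matches newline"; ^ and $ already match at line boundaries without any flag.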

Rather than trying to do it all in a single pattern, which will become difficult to maintain as your file headers change or grow, use several small tests, which give you granularity. I'd do something like:

lines = '#++
#    app_name/dir/dir/filename
#    $Id$
#--
'

Split the text so you can retrieve the lines you want, and normalize them:

l1, l2, l3, l4 = lines.split("\n").map{ |s| s.strip.squeeze(' ') }

This is what they contain now:

[l1, l2, l3, l4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]

Here's a set of tests, one for each line:

!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/]) # => true

Here's what is being tested and what each test returns:

l1[/^#\+\+/] # => "#++"
l2[/^#\s[\w\/]+/] # => "# app_name/dir/dir/filename"
l3[/^#\s\$Id\$/i] # => "# $Id$"
l4[/^#--/] # => "#--"
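The per-line tests above could be wrapped in one helper, roughly like this. The method name header_ok? and the CHECKS table are hypothetical, chosen for illustration; the patterns and the normalization step are the ones shown above.

```ruby
# Hypothetical helper: header_ok? and CHECKS are illustrative names,
# not part of the original answer. The patterns are the four shown above.
CHECKS = [
  /^#\+\+/,        # line 1: opening marker
  /^#\s[\w\/]+/,   # line 2: path to the file
  /^#\s\$Id\$/i,   # line 3: the $Id$ keyword
  /^#--/           # line 4: closing marker
].freeze

def header_ok?(text)
  # Normalize as above: trim each line and collapse runs of spaces.
  lines = text.lines.first(CHECKS.size).map { |s| s.strip.squeeze(' ') }
  return false if lines.size < CHECKS.size

  # Pair each line with its pattern; every pair must match.
  lines.zip(CHECKS).all? { |line, pattern| line.match?(pattern) }
end

good = "#++\n#    app_name/dir/dir/filename\n#    $Id$\n#--\n"
bad  = "#++\n#    app_name/dir/dir/filename\n#--\n"

p header_ok?(good)   # => true
p header_ok?(bad)    # => false
```

Keeping one pattern per line means a header change only touches the matching entry in the table.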

There are many different ways to grab the first "n" lines of a file. Here are a few:

File.foreach('test.txt').to_a[0, 4] # => ["#++\n", "#    app_name/dir/dir/filename\n", "#    $Id$\n", "#--\n"]
File.readlines('test.txt')[0, 4]    # => ["#++\n", "#    app_name/dir/dir/filename\n", "#    $Id$\n", "#--\n"]
File.read('test.txt').split("\n")[0, 4] # => ["#++", "#    app_name/dir/dir/filename", "#    $Id$", "#--"]

The downside of these is that they all "slurp" the input file, which will cause problems on a huge file. It's trivial to write a piece of code that opens a file, reads the first four lines, and returns them in an array. This is untested but looks about right:

def get_four_lines(path)

  ary = []

  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end

  ary

end
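One caveat with readline is that it raises EOFError when a file has fewer than four lines. A possible variant, sketched here under the assumption that short files should simply return fewer lines, uses File.foreach without a block, which returns an Enumerator, so first(n) stops reading after n lines. The name first_lines is hypothetical.

```ruby
require 'tempfile'

# Hypothetical variant of get_four_lines. File.foreach without a block
# returns an Enumerator, and first(count) stops after count lines, so the
# rest of the file is never read and short files do not raise.
def first_lines(path, count = 4)
  File.foreach(path).first(count)
end

# Demonstrate on a throwaway file that has only two lines.
Tempfile.create('header') do |f|
  f.write("#++\n#--\n")
  f.flush
  p first_lines(f.path)   # => ["#++\n", "#--\n"]
end
```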

Here's a quick little benchmark to show why I'd go this way:

require 'fruity'

def slurp_file(path)
  File.read(path).split("\n")[0,4] rescue []
end

def read_first_four_from_file(path)
  ary = []

  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  ary
rescue
  []
end

PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || File.directory?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }

  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end

Running that as root outputs:

Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0

That's reading approximately 105 files in my /etc directory.

Modifying the test to actually parse the lines and return true/false:

require 'fruity'

def slurp_file(path)
  ary = File.read(path).split("\n")[0,4] 
  !!(/#\+\+\n(.|\n)*?##\-\-/.match(ary.join("\n")))
rescue
  false # return a consistent value to fruity
end

def read_first_four_from_file(path)
  ary = []

  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
  false # return a consistent value to fruity
end

PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || File.directory?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }

  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end

Running that again returns:

Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0

Your benchmark isn't fair.

Here's one that's "fair":

require 'fruity'

def slurp_file(path)
  text = File.read(path)
  !!(/#\+\+\n(.|\n)*?##\-\-/.match(text))
rescue
  false # return a consistent value to fruity
end

def read_first_four_from_file(path)
  ary = []

  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
  false # return a consistent value to fruity
end

PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || File.directory?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }

  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end

Which outputs:

Running each test once. Test will take about 1 second.
read_four is similar to slurp

Joining the split strings back into a longer string prior to doing the match was the wrong path, so working from the full file's content is a more even test.

[...] Just read the first four lines and apply the pattern, that's it

That's not just it. A multiline regex written to find information spanning multiple lines can't be passed single lines of text and return accurate results, so it needs to be given one long string. Determining how many characters make up four lines would only add overhead and slow the algorithm; that's what the previous benchmark did, and it wasn't "fair".
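For reference, the commenter's suggestion (read only the first four lines, then apply one multi-line pattern to them) could look roughly like the sketch below. It is not taken from either benchmark: the helper name header_in_first_four? and the HEADER pattern, adapted to the single-# header from the question, are assumptions.

```ruby
require 'tempfile'

# Assumed pattern for the header in the question: "#++" on the first
# line, then anything up to a line that is exactly "#--".
HEADER = /\A#\+\+\n.*?^#--$/m

# Hypothetical helper, not part of the benchmarks above.
def header_in_first_four?(path)
  # first(4) stops reading after four lines; join rebuilds a single
  # string so the multi-line pattern can span them.
  head = File.foreach(path).first(4).join
  head.match?(HEADER)
end

Tempfile.create('src') do |f|
  f.write("#++\n#    app_name/dir/dir/filename\n#    $Id$\n#--\nputs 1\n")
  f.flush
  p header_in_first_four?(f.path)   # => true
end
```

Whether the extra join costs more than the four separate per-line tests is something only a benchmark on your own files can settle.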

Depends on your input data. If you run this code over a complete (bigger) source code folder, it will slow down significantly.

There were 105+ files in the directory. That's a reasonably large number of files, but iterating over even more files will not show a difference: Ruby's ability to open files isn't the issue, it's the I/O speed of reading a file in one pass vs. line by line. And, from experience, I know that line-by-line I/O is fast. Again, a benchmark says:

require 'fruity'

LITTLEFILE = 'little.txt'
MEDIUMFILE = 'medium.txt'
BIGFILE = 'big.txt'

LINES = '#++
#    app_name/dir/dir/filename
#    $Id$
#--
'

LITTLEFILE_MULTIPLIER = 1
MEDIUMFILE_MULTIPLIER = 1_000
BIGFILE_MULTIPLIER = 100_000

File.write(BIGFILE, LINES * BIGFILE_MULTIPLIER)

def _slurp_file(path)
  File.read(path)
  true # return a consistent value to fruity
end

def _read_first_four_from_file(path)
  ary = []

  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  true # return a consistent value to fruity
end

[
  [LITTLEFILE, LITTLEFILE_MULTIPLIER],
  [MEDIUMFILE, MEDIUMFILE_MULTIPLIER],
  [BIGFILE,    BIGFILE_MULTIPLIER]
].each do |file, mult|

  File.write(file, LINES * mult)
  puts "Benchmarking against #{ file }"
  puts "%s is %d bytes" % [ file, File.size(file)]

  compare do
    slurp                     { _slurp_file(file)                }
    read_first_four_from_file { _read_first_four_from_file(file) }
  end

  puts
end

With the output:

Benchmarking against little.txt
little.txt is 49 bytes
Running each test 128 times. Test will take about 1 second.
slurp is similar to read_first_four_from_file

Benchmarking against medium.txt
medium.txt is 49000 bytes
Running each test 128 times. Test will take about 1 second.
read_first_four_from_file is faster than slurp by 39.99999999999999% ± 10.0%

Benchmarking against big.txt
big.txt is 4900000 bytes
Running each test 128 times. Test will take about 4 seconds.
read_first_four_from_file is faster than slurp by 100x ± 10.0

When reading a small file of four lines, slurping it with read is as fast as reading it line by line, but once the file size increases, the overhead of reading the entire file starts to affect the times.

Any solution relying on slurping files is known to be a bad thing; it's not scalable, and it can actually cause code to halt due to memory allocation if BIG files are encountered. Reading the first four lines will always run at a consistent speed independent of file size, so use that technique EVERY time there is a chance that the file sizes will vary. Or, at least, be very aware of the impact on run times and the potential problems that slurping files can cause.

You might want to try the following pattern: \\#\\+{2}(?:.|[\\r\\n])*?\\#\\-{2}

Working demo @ regex101
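The doubled backslashes look like string-literal escaping. Assuming that is the case, the same pattern written as a Ruby regex literal takes single backslashes. This is a minimal sketch run against the header from the question; the variable names are illustrative.

```ruby
# The suggested pattern as a Ruby regex literal (single backslashes).
# (?:.|[\r\n]) stands in for "any character, including line breaks",
# so the /m flag is not needed.
pattern = /\#\+{2}(?:.|[\r\n])*?\#\-{2}/

text = "#++\n#    app_name/dir/dir/filename\n#    $Id$\n#--\n\nfoo bar\n"

# Prints the four header lines, from "#++" through the first "#--".
puts pattern.match(text)[0]
```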
