
Improve a regular expression in Ruby code

I have the following script:

file = File.new("jobs.txt", 'r')
h = {}
jobs = []
salaries= []
while (line = file.gets)
    if (line =~ /CODE/)
        jobs << line.gsub("\n", "")
    elsif (line =~ /SALARY/)
        salaries << line.gsub("\n", "")
    end
end
h = Hash[jobs.zip(salaries)]
h.each { |code, salary| puts "#{code} ------ #{salary}" }

It gets the job done, but I want the regex to ignore the /CODE/ it matches and return the rest of the line. Is it possible to do this with the regex alone, or do I have to code it myself (replacing strings or something like that)?

I am mostly trying to figure out how to make the code as small as possible.

Your code isn't very idiomatic. This is untested but looks about right:

salarios = []
cargos = []
File.foreach("jobs.txt") do |line|
  if (line =~ /CODE/)
    cargos << line[/CODE - (\S+)/, 1]
  elsif (line =~ /SALARY/)
    salarios << line[/SALARY - (\S+)/, 1]
  end
end
h = Hash[cargos.zip(salarios)]
h.each { |code, salary| puts "#{code} ------ #{salary}" }

line[/CODE - (\S+)/, 1] takes advantage of String's [] method, which accepts several different kinds of arguments. In this case I'm using a regex pattern with a capture group. The 1 tells Ruby to return the first capture in the pattern:

'CODE - XXXXX'[/CODE - (\S+)/, 1] # => "XXXXX"

\S+ means "one or more non-whitespace characters", so the pattern is basically saying "find 'CODE - ', then capture the following characters until a space, tab, line feed, or carriage return is found".
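One detail worth checking is what String#[] returns when the pattern does not match. The sketch below uses made-up sample lines (the real layout of jobs.txt is not shown in the question) and assumes the "CODE - value" format used above:

```ruby
# Hypothetical sample lines; the actual contents of jobs.txt are an assumption.
samples = ["CODE - A123 extra text\n", "SALARY - 4500\n", "unrelated line\n"]

# String#[] with a regex and a capture index returns the capture,
# or nil when the pattern does not match the string.
codes = samples.map { |line| line[/CODE - (\S+)/, 1] }

# compact drops the nils left by non-matching lines.
extracted = codes.compact

p codes
p extracted
```

Because a miss yields nil rather than an exception, guarding with a preceding =~ test is optional as long as nil values are filtered out afterwards.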

An alternate way to find and capture the values is to take advantage of the "magic" variables Ruby sets when a pattern matches and contains captures:

if ( line =~ /CODE - (\S+)/)
  cargos << $1
elsif (line =~ /SALARY - (\S+)/)
  salarios << $1
end

Here's a bit of proof:

'CODE - XXXXX' =~ /CODE - (\S+)/
$1 # => "XXXXX"

Some people don't like using the Regexp magic variables. As long as you use them immediately, before anything else has a chance to run another regex match, you'll be OK. If another match occurs first, the variables are overwritten and you'll have a bug.
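To make the overwrite risk concrete, here is a small sketch. The second match and the named capture "code" are illustrative choices, not part of the original script:

```ruby
line = "CODE - XXXXX"

line =~ /CODE - (\S+)/
"SALARY - 1000" =~ /SALARY - (\S+)/   # a second match runs before $1 is read
overwritten = $1                       # holds the capture from the second match

# Keeping the MatchData in a local variable avoids the shared state.
# String#match returns nil on a miss, hence the guard.
m = line.match(/CODE - (?<code>\S+)/)
code = m && m[:code]

p overwritten
p code
```

Reading from a MatchData object costs a line or two more than $1, but the result cannot be clobbered by a later match.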

Back to your code. Use foreach with a block to read the lines from the file instead of opening it and assigning the handle to a variable. Ruby automatically closes the file when the block exits.
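Putting the pieces together, a compact variant might look like the sketch below. It writes a temporary file as a stand-in for jobs.txt and assumes that "CODE - x" and "SALARY - y" lines alternate; both the sample data and that layout are assumptions:

```ruby
require "tempfile"

# Hypothetical stand-in for jobs.txt.
sample = Tempfile.new("jobs")
sample.write("CODE - A1\nSALARY - 1000\nnoise\nCODE - B2\nSALARY - 2000\n")
sample.close

jobs = []
salaries = []
# File.foreach opens the file, yields each line, and closes it afterwards.
File.foreach(sample.path) do |line|
  case line
  when /CODE - (\S+)/   then jobs << $1      # when uses Regexp#===, which sets $1
  when /SALARY - (\S+)/ then salaries << $1
  end
end

h = jobs.zip(salaries).to_h
h.each { |code, salary| puts "#{code} ------ #{salary}" }

sample.unlink   # remove the temporary file
```

The case statement reads $1 on the same line that performed the match, so nothing else has a chance to overwrite it.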

If the input is something like CODE - XXXXX SALARY - XXXX OTHER INFO - XXXX, this extracts the values:

cargos << $1 if (line =~ /CODE\D+(\d+)/)
salarios << $1 if (line =~ /SALARY\D+(\d+)/)

Here the regular expression matches CODE followed by at least one non-digit (\D+), followed by the captured digits (\d+), which are meant to represent the code.
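As a quick illustration, the sketch below applies both patterns to one made-up line in that layout, assuming the values are numeric. It also shows a caveat: \D+ will happily skip over other labels when the code itself is missing.

```ruby
# Hypothetical line; numeric values are an assumption.
line = "CODE - 12345 SALARY - 6789 OTHER INFO - 42"

code   = line[/CODE\D+(\d+)/, 1]
salary = line[/SALARY\D+(\d+)/, 1]

# Caveat: with no digits after CODE, \D+ runs past " - SALARY - "
# and the capture picks up the salary digits instead.
stray = "CODE - SALARY - 6789"[/CODE\D+(\d+)/, 1]

p code
p salary
p stray
```

If missing values are possible in the data, a stricter separator such as " - " in place of \D+ would avoid the stray capture.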

Hope it helps.

You need to be careful about searching for "CODE" and "SALARY" separately. If there are problems with the data (e.g., "...CODE...CODE....SALARY..."), you may never know it: when jobs and salaries end up with different sizes, zip does not raise an exception but silently pads with nil or drops the extra elements, so codes can be paired with the wrong salaries.
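A quick check of how Array#zip and Array#transpose treat arrays of different lengths may help here. The data below is made up for illustration:

```ruby
jobs     = ["CODE - A1", "CODE - B2", "CODE - C3"]
salaries = ["SALARY - 1000", "SALARY - 2000"]

# zip follows the receiver's length: it pads with nil when the argument
# is shorter and ignores extra elements when the argument is longer.
padded    = jobs.zip(salaries)
truncated = salaries.zip(jobs)

# transpose, by contrast, raises IndexError on rows of different lengths.
ragged_error = begin
  [["CODE", "SALARY"], ["CODE"]].transpose
rescue IndexError => e
  e.class
end

p padded
p truncated
p ragged_error
```

Neither zip result signals that a salary is missing, which is why validating the data before pairing is worthwhile.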

Here's how I would do it. The method returns the desired result if the data is OK, else nil.

Code

def doit(lines)
  a = lines.select { |s| s =~ /CODE|SALARY/ }
  return nil unless a.size.even?
  jobs, salaries = a.each_slice(2).to_a.transpose
  return nil unless jobs.all?     { |l| l.scan(/CODE|SALARY/) == ["CODE"]   }
  return nil unless salaries.all? { |l| l.scan(/CODE|SALARY/) == ["SALARY"] }
  jobs.zip salaries
end

Examples

text =<<-_
CODE and
SALARY are good, but
CODE without
any
SALARY is not
so good
_

doit(text.split("\n"))
  #=> [["CODE and", "SALARY are good, but"], ["CODE without", "SALARY is not"]]

text =<<-_
CODE and
SALARY are good, but
SALARY without
CODE is even
better
_

doit(text.split("\n"))
  #=> nil

text =<<-_
CODE and
SALARY are good, but
CODE is the main thing
_

doit(text.split("\n"))
  #=> nil

Explanation

  • lines.select { |s| s =~ /CODE|SALARY/ } pulls out all the lines that contain the word "CODE" or "SALARY".
  • each_slice(2).to_a pairs the selected lines and then converts the enumerator to an array with two columns.
  • transpose extracts the columns of the array. The first column should be the "CODE" lines; the second column, the "SALARY" lines.
  • jobs.all? { |l| l.scan(/CODE|SALARY/) == ["CODE"] } ensures that each "CODE" line contains "CODE" exactly once and does not contain "SALARY". The next line of code does the opposite for the "SALARY" lines.
  • The last line returns the desired result.
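The intermediate values can be traced with the lines from the first example. This is only a walk-through of the steps listed above; the variable names selected and pairs are introduced here for illustration:

```ruby
lines = ["CODE and", "SALARY are good, but", "CODE without",
         "any", "SALARY is not", "so good"]

# Step 1: keep only the lines mentioning CODE or SALARY.
selected = lines.select { |s| s =~ /CODE|SALARY/ }

# Step 2: group consecutive lines into pairs.
pairs = selected.each_slice(2).to_a

# Step 3: transpose turns the rows of pairs into two columns.
jobs, salaries = pairs.transpose

# Step 4: scan lists every keyword occurrence in a line, in order.
keywords = jobs.map { |l| l.scan(/CODE|SALARY/) }

p selected
p pairs
p jobs
p salaries
p keywords
```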
