
Replace every occurrence of 'line 2' with line_2 using a regex

I'm parsing some text from an XML file that contains sentences like "Subtract line 4 from line 1." and "Enter the amount from line 5". I want to replace every occurrence of "line N" with "line_N", e.g.

Subtract line 4 from line 1 --> Subtract line_4 from line_1

There are also sentences like "Are the amounts on lines 4 and 8 the same?" and "Skip lines 9 through 12; go to line 13." I want these to become "Are the amounts on line_4 and line_8 the same?" and "Skip line_9 through line_12; go to line_13."

Here's a working implementation with an RSpec test. You call it like this: output = LineIdentifier[input]. To test, install the rspec gem and run spec file.rb (RSpec 2 and later use the rspec command and require 'rspec' instead).

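require 'spec'

class LineIdentifier
  def self.[](input)
    # Single references: "line 4" -> "line_4" ("lines 4" is left for the next step)
    output = input.gsub(/line (\d+)/, 'line_\1')
    # Pairs and ranges: "lines 4 and 8" -> "line_4 and line_8"
    output.gsub(/lines (\d+) (and|from|through) (line )?(\d+)/, 'line_\1 \2 line_\4')
  end
end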

describe "LineIdentifier" do
  it "should identify line mentions" do
    examples = { 
      #Input                                         Output
     'Subtract line 4 from line 1.'               => 'Subtract line_4 from line_1.',
     'Enter the amount from line 5'               => 'Enter the amount from line_5',
     'Subtract line 4 from line 1'                => 'Subtract line_4 from line_1',
    }
    examples.each do |input, output|
      LineIdentifier[input].should == output
    end
  end
  it "should identify line ranges" do
    examples = { 
      #Input                                         Output
     'Are the amounts on lines 4 and 8 the same?' => 'Are the amounts on line_4 and line_8 the same?',
     'Skip lines 9 through 12; go to line 13.'    => 'Skip line_9 through line_12; go to line_13.',
    }
    examples.each do |input, output|
      LineIdentifier[input].should == output
    end
  end
end
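The two substitutions above cover single references and two-number phrases, but not longer lists such as "lines 2, 3, and 4". One possible way to cover those in a single pass is to match the whole reference and rewrite each number inside it using the block form of gsub. This is only a sketch: the names LINE_REF and tag_line_refs are hypothetical, and the set of separators (", ", ", and ", " and ", " through ") is an assumption based on the sample sentences in the question.

```ruby
# Hypothetical pattern: "line"/"lines", a number with an optional letter suffix,
# then any further numbers joined by one of the assumed separators.
LINE_REF = /\blines? (\d+[a-z]?(?:(?:, and |, | and | through )\d+[a-z]?)*)/

# Hypothetical helper, not part of the answer above.
def tag_line_refs(text)
  text.gsub(LINE_REF) do
    # $1 holds the numbers and their separators, e.g. "2, 3, and 4";
    # prefix every number in it with "line_" and drop the leading keyword.
    $1.gsub(/\d+[a-z]?/) { |num| "line_#{num}" }
  end
end

puts tag_line_refs('Add lines 2, 3, and 4')
# prints: Add line_2, line_3, and line_4
puts tag_line_refs('Skip lines 9 through 12; go to line 13.')
# prints: Skip line_9 through line_12; go to line_13.
```

A number that follows " and " or ", " is always treated as a line number here, so a sentence like "line 4 and 20 dollars" would be tagged incorrectly.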

This works for the specific examples, including the ones in the OP's comments. As is often the case when using regexes for parsing, it becomes a hodge-podge of extra cases and tests to handle an ever-growing set of known inputs. Lists of line numbers are handled by a while loop that re-applies the substitution until nothing more matches, tagging one more number of each list on every pass. As written, it processes the input line by line. To catch a series of line numbers that runs across a line boundary, it would need to read the input as one chunk and match across lines.

File.open( ARGV[0], "r" ) do |file|
  while ( line = file.gets )
    # replace both "line ddd" and "lines ddd" with line_ddd
    line.gsub!( /(lines?\s)(\d+)/, 'line_\2' )
    # Extend the tag to the next number in a list ("and", "through" or ", ");
    # gsub! returns nil when nothing was replaced, which ends the loop
    while line.gsub!( /(line_\d+[a-z]?,?)(\sand\s|\sthrough\s|,\s)(\d+)/, '\1\2line_\3' )
    end
    puts line
  end
end

Sample data. For this input:

Subtract line 4 from line 1.
Enter the amount from line 5
on lines 4 and 8 the same?
Skip lines 9 through 12; go to line 13.
... on line 10 Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13
Add lines 2, 3, and 4

It produces this output:

Subtract line_4 from line_1.
Enter the amount from line_5
on line_4 and line_8 the same?
Skip line_9 through line_12; go to line_13.
... on line_10 Form 1040A, line_7, line_8a, line_9a, line_10, line_11b, line_12b, and line_13
Add line_2, line_3, and line_4
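The whole-chunk variant mentioned above could look roughly like the following. Since \s already matches a newline, the same two patterns can be applied to the entire text instead of to each line. The helper name tag_whole_text is hypothetical, and this is a sketch rather than a tested replacement for the loop version.

```ruby
# Hypothetical helper; the name tag_whole_text is illustrative.
def tag_whole_text(text)
  # Same first step as the loop version: "line 4" / "lines 4" -> "line_4".
  result = text.gsub(/(lines?\s)(\d+)/, 'line_\2')
  # Same list step. \s matches "\n" too, so a list may continue on the next line.
  # gsub! returns nil once nothing is left to replace, which ends the loop.
  while result.gsub!(/(line_\d+[a-z]?,?)(\sand\s|\sthrough\s|,\s)(\d+)/, '\1\2line_\3')
  end
  result
end

sample = "Add lines 2, 3,\nand 4\n"
puts tag_whole_text(sample)
# prints:
# Add line_2, line_3,
# and line_4
```

For a file this would be called as puts tag_whole_text(File.read(ARGV[0])). One caveat: if the line break falls between "lines" and the first number, the first substitution replaces that newline with the underscore and so joins the two lines.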

sed is your friend:

lines.sed:

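#!/bin/sed -rf
# "line 4" or "lines 4" -> "line_4"
s/lines? ([0-9]+)/line_\1/g
# any remaining standalone number, with an optional letter suffix such as 8a -> line_N
s/\b([0-9]+[a-z]?)\b/line_\1/g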

lines.txt:

Subtract line 4 from line 1.
Enter the amount from line 5
Are the amounts on lines 4 and 8 the same?
Skip lines 9 through 12; go to line 13.
Enter the total of the amounts from Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13
Add lines 2, 3, and 4

Demo:

$ cat lines.txt | ./lines.sed
Subtract line_4 from line_1.
Enter the amount from line_5
Are the amounts on line_4 and line_8 the same?
Skip line_9 through line_12; go to line_13.
Enter the total of the amounts from Form 1040A, line_7, line_8a, line_9a, line_10, line_11b, line_12b, and line_13
Add line_2, line_3, and line_4

You can also turn this into a sed one-liner if you prefer, although the script file is easier to maintain.
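For comparison with the Ruby answers, the same two rules can be sketched in Ruby. The function name sed_style is hypothetical. The sketch also makes the main limitation of the second rule visible: it tags every standalone number, whether or not it refers to a line.

```ruby
# Hypothetical Ruby port of the two sed rules above.
def sed_style(text)
  # Rule 1: "line 4" or "lines 4" -> "line_4"
  tagged = text.gsub(/lines? (\d+)/, 'line_\1')
  # Rule 2: any remaining standalone number (optional letter suffix) -> "line_N".
  # "_" is a word character, so the "4" in "line_4" is not matched again.
  tagged.gsub(/\b(\d+[a-z]?)\b/, 'line_\1')
end

puts sed_style('Add lines 2, 3, and 4')
# prints: Add line_2, line_3, and line_4
puts sed_style('Enter 100 on line 5')
# prints: Enter line_100 on line_5   (the amount 100 is tagged as well)
```

This is acceptable when, as in the sample data, every bare number is a line number; otherwise the list-aware approaches are safer.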


 