
How do I write a regular expression that can match one or two lines of text?

I'm trying to match some text that can span one or two lines, and I'd like to handle both cases efficiently. The text is consistently formatted and contains several tabs. I'm doing the matching in Ruby. The text follows:

Single line:

#3  Hello Stormy    Scratched - Reason Unavailable                           11:10 AM ET 

Two-line:

#3  Hello Stormy    Scratched - Reason Unavailable                            11:10 AM ET   
                    Scratch Reason - Reason Unavailable changed to Trainer     2:19 PM ET  

I've had to use spaces to format the strings here, but the actual text uses tabs to separate the various sections: number and name, Scratched and reason and time.

Sample output:

One line: #3 Hello Stormy Scratched - Reason Unavailable 11:10AM ET

Two line: #3 Hello Stormy Scratched - Reason Unavailable changed to Trainer 2:19PM

Note: Ideally the two line output would include the number and name from the first line.

I'm able to build an expression that matches the various sections, but the tabs, the second line, and the requirement to include the number and horse name in the two-line output are giving me trouble.

You don't need a fancy regular expression to do what you want; you just need to know how to go about it.

Ruby's Enumerable has a method called slice_before that takes a regular expression and starts a new group at every element that matches it. Array gets the method from Enumerable. For instance:

text = '#3  Hello Stormy    Scratched   -   Reason Unavailable          11:10 AM ET
#3  Hello Stormy    Scratched   -   Reason Unavailable          11:10 AM ET
                        Scratch Reason  -   Reason Unavailable changed to Trainer   2:19 PM ET
'

data = text.split("\n").slice_before(/\A\S/).to_a

require 'pp'
pp data

Outputs:

[["#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET"],
["#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET",
  "\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET"]]

In other words, the array created by splitting the text on "\n" is cut into groups, and a new group begins at every line that does not start with whitespace, which is what the pattern /\A\S/ matches. Each single line ends up in its own sub-array. Lines that are continuations of the previous line are grouped with that line.

If you're reading a file from disk, you can use IO.readlines to read the file as an array, avoiding the need to split the file.
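As a minimal sketch of the file-based variant: the snippet below writes the sample lines to a temporary file (a stand-in for your real input, and an assumption of this example), reads them back with IO.readlines, and groups them the same way. The trailing newlines are removed with chomp so the groups look like the ones above.

```ruby
require 'tempfile'

# Sample data from the question, with real tabs between the sections.
sample = "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
         "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
         "\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET\n"

# The temporary file only stands in for whatever file you actually read.
file = Tempfile.new('scratches')
file.write(sample)
file.close

# readlines returns one element per line, each still ending in "\n".
lines  = IO.readlines(file.path).map(&:chomp)
groups = lines.slice_before(/\A\S/).to_a
file.unlink

puts groups.length      # 2
puts groups[1].length   # 2
```

The second group holds the record line plus its continuation line, exactly as in the split("\n") version.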

You can process that array further, if you want, to reconstruct the lines and continuation lines, using something like:

data = text.split("\n").slice_before(/\A\S/).map{ |i| i.join("\n") }

Which turns data into:

["#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET",
"#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET"]

If you need to split each line into its component fields, use split("\t"). How to do that across the sub-arrays is left as an exercise for you, but I'd involve map.
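One possible way to do that exercise, as a rough sketch: it assumes each record line holds the number, the name, the reason and the time, separated by one or more tabs, and that a continuation line holds only a reason and a time. It takes the number and name from the first line of each group and the reason and time from the last line. The field layout is my reading of the question, so adjust the indexes if your data differs.

```ruby
text = "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
       "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
       "\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET\n"

# Split a line on runs of tabs and drop empty or blank fields.
fields_of = lambda do |line|
  line.split(/\t+/).map(&:strip).reject(&:empty?)
end

records = text.split("\n").slice_before(/\A\S/).map do |group|
  first = fields_of.call(group.first)  # always carries number and name
  last  = fields_of.call(group.last)   # same line as first for one-liners
  number, name = first[0], first[1]
  reason, time = last[-2], last[-1]
  [number, name, reason, time].join(' ')
end

puts records
# #3 Hello Stormy Scratched - Reason Unavailable 11:10 AM ET
# #3 Hello Stormy Scratch Reason - Reason Unavailable changed to Trainer 2:19 PM ET
```

For a single-line record, group.first and group.last are the same line, so both cases go through the same code path.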


EDIT:

...I like your solution, but I'm getting undefined method for slice_before.

Try this:

require 'pp'
require 'rubygems'

class Array

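  # Only define the fallback when Array instances lack slice_before.
  # (Array.respond_to? would ask the class object, not its instances.)
  unless method_defined?(:slice_before)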
    def slice_before(pat)
      result = []
      temp_result = []
      self.each do |i|

        if (temp_result.empty?)
          temp_result << i
          next
        end

        if i[pat]
          result << temp_result
          temp_result = []
        end

        temp_result << i
      end
      result << temp_result

    end
  end

end

Calling that:

ary = [
  '#3  Hello Stormy    Scratched - Reason Unavailable                           11:10 AM ET',
  '#3  Hello Stormy    Scratched - Reason Unavailable                            11:10 AM ET',
  '                    Scratch Reason - Reason Unavailable changed to Trainer     2:19 PM ET',
]

pp ary.slice_before(/\A\S/)

Looks like:

[
  ["#3  Hello Stormy    Scratched - Reason Unavailable                           11:10 AM ET"],
  ["#3  Hello Stormy    Scratched - Reason Unavailable                            11:10 AM ET",
   "                    Scratch Reason - Reason Unavailable changed to Trainer     2:19 PM ET"]
]

It gets much simpler if you can assume that the '#' character does not appear anywhere else in the string. Then something like this should do it:

 /^#[^#]*/m

Another more generic approach is to match first line starting with #, and any lines after that starting with space or tab:

 /^#.*?$(\n^[ \t].*?$)*/m

If the line doesn't always start with #, you could replace it with [^ \t] (not space or tab).
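To show how the second idea could be applied, here is a small sketch using String#scan. It uses a slightly simplified variant of the pattern above rather than the exact expression: in Ruby, ^ always matches at the start of a line and . does not match a newline unless /m is given, so the lazy quantifiers and the /m flag can be dropped. The sample text is the question's data with real tabs.

```ruby
text = "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
       "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
       "\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET\n"

# A line starting with '#', followed by any number of continuation
# lines that start with a space or a tab.
pattern = /^#.*(?:\n[ \t].*)*/

# With no capture groups, scan returns the full text of every match.
records = text.scan(pattern)

puts records.length          # 2
puts records[1].lines.length # 2
```

Each element of records is one complete record, with its continuation line (if any) still attached by a newline.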

Fun with REs! This is hacky, but there's a few different types of matching strategies in there.

# Two-line example
s = <<-EOS
  #3\tHello Stormy\t\tScratched - Reason Unavailable\t\t\t11:10 AM ET\t
  \t\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET
EOS
# allow leading/trailing whitespace, get the number, name, last reason and time
s =~ /\A\s*(#\d)\t+([^\t]+)(?:\t+.*)?(?:\t+(.*))\t+(\d+:\d+ (?:AM|PM) ET)\s*\Z/m
# ["#3", "Hello Stormy", "Scratch Reason - Reason Unavailable changed to Trainer", "2:19 PM ET"]
a = $1, $2, $3, $4

Note: this assumes only one message in the string that you're matching
Note: not tested for the single-line case :)
