I'm trying to match some text that can span one or two lines, and I'd like to handle both scenarios efficiently. The text is consistently formatted and contains several tabs. I'm doing the matching in Ruby. The text follows:
Single line:
#3 Hello Stormy Scratched - Reason Unavailable 11:10 AM ET
Two-line:
#3 Hello Stormy Scratched - Reason Unavailable 11:10 AM ET
Scratch Reason - Reason Unavailable changed to Trainer 2:19 PM ET
I've had to use spaces to format the strings here, but the actual text uses tabs to separate the various sections: number and name, Scratched and reason and time.
Sample output:
One line: #3 Hello Stormy Scratched - Reason Unavailable 11:10AM ET
Two line: #3 Hello Stormy Scratched - Reason Unavailable changed to Trainer 2:19PM
Note: Ideally the two line output would include the number and name from the first line.
I'm able to build an expression that matches various sections, but the tabs, the second line, and the requirement to have the number and horse name on the two-line output are giving me trouble.
You don't need a fancy regular expression to do what you want; you just need to know how to go about it.
Ruby's Enumerable has a method called slice_before that takes a regular expression, which it uses to decide where each new group of elements begins. Array gets that method by including Enumerable. For instance:
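# Tabs are written as \t escapes so the separators survive copy and paste.
text = "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
       "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
       "\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET\n"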
data = text.split("\n").slice_before(/\A\S/).to_a
require 'pp'
pp data
Outputs:
[["#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET"],
["#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET",
"\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET"]]
In other words, the array created by splitting the text on "\n" is sliced so that a new group begins at each line that does not start with whitespace, which is what the pattern /\A\S/ matches. All single lines are in separate sub-arrays. Lines that are continuations of the previous line are grouped with that line.
If you're reading a file from disk, you can use IO.readlines to read the file as an array of lines, avoiding the need to split the text yourself. Note that each line it returns keeps its trailing newline.
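As a rough sketch of the file-based variant, the example below writes the sample text to a temporary file and reads it back with IO.readlines. The temporary file and the variable names (sample, groups) are assumptions made for illustration only; in practice you would pass your own file path.

```ruby
require 'tempfile'

# Hypothetical sample data, written to a temp file only so the sketch runs on its own.
sample = "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
         "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
         "\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET\n"

groups = Tempfile.create('scratches') do |file|
  file.write(sample)
  file.flush
  # readlines keeps the trailing "\n" on each line, so chomp it before grouping.
  IO.readlines(file.path).map(&:chomp).slice_before(/\A\S/).to_a
end
```

Here groups should hold one sub-array for the single-line record and one two-element sub-array for the record with a continuation line.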
You can process that array further, if you want, to reconstruct the lines and continuation lines, using something like:
data = text.split("\n").slice_before(/\A\S/).map{ |i| i.join("\n") }
Which turns data into:
["#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET",
"#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET"]
If you need to split each line into its component fields, use split("\t"). How to do that across the sub-arrays is left as an exercise for you, but I'd involve map.
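For what it's worth, one possible way to do that exercise is sketched below. It assumes the tab layout shown in the output above; the names records and summaries, and the choice to take the number and name from the first line and the reason and time from the last line, are illustrative guesses at what the question asks for, not the only reading.

```ruby
text = "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
       "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
       "\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET\n"

# Group each record with its continuation lines, then split every line on tabs.
# reject(&:empty?) drops the empty strings produced by runs of consecutive tabs.
records = text.split("\n").slice_before(/\A\S/).map do |lines|
  lines.map { |line| line.split("\t").reject(&:empty?) }
end

# Hypothetical summary: number and name come from the first line of a group,
# reason and time from its last line (which is the same line for single-line records).
summaries = records.map do |fields_per_line|
  number, name = fields_per_line.first.first(2)
  reason, time = fields_per_line.last.last(2)
  [number, name, "Scratched -", reason, time].join(" ")
end

puts summaries
# #3 Hello Stormy Scratched - Reason Unavailable 11:10 AM ET
# #3 Hello Stormy Scratched - Reason Unavailable changed to Trainer 2:19 PM ET
```

Because a single-line record has the same first and last line, both cases go through the same code path.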
EDIT:
...I like your solution, but I'm getting undefined method for slice_before.
Try this:
require 'pp'
require 'rubygems'
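class Array
  # Define the fallback only when this Ruby lacks Array#slice_before.
  # method_defined? checks instance methods; Array.respond_to? would check
  # class methods and therefore always redefine the method.
  unless method_defined?(:slice_before)
    def slice_before(pat)
      result = []
      temp_result = []
      each do |i|
        # Start a new group whenever a later element matches the pattern.
        if i[pat] && !temp_result.empty?
          result << temp_result
          temp_result = []
        end
        temp_result << i
      end
      result << temp_result
    end
  end
end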
Calling that:
ary = [
'#3 Hello Stormy Scratched - Reason Unavailable 11:10 AM ET',
'#3 Hello Stormy Scratched - Reason Unavailable 11:10 AM ET',
' Scratch Reason - Reason Unavailable changed to Trainer 2:19 PM ET',
]
pp ary.slice_before(/\A\S/).to_a
Looks like:
[
["#3 Hello Stormy Scratched - Reason Unavailable 11:10 AM ET"],
["#3 Hello Stormy Scratched - Reason Unavailable 11:10 AM ET",
" Scratch Reason - Reason Unavailable changed to Trainer 2:19 PM ET"]
]
It gets rather simplified if you can assume that the '#' character does not appear anywhere else in the string. Then something like this should do it:
/^#[^#]*/m
Another more generic approach is to match first line starting with #, and any lines after that starting with space or tab:
/^#.*?$(\n^[ \t].*?$)*/m
If the line doesn't always start with #, you could replace it with [^ \t] (not space or tab).
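To see how those two patterns behave, here is a small sketch that runs both through String#scan on the sample data from the question. The variable names are made up for the example, and the group in the second pattern is written as non-capturing, (?:...), because scan returns only the captured groups when a pattern contains a capturing group.

```ruby
text = "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
       "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
       "\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET\n"

# First pattern: everything from a '#' up to the next '#'.
# Each match includes the trailing newline, so strip it.
by_hash = text.scan(/^#[^#]*/m).map(&:strip)

# Second pattern: a line starting with '#', plus any following lines
# that start with a space or tab.
by_indent = text.scan(/^#.*?$(?:\n^[ \t].*?$)*/m)

by_hash.each { |record| p record }
```

On this input both approaches should yield the same two records, the second of which contains an embedded "\n" between the main line and its continuation.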
Fun with REs! This is hacky, but there's a few different types of matching strategies in there.
# Two-line example
s = <<-EOS
#3\tHello Stormy\t\tScratched - Reason Unavailable\t\t\t11:10 AM ET\t
\t\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET
EOS
# allow leading/trailing whitespace, get the number, name, last reason and time
s =~ /\A\s*(#\d)\t+([^\t]+)(?:\t+.*)?(?:\t+(.*))\t+(\d+:\d+ (?:AM|PM) ET)\s*\Z/m
# ["#3", "Hello Stormy", "Scratch Reason - Reason Unavailable changed to Trainer", "2:19 PM ET"]
a = $1, $2, $3, $4
Note: this assumes only one message in the string that you're matching
Note: not tested for the single-line case :)
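Tracing the pattern above by hand on a single-line string suggests it still matches, but the third capture (the reason) comes back empty. The sketch below shows a hypothetical variant, not the original author's pattern, that captures the last tab-free field before the final time so both cases return a reason. It assumes the same tab layout as the example above, and it also accepts multi-digit numbers via #\d+.

```ruby
single = "#3\tHello Stormy\t\tScratched - Reason Unavailable\t\t\t11:10 AM ET\t\n"
double = single +
         "\t\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET\n"

# The pattern from the answer above, kept for comparison.
original = /\A\s*(#\d)\t+([^\t]+)(?:\t+.*)?(?:\t+(.*))\t+(\d+:\d+ (?:AM|PM) ET)\s*\Z/m

# Hypothetical variant: skip ahead greedily, then capture the last field that
# contains no tab or newline, followed by the final time in the string.
variant = /\A\s*(#\d+)\t+([^\t]+).*\t+([^\t\n]+)\t+(\d+:\d+ (?:AM|PM) ET)\s*\z/m

parse = ->(str) { (m = variant.match(str)) && m.captures }

p original.match(single)[3]
p parse.call(single)
p parse.call(double)
```

Like the original, this assumes only one message in the string being matched.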