Given a file of text, where the characters I want to match are delimited by single quotes, but might contain zero or one escaped single quote, as well as zero or more tabs and newline characters (not escaped), I want to match the text only. Example:
menu_item = 'casserole';
menu_item = 'meat
loaf';
menu_item = 'Tony\'s magic pizza';
menu_item = 'hamburger';
menu_item = 'Dave\'s famous pizza';
menu_item = 'Dave\'s lesser-known
gyro';
I want to grab only the text (and spaces), ignoring the tabs/newlines - and I don't actually care if the escaped quote appears in the results, as long as it doesn't affect the match:
casserole
meat loaf
Tonys magic pizza
hamburger
Daves famous pizza
Dave\'s lesser-known gyro # quote is okay if necessary.
I have managed to create a regex that almost does it - it handles the escaped quotes, but not the newlines:
menuPat = r"menu_item = \'(.*)(\\\')?(\t|\n)*(.*)\'"
for line in inFP.readlines():
    m = re.search(menuPat, line)
    if m is not None:
        print m.group()
There are definitely a ton of regular expression questions out there - but most are using Perl, and if there's one that does what I want, I couldn't figure it out :) And since I'm using Python, I don't care if it is spread across multiple groups, it's easy to recombine them.
Some Answers have said to just go with code for parsing the text. While I'm sure I could do that - I'm so close to having a working regex :) And it seems like it should be doable.
Update: I just realized that I am doing a Python readlines() to get each line, which obviously is breaking up the lines getting passed to the regex. I'm looking at re-writing it, but any suggestions on that part would also be very helpful.
This tested script should do the trick:
import re
re_sq_long = r"""
    # Match single quoted string with escaped stuff.
    '            # Opening literal quote
    (            # $1: Capture string contents
      [^'\\]*    # Zero or more non-', non-backslash
      (?:        # "unroll-the-loop"!
        \\.      # Allow escaped anything.
        [^'\\]*  # Zero or more non-', non-backslash
      )*         # Finish {(special normal*)*} construct.
    )            # End $1: String contents.
    '            # Closing literal quote
    """
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
data = r'''
menu_item = 'casserole';
menu_item = 'meat
loaf';
menu_item = 'Tony\'s magic pizza';
menu_item = 'hamburger';
menu_item = 'Dave\'s famous pizza';
menu_item = 'Dave\'s lesser-known
gyro';'''
matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
menu_items = []
for match in matches:
    match = re.sub(r'\s+', ' ', match)  # Clean whitespace
    match = re.sub(r'\\', '', match)    # Remove escapes
    menu_items.append(match)            # Add to menu list
print(menu_items)
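The question's update mentions that readlines() splits the quoted strings before the regex ever sees them. A minimal sketch of one way around that, assuming the file is small enough to read in one go: call read() and run findall over the whole text. The helper name read_menu_items and the temporary file are hypothetical and only there to make the example self-contained.

```python
import os
import re
import tempfile

# Same pattern as re_sq_short in the script above.
SQ_STRING = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)

def read_menu_items(path):
    """Hypothetical helper: return the cleaned menu items found in the file at path."""
    with open(path) as fp:
        text = fp.read()  # whole file at once, so newlines stay inside the quotes
    items = []
    for raw in SQ_STRING.findall(text):
        item = re.sub(r"\s+", " ", raw)  # collapse tabs/newlines into one space
        item = item.replace("\\", "")    # drop the escape backslashes
        items.append(item)
    return items

# A temporary file stands in for the real input file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("menu_item = 'meat\n\tloaf';\nmenu_item = 'Tony\\'s magic pizza';\n")

menu_items = read_menu_items(tmp.name)
os.remove(tmp.name)
print(menu_items)
```

For very large files this reads everything into memory, so it is a trade-off rather than a general recommendation.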
Here is the short version of the regex:
'([^'\\]*(?:\\.[^'\\]*)*)'
This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. (See Mastering Regular Expressions, 3rd Edition, for details.)
Note that the above regex is equivalent to the following one (which is more commonly seen but is much slower on most NFA regex implementations):
'((?:[^'\\]|\\.)*)'
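As a quick sanity check on that equivalence, here is a small sketch that runs both forms over a few of the sample lines and compares the results. It only checks that the matches agree on this input; it does not measure speed. The variable names are made up for the example.

```python
import re

# Unrolled form and the more common alternation form, both with DOTALL
# so that an escaped character may also be a newline.
unrolled = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)
alternation = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

sample = r"""
menu_item = 'casserole';
menu_item = 'meat
loaf';
menu_item = 'Tony\'s magic pizza';
"""

unrolled_hits = unrolled.findall(sample)
alternation_hits = alternation.findall(sample)

print(unrolled_hits == alternation_hits)
print(unrolled_hits)
```

Both patterns capture the raw string contents, so the newline and the backslash are still present at this stage and need the same clean-up as in the script above.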
This should do it:
menu_item = '((?:[^'\\]|\\')*)'
Here the (?:[^'\\]|\\')* part matches any sequence made up of characters other than ' and \, or of a literal \' (an escaped quote). The character class [^'\\] also matches line breaks and tabs, which you then need to replace with a single space.
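A minimal sketch of how that pattern might be used from Python, including the whitespace replacement step. The sample text and variable names are assumptions for illustration. No re.DOTALL flag is needed here, because a negated character class matches newlines anyway.

```python
import re

menu_pat = re.compile(r"menu_item = '((?:[^'\\]|\\')*)'")

# Two entries; the first contains an escaped quote, a newline, and a tab.
text = "menu_item = 'Dave\\'s lesser-known\n\tgyro';\nmenu_item = 'hamburger';"

# Replace each run of whitespace (spaces, tabs, newlines) with one space.
items = [re.sub(r"\s+", " ", m) for m in menu_pat.findall(text)]
print(items)
```

The escaped quote is left in the result as \', which the question says is acceptable.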
You could try it like this:
pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)
It will start matching at the first single quote it finds and it ends at the first single quote not preceded by a backslash. It also captures any newlines and tabs found between the two single quotes.
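Here is a short sketch of the lookbehind pattern applied to a whole string rather than line by line, with the same kind of clean-up afterwards. The sample text and the clean-up steps are assumptions added for illustration.

```python
import re

pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

text = "menu_item = 'meat\nloaf';\nmenu_item = 'Tony\\'s magic pizza';"

items = []
for raw in pattern.findall(text):
    item = re.sub(r"\s+", " ", raw)   # tabs/newlines become a single space
    item = item.replace("\\'", "'")   # turn the escaped quote into a plain one
    items.append(item)
print(items)
```

One caveat, as far as I can tell: if a string's content ends in a backslash (for example an escaped backslash right before the closing quote), the lookbehind treats that closing quote as escaped and the match runs on to a later quote. For the menu data shown in the question this does not come up.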