
Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)

Given a file of text, where the characters I want to match are delimited by single quotes, but might contain zero or one escaped single quote, as well as zero or more tabs and newline characters (not escaped), I want to match the text only. Example:

menu_item = 'casserole';
menu_item = 'meat 
            loaf';
menu_item = 'Tony\'s magic pizza';
menu_item = 'hamburger';
menu_item = 'Dave\'s famous pizza';
menu_item = 'Dave\'s lesser-known
    gyro';

I want to grab only the text (and spaces), ignoring the tabs/newlines - and I don't actually care if the escaped quote appears in the results, as long as it doesn't affect the match:

casserole
meat loaf
Tonys magic pizza
hamburger
Daves famous pizza
Dave\'s lesser-known gyro # quote is okay if necessary.

I have managed to create a regex that almost does it - it handles the escaped quotes, but not the newlines:

menuPat = r"menu_item = \'(.*)(\\\')?(\t|\n)*(.*)\'"
for line in inFP.readlines():
    m = re.search(menuPat, line)
    if m is not None:
        print(m.group())

There are definitely a ton of regular expression questions out there - but most are using Perl, and if there's one that does what I want, I couldn't figure it out :) And since I'm using Python, I don't care if it is spread across multiple groups, it's easy to recombine them.

Some Answers have said to just go with code for parsing the text. While I'm sure I could do that - I'm so close to having a working regex :) And it seems like it should be doable.

Update: I just realized that I am doing a Python readlines() to get each line, which obviously is breaking up the lines getting passed to the regex. I'm looking at re-writing it, but any suggestions on that part would also be very helpful.
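On the readlines() point, one possible approach is to read the whole file into a single string with read() and run findall() over it, so that values spanning several lines are seen in one piece. The sketch below is only an illustration under a few assumptions: the function name read_menu_items, the file name menu.txt, and the temporary directory are hypothetical, and the pattern is one escape-aware variant rather than the pattern from the question.

```python
import os
import re
import tempfile

# Text between single quotes after "menu_item = ", allowing backslash escapes.
MENU_PAT = re.compile(r"menu_item = '((?:[^'\\]|\\.)*)'", re.DOTALL)

def read_menu_items(path):
    """Read the whole file at once so multi-line values stay intact."""
    with open(path) as fp:
        text = fp.read()  # one string, newlines included
    # Collapse each run of spaces, tabs and newlines to a single space.
    return [re.sub(r"\s+", " ", raw) for raw in MENU_PAT.findall(text)]

# Demo input written to a temporary file (hypothetical file name).
sample = "menu_item = 'meat \n\t\tloaf';\nmenu_item = 'Tony\\'s magic pizza';\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "menu.txt")
    with open(path, "w") as fp:
        fp.write(sample)
    items = read_menu_items(path)

print(items)
```

The escaped quote is left in the result here, which the question says is acceptable.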

This tested script should do the trick:

import re
re_sq_long = r"""
    # Match single quoted string with escaped stuff.
    '            # Opening literal quote
    (            # $1: Capture string contents
      [^'\\]*    # Zero or more non-', non-backslash
      (?:        # "unroll-the-loop"!
        \\.      # Allow escaped anything.
        [^'\\]*  # Zero or more non-', non-backslash
      )*         # Finish {(special normal*)*} construct.
    )            # End $1: String contents.
    '            # Closing literal quote
    """
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"

data = r'''
        menu_item = 'casserole';
        menu_item = 'meat 
                    loaf';
        menu_item = 'Tony\'s magic pizza';
        menu_item = 'hamburger';
        menu_item = 'Dave\'s famous pizza';
        menu_item = 'Dave\'s lesser-known
            gyro';'''
matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
menu_items = []
for match in matches:
    match = re.sub(r'\s+', ' ', match) # Clean whitespace
    match = re.sub(r'\\', '', match)  # remove escapes
    menu_items.append(match)          # Add to menu list

print(menu_items)

Here is the short version of the regex:

'([^'\\]*(?:\\.[^'\\]*)*)'

This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. (See Mastering Regular Expressions, 3rd Edition, for details.)

Note that the above regex is equivalent to the following one (which is more commonly seen but is much slower on most NFA regex implementations):

'((?:[^'\\]|\\.)*)'
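To check the equivalence on a small input, a minimal sketch along these lines can be used. The variable names and the sample text are illustrative and not part of the original answer, and the sketch compares only what the two patterns match, not how fast they run.

```python
import re

# Unrolled form: normal* (special normal*)*
unrolled = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)
# Alternation form: (normal | special)*
alternation = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

# Raw string, so \' below is a literal backslash followed by a quote.
text = r"a = 'casserole'; b = 'Tony\'s magic pizza'; c = '';"

unrolled_hits = unrolled.findall(text)
alternation_hits = alternation.findall(text)

print(unrolled_hits)
print(unrolled_hits == alternation_hits)
```

Both patterns capture the same three strings on this input, including the empty string for c.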

This should do it:

menu_item = '((?:[^'\\]|\\')*)'

Here the (?:[^'\\]|\\')* part matches any sequence of characters other than ' and \, or the literal sequence \'. The first alternative, [^'\\], also matches line breaks and tabs, which you then need to replace with a single space.
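As a rough sketch of how this pattern and the whitespace replacement might be combined, assuming the input is already held in one string (the names data, raw, cleaned and unescaped are illustrative):

```python
import re

pattern = re.compile(r"menu_item = '((?:[^'\\]|\\')*)'")

# One value that spans two lines and contains an escaped quote.
data = "menu_item = 'Dave\\'s lesser-known\n    gyro';"

raw = pattern.search(data).group(1)
# Replace each run of spaces, tabs and newlines with a single space.
cleaned = re.sub(r"\s+", " ", raw)
# Optionally turn the escaped quote back into a plain quote.
unescaped = cleaned.replace("\\'", "'")

print(cleaned)
print(unescaped)
```

No re.DOTALL flag is needed for this pattern, because a negated character class such as [^'\\] matches newlines on its own.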

You could try it like this:

pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

It starts matching at the single quote that follows menu_item = and ends at the first single quote not preceded by a backslash. Because of re.DOTALL, it also captures any newlines and tabs found between the two single quotes.
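A small sketch of this pattern in use follows, together with one caveat that matters only if the data can contain an escaped backslash immediately before the closing quote (the sample data in the question does not). The test strings and variable names are made up for illustration, and the escape_aware pattern is added here only for comparison.

```python
import re

lookbehind = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)
# For comparison: a pattern that consumes each backslash escape as a pair.
escape_aware = re.compile(r"menu_item = '((?:[^'\\]|\\.)*)'", re.DOTALL)

# An escaped quote inside the value is handled as intended.
ok = "menu_item = 'Tony\\'s magic pizza';"
lookbehind_ok = lookbehind.search(ok).group(1)
print(lookbehind_ok)

# The first value ends in an escaped backslash (the text is 'C:\\').
# The lookbehind sees a backslash before the closing quote, skips it,
# and runs on to the next quote that is not preceded by a backslash.
tricky = "menu_item = 'C:\\\\'; menu_item = 'hamburger';"
lookbehind_tricky = lookbehind.search(tricky).group(1)
escape_aware_tricky = escape_aware.search(tricky).group(1)
print(repr(lookbehind_tricky))
print(repr(escape_aware_tricky))
```

If values never end in a backslash, the lookbehind version is the shortest of the patterns shown and is easy to read.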
