
Regex for matching Python multiline string with escaped characters

I'm parsing Python source code, and I have regular expressions for single- and double-quoted strings (obtained by reading ridgerunner's answer to this thread).

single_quote_re = "'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'";

double_quote_re = '"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"';
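For reference, here is a minimal sketch of how those two patterns behave with `re.findall`. The `source` line is a made-up sample, not taken from the question.

```python
import re

# Patterns copied from the question (ordinary, non-raw Python string literals).
single_quote_re = "'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'"
double_quote_re = '"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"'

# Hypothetical line of source text containing one string of each kind.
source = r'''a = 'it\'s' + "say \"hi\"" '''

# Group 1 holds the body of each literal, with the escapes left in place.
singles = re.findall(single_quote_re, source)
doubles = re.findall(double_quote_re, source)

print(singles)
print(doubles)
```

The captured bodies still contain the backslashes; the patterns only find where each literal ends, they do not decode escapes.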

I'm trying to handle Python multiline strings now (three double-quotes).

s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string two here"""'
# correct output for findall should be:
#     ['string one\'s end isn\'t here; \\""" it\'s here ','string two here']

I tried messing around with it a bit, but still it's not right.

multiline_string_re = '"""([^(""")\\\\]*(?:\\\\.[^(""")\\\\]*)*)"""'

There's gotta be some way to say """ that isn't immediately preceded by a backslash (in other words, the first double-quote isn't escaped).

EDIT: I should be getting closer; I've tried the following:

r'(?<!\\)""".*(?<!\\)"""'
# Matches the entire string; not what I'm going for.

r'(?<!\\)"""[^((?<!\\)""")](?<!\\)"""'
# Matches that space between the two strings ('""" """') in the sample string s (see code above, prior to edit).

r'(?<!\\)"""([^((?<!\\)""")]*(?:\\.[^((?<!\\)""")]*)*)(?<!\\)"""'
# Same result as before, but with the triple quotes shaved off (' ').
# Note: I do indeed want the triple quotes excluded.
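A short sketch of why these attempts misbehave, under the assumption that the goal is the output listed earlier. Inside a character class, parentheses, quotes and lookbehind syntax are just literal characters, and a greedy `.*` runs to the last closing delimiter. The `tricky` string is a made-up extra case.

```python
import re

# Inside [...], the parentheses and quotes are ordinary characters, so this
# class means "any character except an open paren, a double quote, or a
# close paren". It does not mean "anything but a triple quote".
negated_class = re.compile(r'[^(""")]+')
pieces = negated_class.findall('a(b"c)d')

s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string two here"""'

# Greedy .* backtracks from the end, so it stops at the LAST unescaped """.
greedy = re.findall(r'(?<!\\)"""(.*)(?<!\\)"""', s, re.DOTALL)

# Lazy .*? stops at the FIRST """ that is not preceded by a backslash.
lazy = re.findall(r'(?<!\\)"""(.*?)(?<!\\)"""', s, re.DOTALL)

# The lookbehind cannot tell an escaped quote from an escaped backslash:
# this literal ends with an escaped backslash, so its closing """ is real,
# yet the lookbehind rejects it.
tricky = r'"""a\\"""'
missed = re.findall(r'(?<!\\)"""(.*?)(?<!\\)"""', tricky, re.DOTALL)

print(pieces, len(greedy), lazy, missed)
```

The lazy variant happens to work on the sample string, but the last case is why a pattern that consumes each backslash together with the character after it tends to be more robust than a lookbehind.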

UPDATE: The solution, thanks to sln, appears to be """[^"\\]*(?:(?:\\.|"")[^"\\]*)*"""

multiline_string_re = '"""[^"\\\\]*(?:(?:\\\\.|"")[^"\\\\]*)*"""'
re.findall(multiline_string_re, s, re.DOTALL)
# Result:
# ['"""string one\'s end isn\'t here; \\""" it\'s here """', '"""string two here"""']

The updated solution, thanks again to sln:

multiline_single_re = "'''[^'\\\\]*(?:(?:\\\\.|'{1,2}(?!'))[^'\\\\]*)*'''"
multiline_double_re = '"""[^"\\\\]*(?:(?:\\\\.|"{1,2}(?!"))[^"\\\\]*)*"""'
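A quick check of the updated patterns against the sample string, as a sketch. The `quoted` and `single` test strings are made up for illustration. Since these patterns have no capture group, the delimiters are sliced off afterwards.

```python
import re

# Patterns copied from the post above.
multiline_single_re = "'''[^'\\\\]*(?:(?:\\\\.|'{1,2}(?!'))[^'\\\\]*)*'''"
multiline_double_re = '"""[^"\\\\]*(?:(?:\\\\.|"{1,2}(?!"))[^"\\\\]*)*"""'

s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string two here"""'

found = re.findall(multiline_double_re, s, re.DOTALL)
bodies = [m[3:-3] for m in found]  # drop the opening and closing triple quotes

# Hypothetical extra cases: one or two bare quotes inside the body.
quoted = '"""say "hi" and ""bye"" now"""'
single = "'''it's'''"
quoted_match = re.fullmatch(multiline_double_re, quoted, re.DOTALL)
single_match = re.fullmatch(multiline_single_re, single, re.DOTALL)

print(bodies)
```

With the sample `s`, `bodies` should equal the list the question asked for.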

This snippet should match three quotes that have anything but a backslash before them.

[^\\]"""

You can integrate that into your regex.
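One caveat, shown as a small sketch: `[^\\]` has to consume a character, so this snippet cannot match a triple quote at the very start of the text, and the match includes the preceding character. A negative lookbehind (as tried in the question) does not have that limitation. The sample string is reused from the question.

```python
import re

s = '"""string two here"""'

# [^\\] must match one real character before the quotes.
consuming = [m.span() for m in re.finditer(r'[^\\]"""', s)]

# (?<!\\) only asserts that no backslash precedes the quotes.
lookbehind = [m.span() for m in re.finditer(r'(?<!\\)"""', s)]

print(consuming)   # the opening quotes at index 0 are not found
print(lookbehind)  # both delimiters are found
```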

Here is a test case using a regex in Perl. If you are going to allow an escaped
"anything" as well as the doubled double-quote form "", just modify one of the
regexes you've cited to allow for the doubled double quote.

The single-quote escaping has been removed from the source string.

 use strict;
 use warnings;

 $/ = undef;

 my $str = <DATA>;

 while ($str =~ /"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"/sg )
 {
  print "found $&\n";

 }

  __DATA__

  """string one's end isn't here; \""" it's here """ """string two here"""

Output >>

 found """string one's end isn't here; \""" it's here """
 found """string two here"""

Note that for validity checking and error processing, the regex needs a
pass-through alternative (alternation) whose matches can be handled in the body of the while loop.
Example: /"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"|(.)/sg
then

 while ($str =~ /"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"|(.)/sg) {
     # group 1 matched and it is not whitespace = possible error
     warn "unexpected character: $1\n" if defined $1 && $1 =~ /\S/;
 }
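Since the question is about Python, here is a rough Python sketch of the same pass-through idea. The function name `scan` and the sample input are made up for illustration; the pattern is the one from the Perl test case with `|(.)` appended.

```python
import re

# String pattern from the Perl test case, plus a catch-all group for anything else.
token_re = re.compile(r'"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"|(.)', re.DOTALL)

def scan(text):
    """Return (string literals, stray non-whitespace characters with offsets)."""
    strings, stray = [], []
    for m in token_re.finditer(text):
        if m.group(1) is None:
            # The first alternative matched: a quoted string.
            strings.append(m.group(0))
        elif not m.group(1).isspace():
            # The catch-all matched something that is not whitespace.
            stray.append((m.start(), m.group(1)))
    return strings, stray

strings, stray = scan('"""string two here""" x')
print(strings, stray)
```

Anything collected in `stray` is text outside a string literal, which the caller can treat as a possible error or hand to another part of the parser.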

Addendum, in reply to comments.

After some research on Python block literals,

it appears you have to handle not only escaped characters, but also
up to two unescaped double quotes in the body, i.e. " or "".

Changing the regex is simple: add a {1,2} quantifier and restrain it with a negative lookahead assertion.

Below are the raw and string'd regex parts that you can pick and choose from.
Tested in Perl, it works.
Good luck!

 # Raw - 
 #   (?s:
 #   """[^"\\]*(?:(?:\\.|"{1,2}(?!"))[^"\\]*)*"""
 #   |
 #   '''[^'\\]*(?:(?:\\.|'{1,2}(?!'))[^'\\]*)*'''
 #   )
 # String'd -
 #   '(?s:'
 #   '"""[^"\\\\]*(?:(?:\\\\.|"{1,2}(?!"))[^"\\\\]*)*"""'
 #   '|'
 #   "'''[^'\\\\]*(?:(?:\\\\.|'{1,2}(?!'))[^'\\\\]*)*'''"
 #   ')'


 (?s:                # Dot-All
      # double quote literal block
      """                 # """ block open
      [^"\\]*             # 0 - many non " nor \
      (?:                 # Grp start
           (?:
                \\ .                # Escape anything
             |                      # or
                "{1,2}              # 1 - 2 "
                (?! " )             # Not followed by a "
           )
           [^"\\]*             # 0 - many non " nor \
      )*                  # Grp end, 0 - many times
      """                 # """ block close

   |                      # OR, 

      # single quote literal block
      '''                 # ''' block open
      [^'\\]*             # 0 - many non ' nor \
      (?:                 # Grp start
           (?:
                \\ .                # Escape anything
             |                      # or
                '{1,2}              # 1 - 2 '
                (?! ' )             # Not followed by a '
           )
           [^'\\]*             # 0 - many non ' nor \
      )*                  # Grp end, 0 - many times
      '''                 # ''' block close
 )
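The expanded form above can be carried over to Python with `re.VERBOSE`, roughly as sketched below. Two adjustments are assumptions of this sketch rather than part of the answer: the delimiters are written as `"{3}` and `'{3}` so the pattern fits inside a triple-quoted Python string, and the dot-all flag is passed as `re.DOTALL` instead of an inline `(?s:`. The name `triple_quoted_re` and the test text are made up.

```python
import re

triple_quoted_re = re.compile(r'''
      "{3}                        # block open: three double quotes
      [^"\\]*                     # run of chars that are neither a quote nor a backslash
      (?:
          (?: \\. | "{1,2}(?!") ) # an escaped char, or 1-2 quotes not followed by another
          [^"\\]*
      )*
      "{3}                        # block close
    |
      '{3}                        # same structure for the single-quote block
      [^'\\]*
      (?:
          (?: \\. | '{1,2}(?!') )
          [^'\\]*
      )*
      '{3}
''', re.DOTALL | re.VERBOSE)

s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string two here"""'
text = s + " '''it's'''"

matches = triple_quoted_re.findall(text)
print(matches)
```

Comments inside the verbose pattern avoid literal backslashes on purpose, because a backslash at the end of a comment line can swallow the newline and extend the comment.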

You can't parse Python source code with "simple" regexes.

The good news, however, is that the Python standard library comes with a full-fledged Python parser in the form of the ast module ( http://docs.python.org/2/library/ast.html ). Use that instead.

More specifically, the literal_eval function will parse literals (including all types of strings, and following escaping rules), and the parse function will parse arbitrary Python source code into an abstract syntax tree.
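As a minimal sketch of the `ast` approach, the snippet below parses a small made-up source text and collects every string value in the tree. It assumes Python 3.8 or later, where string literals appear as `ast.Constant` nodes.

```python
import ast

# Hypothetical source text: a triple-quoted string with an escaped quote,
# and two adjacent literals that the parser joins into one value.
source = r'''x = """a \""" b"""
y = 'it\'s' "joined"
'''

tree = ast.parse(source)

# Walk the tree and keep the value of every string constant.
strings = [
    node.value
    for node in ast.walk(tree)
    if isinstance(node, ast.Constant) and isinstance(node.value, str)
]

print(strings)
```

The values come back already decoded (escapes processed, adjacent literals joined), which is different from what a regex over the raw text returns.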

In addition, you should note that your example (s) actually parses to one string, 'string one\'s end isn\'t here; """ it\'s here string two here', because in Python adjacent string literals are concatenated at parse time, like so:

>>> "a" "b" "c"
'abc'

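To see this on the sample string, here is a small sketch using `ast.literal_eval`. The second half uses the standard `tokenize` module, which is not mentioned in the answer and is only one possible way to get the two literals separately before they are joined.

```python
import ast
import io
import tokenize

s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string two here"""'

# The whole text evaluates to ONE string: adjacent literals are concatenated.
joined = ast.literal_eval(s)

# tokenize yields each literal as its own STRING token (raw source text,
# quotes included), so the two pieces can be decoded separately.
literals = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(s).readline)
    if tok.type == tokenize.STRING
]
values = [ast.literal_eval(lit) for lit in literals]

print(repr(joined))
print(values)
```

Note that the decoded values no longer contain the backslash before the inner triple quote, since the escape has been processed.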