Parsing single or double quotes and allowing for escaped characters using regex (in Python)

Question

My input looks like a list of arguments:

input1 = '''
title="My First Blog" author='John Doe'
'''

Values can be enclosed in either single or double quotes; however, escaping is also allowed:

input2 = '''
title='John\'s First Blog' author="John Doe"
'''

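Is there a way to use a regex to extract the key/value pairs, accounting for single or double quotes as well as escaped quotes?

Using Python, I can use the following regex, which handles non-escaped quotes: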

rex = r"(\w+)\=(?P<quote>['\"])(.*?)(?P=quote)"

which returns:

import re
re.findall(rex, input1)
[('title', '"', 'My First Blog'), ('author', "'", 'John Doe')]

and

import re
re.findall(rex, input2)
[('title', "'", 'John'), ('author', '"', 'John Doe')]

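The latter is incorrect. I can't figure out how to handle the escaped quotes, presumably in the (.*?) part. I have tried the solutions from the posted answers to "Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)", to no avail.

Technically, I don't need findall to return the quote character, only the key/value pairs, but that is easy to deal with.

Any help would be appreciated! Thanks!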

Answer 1

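I think Tim's use of backreferences overcomplicates the expression and (I'm guessing here) also makes it slower. The standard approach (used in the owl book) is to match single-quoted and double-quoted strings separately: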

rx = r'''(?x)
    (\w+) = (
        ' (?: \\. | [^'] )* '
        |
        " (?: \\. | [^"] )* "
        |
        [^'"\s]+
    )
'''

Add a little post-processing and you're good:

input2 = r'''
title='John\'s First Blog' author="John Doe"
'''

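import codecs
import re

# Python 3 form: str has no .decode('string-escape'), so unescape with
# codecs.decode(..., 'unicode_escape') instead (ASCII input assumed).
data = {k: codecs.decode(v.strip("\"'"), 'unicode_escape') for k, v in re.findall(rx, input2)}
print(data)
# {'title': "John's First Blog", 'author': 'John Doe'}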

As a bonus, this also matches unquoted attributes such as weight=150.
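
To illustrate the unquoted case, here is a small sketch that runs the same pattern over a made-up input. The `sample` string and the `weight` attribute are assumptions for illustration only; nothing here goes beyond the pattern shown above.

```python
import re

# Same pattern as in the answer: each quote style gets its own alternative,
# and the last alternative accepts a bare (unquoted) token.
rx = r'''(?x)
    (\w+) = (
        ' (?: \\. | [^'] )* '
        |
        " (?: \\. | [^"] )* "
        |
        [^'"\s]+
    )
'''

# Hypothetical input mixing quoted and unquoted values.
sample = r'''title='John\'s First Blog' weight=150 author="John Doe"'''

pairs = re.findall(rx, sample)
for key, raw_value in pairs:
    # raw_value still carries its surrounding quotes and backslashes here.
    print(key, raw_value)
```

Note that the quoted values come back with their quotes and escapes intact, which is why the post-processing step is still needed.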

Addendum: here is a cleaner way that uses no regular expressions:

input2 = r'''
title='John\'s First Blog' author="John Doe"
'''

import shlex

lex = shlex.shlex(input2, posix=True)
lex.escapedquotes = '\"\''
lex.whitespace = ' \n\t='
for token in lex:
    print(token)

# title
# John's First Blog
# author
# John Doe
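
If a dictionary is wanted rather than a flat token stream, the tokens can be paired up. The following is only a sketch: the helper name `parse_pairs` is made up, and it assumes that keys and values strictly alternate, so every key has exactly one value.

```python
import shlex

def parse_pairs(text):
    """Split text into tokens with shlex and pair them up as key/value."""
    lex = shlex.shlex(text, posix=True)
    lex.escapedquotes = '"\''      # allow backslash escapes inside both quote styles
    lex.whitespace = ' \n\t='      # treat '=' as a separator, like whitespace
    tokens = list(lex)
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(tokens[0::2], tokens[1::2]))

input2 = r'''
title='John\'s First Blog' author="John Doe"
'''

result = parse_pairs(input2)
print(result)
```

Because '=' is treated as whitespace, a value that itself contains an unquoted '=' would be split in two, so this only fits inputs shaped like the one in the question.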

Answer 2

Edit

My initial regex solution had a bug in it. That bug masked an error in your input string: input2 is not what you think it is:

>>> input2 = '''
... title='John\'s First Blog' author="John Doe"
... '''
>>> input2      # See - the apostrophe is not correctly escaped!
'\ntitle=\'John\'s First Blog\' author="John Doe"\n'

You need to make input2 a raw string (or use doubled backslashes):

>>> input2 = r'''
... title='John\'s First Blog' author="John Doe"
... '''
>>> input2
'\ntitle=\'John\\\'s First Blog\' author="John Doe"\n'
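
The difference can also be checked directly. This is a small sketch; the variable names `plain`, `raw` and `doubled` are made up for illustration.

```python
# Without the r prefix, Python consumes the backslash while parsing the literal.
plain = '''title='John\'s First Blog' '''

# With the r prefix the backslash stays in the string.
raw = r'''title='John\'s First Blog' '''

# A doubled backslash in a normal literal produces the same string as the raw one.
doubled = '''title='John\\'s First Blog' '''

print('\\' in plain)    # is there a backslash in the plain literal?
print('\\' in raw)      # is there a backslash in the raw literal?
print(raw == doubled)   # raw and doubled-backslash forms compared
```

Only the raw and doubled-backslash forms give the regex an escaped quote to work with; the plain form contains a bare apostrophe.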

Now you can use a regex that handles escaped quotes correctly:

>>> rex = re.compile(
    r"""(\w+)# Match an identifier (group 1)
    =        # Match =
    (['"])   # Match an opening quote (group 2)
    (        # Match and capture into group 3:
     (?:     # the following regex:
      \\.    # Either an escaped character
     |       # or
      (?!\2) # (as long as we're not right at the matching quote)
      .      # any other character.
     )*      # Repeat as needed
    )        # End of capturing group
    \2       # Match the corresponding closing quote.""", 
    re.DOTALL | re.VERBOSE)
>>> rex.findall(input2)
[('title', "'", "John\\'s First Blog"), ('author', '"', 'John Doe')]
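
The matched values still contain the backslashes, and the quote character is returned as a separate group. One possible way to turn the result into plain key/value pairs is sketched below. The helper `to_dict` is hypothetical, and it assumes that every backslash escape simply stands for the character that follows it (so `\n` would become `n`, not a newline).

```python
import re

# Compact form of the pattern above: key, opening quote, body, same closing quote.
rex = re.compile(r"""(\w+)=(['"])((?:\\.|(?!\2).)*)\2""", re.DOTALL)

def to_dict(text):
    """Return {key: value}, dropping the quote group and the backslashes."""
    result = {}
    for key, _quote, value in rex.findall(text):
        # Replace each backslash-escaped character with the character itself.
        result[key] = re.sub(r'\\(.)', r'\1', value, flags=re.DOTALL)
    return result

input2 = r'''
title='John\'s First Blog' author="John Doe"
'''

parsed = to_dict(input2)
print(parsed)
```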
