Parsing single or double quotes and allowing for escaped characters using regular expressions (in Python)
I have input that looks like a list of arguments:
input1 = '''
title="My First Blog" author='John Doe'
'''
The values can be surrounded by single or double quotes; however, escaping is also allowed:
input2 = '''
title='John\'s First Blog' author="John Doe"
'''
Is there a way to use regular expressions to extract the key/value pairs, accounting for either single or double quotes and for escaped quotes?
Using Python, I can use the following regular expression and handle the non-escaped quotes:
rex = r"(\w+)\=(?P<quote>['\"])(.*?)(?P=quote)"
The results are then:
import re
re.findall(rex, input1)
[('title', '"', 'My First Blog'), ('author', "'", 'John Doe')]
and
import re
re.findall(rex, input2)
[('title', "'", 'John'), ('author', '"', 'John Doe')]
The latter is incorrect. I can't figure out how to handle escaped quotes, presumably in the (.*?) section. I've been working with the solutions posted in the answers to "Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)", to no avail.
Technically, I don't need findall to return the quote character, just the key/value pairs, but that is easily dealt with.
Any help would be appreciated! Thanks!
I think Tim's use of backreferences overcomplicates the expression and (guessing here) also makes it slower. The standard approach (used in the owl book) is to match single- and double-quoted strings separately:
rx = r'''(?x)
(\w+) = (
' (?: \\. | [^'] )* '
|
" (?: \\. | [^"] )* "
|
[^'"\s]+
)
'''
Add a bit of postprocessing and you're fine:
input2 = r'''
title='John\'s First Blog' author="John Doe"
'''
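import re
# Strip the surrounding quotes, then undo backslash escapes. This is a rough
# Python 3 counterpart of Python 2's .decode('string-escape'); it assumes ASCII values.
data = {k: v.strip("\"'").encode().decode('unicode_escape') for k, v in re.findall(rx, input2)}
print(data)
# {'title': "John's First Blog", 'author': 'John Doe'}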
As a bonus, this also matches unquoted attributes like weight=150.
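A quick way to check the unquoted case is sketched below. The pattern is the same `rx` as above; the `sample` string and its `weight` key are made up for illustration.

```python
import re

# Same pattern as in the answer: single-quoted, double-quoted, or bare value.
rx = r'''(?x)
    (\w+) = (
        ' (?: \\. | [^'] )* '
        |
        " (?: \\. | [^"] )* "
        |
        [^'"\s]+
    )
'''

# Hypothetical input mixing quoted and unquoted values.
sample = r'''title='John\'s First Blog' weight=150 author="John Doe"'''

# Each result is a (key, raw_value) tuple; quoted values still carry their quotes.
pairs = re.findall(rx, sample)
for key, raw_value in pairs:
    print(key, raw_value)
```

Quoted values come back with their quotes and backslashes intact, so the postprocessing step is still needed for them; the bare `150` comes back as the string `'150'`.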
Addendum: here's a cleaner way without regular expressions:
input2 = r'''
title='John\'s First Blog' author="John Doe"
'''
import shlex
lex = shlex.shlex(input2, posix=True)
lex.escapedquotes = '\"\''
lex.whitespace = ' \n\t='
for token in lex:
    print(token)
# title
# John's First Blog
# author
# John Doe
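If a dictionary is wanted rather than a flat token stream, the tokens can be paired up. This is only a sketch: it assumes the tokens strictly alternate key, value, key, value, which holds for this input but not for input with a missing value.

```python
import shlex

input2 = r'''
title='John\'s First Blog' author="John Doe"
'''

lex = shlex.shlex(input2, posix=True)
lex.escapedquotes = '"\''   # allow backslash escapes inside both quote styles
lex.whitespace = ' \n\t='   # treat '=' as a separator, as in the answer

tokens = list(lex)
# Assumption: tokens alternate key, value, key, value.
data = dict(zip(tokens[::2], tokens[1::2]))
print(data)
```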
EDIT
My initial regex solution had a bug in it. That bug masked an error in your input string: input2 is not what you think it is:
>>> input2 = '''
... title='John\'s First Blog' author="John Doe"
... '''
>>> input2 # See - the apostrophe is not correctly escaped!
'\ntitle=\'John\'s First Blog\' author="John Doe"\n'
You need to make input2 a raw string (or use double backslashes):
>>> input2 = r'''
... title='John\'s First Blog' author="John Doe"
... '''
>>> input2
'\ntitle=\'John\\\'s First Blog\' author="John Doe"\n'
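The difference between the three spellings can be checked directly. In this small sketch the variable names are made up for illustration, and the trailing space inside each literal is only there to keep the final quote away from the closing triple quote.

```python
# In a normal string literal, \' is an escape sequence: Python drops the backslash.
plain = '''title='John\'s First Blog' '''
# In a raw string literal, the backslash stays in the data.
raw = r'''title='John\'s First Blog' '''
# A doubled backslash in a normal literal also yields one real backslash.
doubled = '''title='John\\'s First Blog' '''

plain_has_backslash = '\\' in plain
raw_has_backslash = '\\' in raw

print(plain_has_backslash)  # False
print(raw_has_backslash)    # True
print(raw == doubled)       # True
```

Only the raw and doubled forms hand the regex a backslash to work with.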
Now you can use a regex that handles escaped quotes correctly:
>>> import re
>>> rex = re.compile(
...     r"""(\w+)   # Match an identifier (group 1)
...     =           # Match =
...     (['"])      # Match an opening quote (group 2)
...     (           # Match and capture into group 3:
...      (?:        # the following regex:
...       \\.       # Either an escaped character
...       |         # or
...       (?!\2)    # (as long as we're not right at the matching quote)
...       .         # any other character.
...      )*         # Repeat as needed
...     )           # End of capturing group
...     \2          # Match the corresponding closing quote.""",
...     re.DOTALL | re.VERBOSE)
>>> rex.findall(input2)
[('title', "'", "John\\'s First Blog"), ('author', '"', 'John Doe')]
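The matches still contain the quote character and the escaping backslash. One possible postprocessing step is sketched here, using a compact, non-verbose spelling of the same pattern. It assumes a backslash only ever means "take the next character literally", so `\n` would become `n` rather than a newline.

```python
import re

# Compact form of the pattern above: key, opening quote, body, same closing quote.
rex = re.compile(r"""(\w+)=(['"])((?:\\.|(?!\2).)*)\2""", re.DOTALL)

input2 = r'''
title='John\'s First Blog' author="John Doe"
'''

# Drop the captured quote character and remove one level of backslash escaping.
# Assumption: backslash + char stands for that char; no \n or \t translation.
data = {
    key: re.sub(r'\\(.)', r'\1', value, flags=re.DOTALL)
    for key, _quote, value in rex.findall(input2)
}
print(data)
```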