Regex for parsing a commented configuration file

Question

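EDIT: I'm really just curious about how I can get this regex to work. Please don't tell me there are easier ways to do this. That's obvious! :P

I'm writing a regular expression (in Python) to parse lines in a configuration file. The lines look like this: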

someoption1 = some value # some comment
# this line is only a comment
someoption2 = some value with an escaped \# hash
someoption3 = some value with a \# hash # some comment

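The idea is that anything after a hash symbol is considered a comment, unless the hash is escaped with a backslash.

I'm trying to use a regex to break each line into its individual pieces: leading whitespace, the left side of the assignment, the right side of the assignment, and the comment. For the first line in the example, the breakdown would be:

Whitespace: ""
Assignment left: "someoption1 ="
Assignment right: "some value"
Comment: "# some comment"

This is the regex I have so far: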

^(\s)?(\S+\s?=)?(([^\#]*(\\\#)*)*)?(\#.*)?$

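I'm terrible with regex, so feel free to tear it apart!

Using Python's re.findall(), this returns:

0th index: the whitespace, as it should be
1st index: the left side of the assignment
2nd index: the right side of the assignment, up to the first hash, whether escaped or not (which is incorrect)
5th index: the first hash, whether escaped or not, and anything after it (which is incorrect)

I'm probably missing something fundamental about regular expressions. If someone can solve this I'll be forever grateful...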

Answer 1

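The reason your regex doesn't match the way you want is the greedy matching behaviour of regexes: each part matches the longest substring such that the rest of the string can still be matched by the rest of the regex.

This means that, if one of your lines contains an escaped #:

[^\#]* (there's no need to escape # btw) will match everything before the first hash, including the backslash before it
(\\\#)* will not match anything, since the string at this point begins with #
finally, (\#.*) will match the rest of the string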
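
The three steps above can be observed directly. The sketch below runs the regex from the question against the fourth sample line and prints the two groups that come out wrong; the names `ORIGINAL`, `right_side` and `comment` are only for this illustration.

```python
import re

# The regex from the question, unchanged.
ORIGINAL = re.compile(r"^(\s)?(\S+\s?=)?(([^\#]*(\\\#)*)*)?(\#.*)?$")

line = r"someoption3 = some value with a \# hash # some comment"
m = ORIGINAL.match(line)

# Group 3 is the right-hand side: [^\#]* has swallowed the backslash
# and stopped at the first hash, escaped or not.
right_side = m.group(3)

# Group 6 is the comment: it starts at the escaped hash.
comment = m.group(6)

print(repr(right_side))
print(repr(comment))
```

The right-hand side ends with the backslash, and the "comment" begins at the hash that was supposed to be escaped.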

A simple example to highlight this possibly unintuitive behaviour: in the regex (a*)(ab)?(b*), (ab)? will never match anything.
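
A quick way to check this in Python (a minimal sketch; the input string "aab" is just a convenient test case):

```python
import re

# a* greedily takes every leading "a", so (ab)? has no "a" left to start with.
m = re.match(r"(a*)(ab)?(b*)", "aab")
groups = m.groups()
print(groups)  # ('aa', None, 'b')
```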

I believe this regex (based on the original one) should work: ^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$
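
As a rough check, the sketch below applies that regex to the four sample lines from the question. Note that it captures the whole assignment (left and right side together) in group 1 and the comment in group 3; the names `FIXED`, `samples` and `results` are assumptions made for this example.

```python
import re

FIXED = re.compile(r"^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$")

samples = [
    r"someoption1 = some value # some comment",
    r"# this line is only a comment",
    r"someoption2 = some value with an escaped \# hash",
    r"someoption3 = some value with a \# hash # some comment",
]

results = []
for s in samples:
    m = FIXED.match(s)
    # Group 1: the assignment (or None); group 3: the comment (or None).
    results.append((m.group(1), m.group(3)))

for r in results:
    print(r)
```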

Answer 2

I would use this regex in multiline mode:

^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#]|\\.)+)

This allows any character to be escaped (\\.). If you only want to allow #, use \\# instead.
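
A minimal sketch of how this pattern might be used. It is applied here one line at a time, which is an assumption on my part: the class [^\\#] also matches a newline, so feeding single lines keeps a value from running on into the next line. The helper name `parse_option` is hypothetical.

```python
import re

OPTION = re.compile(
    r"^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#]|\\.)+)",
    re.MULTILINE,
)

def parse_option(line):
    """Return (name, value) for an assignment line, or None otherwise."""
    m = OPTION.match(line)
    return m.groups() if m else None

print(parse_option(r"someoption1 = some value # some comment"))
print(parse_option(r"# this line is only a comment"))
print(parse_option(r"someoption3 = some value with a \# hash # some comment"))
```

The value keeps its trailing whitespace and its backslashes; stripping and unescaping would be a separate step.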

Answer 3

Of the 5 solutions presented so far, only Gumbo's actually works. Here is my solution, which also works and is heavily commented:

import re

def fn(line):
    match = re.search(
        r"""^          # Anchor to start of line
        (\s*)          # $1: Zero or more leading ws chars
        (?:            # Begin group for optional var=value.
          (\S+)        # $2: Variable name. One or more non-spaces.
          (\s*=\s*)    # $3: Assignment operator, optional ws
          (            # $4: Everything up to comment or EOL.
            [^#\\]*    # Unrolling the loop 1st normal*.
            (?:        # Begin (special normal*)* construct.
              \\.      # special is backslash-anything.
              [^#\\]*  # More normal*.
            )*         # End (special normal*)* construct.
          )            # End $4: Value.
        )?             # End group for optional var=value.
        ((?:\#.*)?)    # $5: Optional comment.
        $              # Anchor to end of line""", 
        line, re.MULTILINE | re.VERBOSE)
    return match.groups()

print (fn(r" # just a comment"))
print (fn(r" option1 = value"))
print (fn(r" option2 = value # no escape == IS a comment"))
print (fn(r" option3 = value \# 1 escape == NOT a comment"))
print (fn(r" option4 = value \\# 2 escapes == IS a comment"))
print (fn(r" option5 = value \\\# 3 escapes == NOT a comment"))
print (fn(r" option6 = value \\\\# 4 escapes == IS a comment"))

The above script produces the following (correct) output (tested with Python 3.0.1):

(' ', None, None, None, '# just a comment')
(' ', 'option1', ' = ', 'value', '')
(' ', 'option2', ' = ', 'value ', '# no escape == IS a comment')
(' ', 'option3', ' = ', 'value \\# 1 escape == NOT a comment', '')
(' ', 'option4', ' = ', 'value \\\\', '# 2 escapes == IS a comment')
(' ', 'option5', ' = ', 'value \\\\\\# 3 escapes == NOT a comment', '')
(' ', 'option6', ' = ', 'value \\\\\\\\', '# 4 escapes == IS a comment')

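Note that this solution uses Jeffrey Friedl's "unrolling the loop" efficiency technique (which eliminates slow alternation). It uses no lookaround at all and is very fast. Mastering Regular Expressions (3rd Edition) is a must-read for anyone claiming to "know" regular expressions. (And when I say "know", I mean in the Neo "I know Kung-Fu!" sense :)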
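
To illustrate what "unrolling the loop" means here, the sketch below compares the value sub-pattern written with alternation against the unrolled form used in the script above. It only shows that both match the same text on a few inputs; it does not measure speed. The names `ALTERNATION`, `UNROLLED`, `values` and `pairs` are assumptions made for this example.

```python
import re

# Alternation form: one choice per character (normal | special).
ALTERNATION = re.compile(r"(?:[^#\\]|\\.)*")

# Unrolled form: normal* (special normal*)*
UNROLLED = re.compile(r"[^#\\]*(?:\\.[^#\\]*)*")

values = [
    r"value",
    r"value \# kept",
    r"value \\# cut here",
    r"value # cut here",
]

# Pair up what each pattern matches at the start of every input.
pairs = [
    (ALTERNATION.match(v).group(0), UNROLLED.match(v).group(0))
    for v in values
]

for alt, unrolled in pairs:
    print(repr(unrolled), alt == unrolled)
```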

Answer 4

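I've left a comment about the purpose of this question, but assuming the question is purely about regular expressions, I'll still give an answer.

Assuming you're only handling the input one line at a time, I would make it a two-pass process. This means you'll have 2 regexes.

(.*?(?<!\\))#(.*) : first, split on a # that is not preceded by \ (see the documentation on negative lookbehind);
then parse the assignment statement expression.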
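
The two passes can be sketched as follows. The first pattern is the one given above; the second pattern `ASSIGN_RE` and the function name `parse` are assumptions made for illustration, since the answer does not spell out the assignment regex.

```python
import re

# Pass 1: everything before the first unescaped '#', then the comment text.
COMMENT_RE = re.compile(r"(.*?(?<!\\))#(.*)")

# Pass 2 (hypothetical): name, '=', value with surrounding whitespace trimmed.
ASSIGN_RE = re.compile(r"\s*(\S+)\s*=\s*(.*?)\s*$")

def parse(line):
    """Return (key, value, comment); missing parts are None."""
    m = COMMENT_RE.match(line)
    if m:
        body, comment = m.group(1), m.group(2)
    else:
        body, comment = line, None
    a = ASSIGN_RE.match(body)
    key, value = (a.group(1), a.group(2)) if a else (None, None)
    return key, value, comment

print(parse(r"someoption3 = some value with a \# hash # some comment"))
print(parse(r"# this line is only a comment"))
```

The value still contains the escaped \# at this point; unescaping is left out of the sketch.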

Answer 5

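I wouldn't use a regex at all, for the same reason I wouldn't try to kill a fly with a thermonuclear warhead.

Assuming you're reading one line at a time, just:

If the first character is #, set the comment to the whole line and empty the line.
Otherwise, find the first occurrence of # that is not immediately after a \, set the comment to that plus the rest of the line, and set the line to everything before it.
Replace all occurrences of \# in the line with #.

That's it, you now have a proper line and a comment section. Use regexes to split the new line section, by all means.

For example: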

import re

def fn(line):
    # Split line into non-comment and comment.

    comment = ""
    if line.startswith("#"):  # also safe for an empty line
        comment = line
        line = ""
    else:
        idx = re.search (r"[^\\]#", line)
        if idx != None:
            comment = line[idx.start()+1:]
            line = line[:idx.start()+1]

    # Split non-comment into key and value.

    idx = re.search (r"=", line)
    if idx == None:
        key = line
        val = ""
    else:
        key = line[:idx.start()]
        val = line[idx.start()+1:]
    val = val.replace ("\\#", "#")

    return (key.strip(),val.strip(),comment.strip())

print(fn(r"someoption1 = some value # some comment"))
print(fn(r"# this line is only a comment"))
print(fn(r"someoption2 = some value with an escaped \# hash"))
print(fn(r"someoption3 = some value with a \# hash # some comment"))

This produces:

('someoption1', 'some value', '# some comment')
('', '', '# this line is only a comment')
('someoption2', 'some value with an escaped # hash', '')
('someoption3', 'some value with a # hash', '# some comment')

If you must use a regex (against my advice), your specific problem lies here:

[^\#]

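This (assuming you meant the properly escaped r"[^\\#]") will try to match any single character other than \ or #, rather than the sequence \# as you intended. You could use negative lookbehind to do this, but I always say that once a regular expression becomes hard to read in a hurry, it's better to revert to procedural code :-)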
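
The difference is easy to see on a small input. In this sketch (the string `text` and the variable names are made up for illustration), the character class simply matches every character that is neither a backslash nor a hash, while the negative lookbehind finds the first hash that is not preceded by a backslash:

```python
import re

text = r"a \# b # c"

# Character class: one character at a time, skipping both '\' and '#'.
singles = re.findall(r"[^\\#]", text)

# Negative lookbehind: position of the first '#' not preceded by '\'.
first_comment_hash = re.search(r"(?<!\\)#", text).start()

print(singles)             # ['a', ' ', ' ', 'b', ' ', ' ', 'c']
print(first_comment_hash)  # 7
```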

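On reflection, a better approach is to use a multi-level split (so that the regex doesn't have to become too horrible by handling missing fields), as follows: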

def fn(line):
    line = line.strip()                            # remove spaces
    first = re.split (r"\s*(?<!\\)#\s*", line, 1)  # get non-comment/comment
    if len(first) == 1: first.append ("")          # ensure we have a comment
    first[0] = first[0].replace("\\#","#")         # unescape non-comment

    second = re.split (r"\s*=\s*", first[0], 1)    # get key and value
    if len(second) == 1: second.append ("")        # ensure we have a value
    second.append (first[1])                       # create 3-tuple
    return second                                  # and return it

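This uses a negative lookbehind to correctly match the comment separator, then splits the non-comment part into key and value. Whitespace is handled properly in this version as well, yielding: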

['someoption1', 'some value', 'some comment']
['', '', 'this line is only a comment']
['someoption2', 'some value with an escaped # hash', '']
['someoption3', 'some value with a # hash', 'some comment']

Answer 6

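Try breaking it down into 2 steps:

Escape handling to recognise the real comment (the first # not preceded by \ (hint: "negative lookbehind")), remove the real comment, then replace r"\#" with "#".
Process the comment-less remainder.

Big hint: use re.VERBOSE and comments.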
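
A minimal sketch of the first step, written with re.VERBOSE as hinted. The names `COMMENT_START` and `split_comment` are hypothetical, and processing the comment-less remainder (the second step) is left out.

```python
import re

COMMENT_START = re.compile(
    r"""
    (?<!\\)   # not preceded by a backslash
    \#        # a literal hash; escaped, since a bare hash starts a comment in VERBOSE mode
    """,
    re.VERBOSE,
)

def split_comment(line):
    """Return (body, comment) with escaped hashes in the body unescaped."""
    m = COMMENT_START.search(line)
    if m is None:
        body, comment = line, ""
    else:
        body, comment = line[:m.start()], line[m.start():]
    return body.replace(r"\#", "#"), comment

print(split_comment(r"someoption3 = some value with a \# hash # some comment"))
```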

6 solutions

Solution 1
2 accepted 2010-09-24 02:07:48

Solution 2
2 2010-09-24 05:49:11

Solution 3
2 2011-03-12 23:21:02

Solution 4
1 2010-09-24 01:48:34

Solution 5
0 2010-09-24 01:37:59

Solution 6
0 2010-09-24 01:48:56
