Regex for parsing a commented configuration file

Question

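Edit: I'm really just curious about how I can get this regex to work. Please don't tell me there are easier ways to do it. That's obvious! :P

I'm writing a regular expression (using Python) to parse lines in a configuration file. Lines look like this: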

someoption1 = some value # some comment
# this line is only a comment
someoption2 = some value with an escaped \# hash
someoption3 = some value with a \# hash # some comment

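The idea is that anything after a hash symbol is considered a comment, unless the hash is escaped with a backslash.

I'm trying to use a regex to break each line into its individual pieces: leading whitespace, left side of the assignment, right side of the assignment, and comment. For the first line in the example, the breakdown would be:

Whitespace: ""
Assignment left: "someoption1 ="
Assignment right: " some value "
Comment: "# some comment"

This is the regex I have so far: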

^(\s)?(\S+\s?=)?(([^\#]*(\\\#)*)*)?(\#.*)?$

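I'm terrible with regex, so feel free to tear it apart!

Using Python's re.findall(), this returns:

0th index: the whitespace, as it should be
1st index: the left side of the assignment
2nd index: the right side of the assignment, up to the first hash, whether escaped or not (which is incorrect)
5th index: the first hash, whether escaped or not, and anything after it (which is incorrect)

I'm probably missing something fundamental about regular expressions. If somebody can solve this, I'll be forever grateful...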

Answer 1

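Your regex doesn't match as intended because of greedy matching: each part matches the longest substring such that the rest of the string can still be matched by the rest of the regex.

This means that, if one of your lines contains an escaped #:

[^\#]* (there's no need to escape #, by the way) will match everything before the first hash, including the backslash before it
(\\\#)* will not match anything, because the string at this point begins with #
finally, (\#.*) will match the rest of the string

A simple example to underline this possibly unintuitive behavior: in the regex (a*)(ab)?(b*), (ab)? will never match anything.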
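To see the greedy behavior for yourself, here is a minimal sketch using that same toy pattern. The input string "aab" is just an illustrative choice.

```python
import re

# Greedy a* consumes both leading "a" characters, so when the optional
# (ab) group is tried, the next character is "b" and the group is skipped.
match = re.match(r"(a*)(ab)?(b*)", "aab")

# The middle group did not participate, so it is reported as None.
print(match.groups())
```

Because (b*) can always match the empty string, the engine never has a reason to backtrack and give an "a" back to (ab)?.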

I believe this regex (based on the original one) should work: ^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$
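A quick way to try that pattern on the sample lines from the question is sketched below. The helper name split_line is made up for this example, and it only returns the assignment and comment groups.

```python
import re

# Pattern copied from the answer; split_line is a hypothetical helper.
PATTERN = re.compile(r"^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$")

def split_line(line):
    """Return (assignment, comment); either part is None when absent."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.group(1), match.group(3)

print(split_line(r"someoption3 = some value with a \# hash # some comment"))
print(split_line("# this line is only a comment"))
```

The escaped hash stays inside the assignment part, and only the unescaped hash starts the comment.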

Answer 2

I would use this regex in multiline mode:

^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#]|\\.)+)

This allows any character to be escaped (\\.). If you only want to allow #, use \\# instead.
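Here is a rough sketch of how that pattern might be applied. One assumption on my part: it is run on one line at a time, because [^\\#] also matches a newline, so on a whole file the value could run on into the following line. The helper name parse_option is hypothetical.

```python
import re

# Pattern copied from the answer, compiled in multiline mode as suggested.
OPTION = re.compile(
    r"^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#]|\\.)+)",
    re.MULTILINE,
)

def parse_option(line):
    """Return (name, value) for an assignment line, or None otherwise."""
    match = OPTION.match(line)
    return match.groups() if match else None

print(parse_option(r"someoption3 = some value with a \# hash # some comment"))
print(parse_option("# this line is only a comment"))
```

Note that the value keeps its trailing whitespace and its backslashes; stripping and unescaping would be a separate step.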

Answer 3

Of the 5 solutions presented so far, only Gumbo's actually works. Here is my solution, which also works and is heavily commented:

import re

def fn(line):
    match = re.search(
        r"""^          # Anchor to start of line
        (\s*)          # $1: Zero or more leading ws chars
        (?:            # Begin group for optional var=value.
          (\S+)        # $2: Variable name. One or more non-spaces.
          (\s*=\s*)    # $3: Assignment operator, optional ws
          (            # $4: Everything up to comment or EOL.
            [^#\\]*    # Unrolling the loop 1st normal*.
            (?:        # Begin (special normal*)* construct.
              \\.      # special is backslash-anything.
              [^#\\]*  # More normal*.
            )*         # End (special normal*)* construct.
          )            # End $4: Value.
        )?             # End group for optional var=value.
        ((?:\#.*)?)    # $5: Optional comment.
        $              # Anchor to end of line""", 
        line, re.MULTILINE | re.VERBOSE)
    return match.groups()

print (fn(r" # just a comment"))
print (fn(r" option1 = value"))
print (fn(r" option2 = value # no escape == IS a comment"))
print (fn(r" option3 = value \# 1 escape == NOT a comment"))
print (fn(r" option4 = value \\# 2 escapes == IS a comment"))
print (fn(r" option5 = value \\\# 3 escapes == NOT a comment"))
print (fn(r" option6 = value \\\\# 4 escapes == IS a comment"))

The above script produces the following (correct) output (tested with Python 3.0.1):

(' ', None, None, None, '# just a comment')
(' ', 'option1', ' = ', 'value', '')
(' ', 'option2', ' = ', 'value ', '# no escape == IS a comment')
(' ', 'option3', ' = ', 'value \\# 1 escape == NOT a comment', '')
(' ', 'option4', ' = ', 'value \\\\', '# 2 escapes == IS a comment')
(' ', 'option5', ' = ', 'value \\\\\\# 3 escapes == NOT a comment', '')
(' ', 'option6', ' = ', 'value \\\\\\\\', '# 4 escapes == IS a comment')

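Note that this solution uses Jeffrey Friedl's "unrolling the loop" efficiency technique (which eliminates slow alternation). It uses no lookaround at all and is very fast. Mastering Regular Expressions (3rd Edition) is a must-read for anyone claiming to "know" regular expressions. (And when I say "know", I mean it in the Neo "I know Kung-Fu!" sense :)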
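For anyone unfamiliar with the unrolled form, the sketch below compares it with the plain alternation it replaces. The sample strings are made up, and this only checks that both forms match the same text; it makes no timing claims.

```python
import re

# Alternation form: every character goes through a (normal | special) choice.
ALTERNATION = re.compile(r"(?:[^#\\]|\\.)*")
# Unrolled form used for $4 above: normal* (special normal*)*
UNROLLED = re.compile(r"[^#\\]*(?:\\.[^#\\]*)*")

samples = [r"value \# still value # comment", r"value \\# comment", "plain value"]
for text in samples:
    # Both forms stop at the first unescaped hash (or at end of string).
    assert ALTERNATION.match(text).group() == UNROLLED.match(text).group()
    print(repr(UNROLLED.match(text).group()))
```

In the second sample the two backslashes are consumed as one escape pair, so the hash after them is unescaped and ends the value.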

Answer 4

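I've already left a comment about the purpose of this question, but assuming the question is purely about regex, I'll still give an answer.

Assuming you are dealing with the input one line at a time, I would do it as a two-pass process. This means you will have 2 regexes.

(.*?(?<!\\))#(.*) : first, split at the first # that is not preceded by \ (see the documentation on negative look-behind);
then parse the assignment statement expression.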
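The two passes could be sketched as follows. The first pattern is the one given above; the second pattern (ASSIGNMENT) and the function name parse are my own assumptions, since the answer does not spell them out.

```python
import re

# Pass 1: pattern from the answer, split at the first # not preceded by a backslash.
COMMENT = re.compile(r"(.*?(?<!\\))#(.*)")
# Pass 2: a hypothetical assignment pattern, not taken from the answer.
ASSIGNMENT = re.compile(r"\s*(\S+)\s*=\s*(.*?)\s*$")

def parse(line):
    """Return (key, value, comment); key and value are None without an assignment."""
    first = COMMENT.match(line)
    body, comment = first.groups() if first else (line, "")
    second = ASSIGNMENT.match(body)
    key, value = second.groups() if second else (None, None)
    return key, value, comment

print(parse(r"someoption3 = some value with a \# hash # some comment"))
print(parse("# this line is only a comment"))
print(parse("someoption1 = some value"))
```

The comment is returned without its leading #, and the value still contains the backslash of any escaped hash.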

Answer 5

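I wouldn't use a regex at all, for the same reason I wouldn't try to kill a fly with a thermonuclear warhead.

Assuming you're reading one line at a time, just:

If the first character is #, set the comment to the whole line and empty the line.
Otherwise, find the first occurrence of # that is not immediately after a \, set the comment to that # plus the rest of the line, and set the line to everything before it.
Replace all occurrences of \# in the line with #.

That's it: you now have a proper line and a comment section. By all means use a regex to split the new line section.

For example: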

import re

def fn(line):
    # Split line into non-comment and comment.

    comment = ""
    if line.startswith("#"):
        comment = line
        line = ""
    else:
        idx = re.search(r"[^\\]#", line)
        if idx is not None:
            comment = line[idx.start()+1:]
            line = line[:idx.start()+1]

    # Split non-comment into key and value.

    idx = re.search(r"=", line)
    if idx is None:
        key = line
        val = ""
    else:
        key = line[:idx.start()]
        val = line[idx.start()+1:]
    val = val.replace ("\\#", "#")

    return (key.strip(),val.strip(),comment.strip())

print(fn(r"someoption1 = some value # some comment"))
print(fn(r"# this line is only a comment"))
print(fn(r"someoption2 = some value with an escaped \# hash"))
print(fn(r"someoption3 = some value with a \# hash # some comment"))

This produces:

('someoption1', 'some value', '# some comment')
('', '', '# this line is only a comment')
('someoption2', 'some value with an escaped # hash', '')
('someoption3', 'some value with a # hash', '# some comment')

If you must use a regex (against my advice), your specific problem lies here:

[^\#]

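This (assuming you meant the properly escaped r"[^\\#]") will try to match any single character other than \ or #, not the sequence \# as you intended. You can use a negative look-behind to do this, but I always say that once a regex becomes unreadable to someone in a hurry, it's best to revert to procedural code :-)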
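A small sketch of the difference, using made-up sample strings: the character class consumes single characters, while a negative look-behind can pick out only the unescaped hash.

```python
import re

# The class matches one character at a time: anything except \ or #.
single_chars = re.findall(r"[^\\#]", r"a \# b")
print(single_chars)

# A negative look-behind reports only hashes not preceded by a backslash.
unescaped = [m.start() for m in re.finditer(r"(?<!\\)#", r"a \# b # c")]
print(unescaped)
```

The first call never treats \# as a unit; it simply skips both characters. The second finds only the position of the last hash.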

On reflection, a better way is to use a multi-level split (so the regex doesn't have to get too hideous handling missing fields), as follows:

def fn(line):
    line = line.strip()                                    # remove spaces
    first = re.split(r"\s*(?<!\\)#\s*", line, maxsplit=1)  # get non-comment/comment
    if len(first) == 1: first.append("")                   # ensure we have a comment
    first[0] = first[0].replace("\\#", "#")                # unescape non-comment

    second = re.split(r"\s*=\s*", first[0], maxsplit=1)    # get key and value
    if len(second) == 1: second.append("")                 # ensure we have a value
    second.append(first[1])                                # create 3-element list
    return second                                          # and return it

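This uses a negative look-behind to correctly match the comment separator, then splits the non-comment part into key and value. Whitespace is handled properly here as well, resulting in: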

['someoption1', 'some value', 'some comment']
['', '', 'this line is only a comment']
['someoption2', 'some value with an escaped # hash', '']
['someoption3', 'some value with a # hash', 'some comment']

Answer 6

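Try breaking it down into 2 steps:

Escape processing to recognise true comments (the first # not preceded by \ (hint: "negative look-behind")), remove the true comment, then replace r"\#" with "#"
Process the comment-less remainder.

Big hint: use re.VERBOSE with comments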
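The first step might look something like this with re.VERBOSE. The names TRUE_COMMENT and strip_comment are invented for this sketch, and the second step is left as a plain str.partition on "=", which is only one possible choice.

```python
import re

# Step 1 pattern, written with re.VERBOSE so it can carry its own comments.
# In verbose mode a literal hash has to be escaped as \# inside the pattern.
TRUE_COMMENT = re.compile(r"""
    (?<!\\)   # not preceded by a backslash (negative look-behind)
    \#        # a literal hash starts the true comment
    .*        # the comment runs to the end of the line
    """, re.VERBOSE)

def strip_comment(line):
    """Return (remainder, comment) with \\# unescaped to # in the remainder."""
    match = TRUE_COMMENT.search(line)
    comment = match.group() if match else ""
    remainder = line[:match.start()] if match else line
    return remainder.replace(r"\#", "#"), comment

remainder, comment = strip_comment(r"someoption3 = some value with a \# hash # some comment")
# Step 2 (assumed): split the comment-less remainder on the first "=".
key, _, value = remainder.partition("=")
print(key.strip(), "|", value.strip(), "|", comment)
```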
