Python: How to find all matches in a multi-line string that do not start with a specific word?

Question

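I have SQL code, and I want to extract the table name that follows the "insert" keyword.

Basically, I want to extract it using the following rules:

The text contains the word "insert"
Optionally followed by the word "into"
Exclude the match if "--" (a single-line comment in SQL) appears anywhere on the line before the insert [into] keyword
Exclude the match if the insert [into] keyword lies between "/*" and "*/" (a multi-line comment in SQL)
After the insert [into] keyword, take the next word (table_name)

Example: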

import re

lines = """begin insert into table_1 end
    begin insert table_2 end   
    select 1 --This is will not insert into table_3
    begin insert into
        table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

p = re.compile( r'^((?!--).)*\binsert\b\s+(?:into\s*)?.*', flags=re.IGNORECASE | re.MULTILINE)
for m in re.finditer( p, lines ):
    line = lines[m.start(): m.end()].strip()

    starts_with_insert = re.findall('insert.*', line, flags=re.IGNORECASE|re.MULTILINE|re.DOTALL)
    print(re.compile(r'insert\s+(?:into\s+)?', flags=re.IGNORECASE|re.MULTILINE|re.DOTALL).split(' '.join(starts_with_insert))[1].split()[0])

Actual result:

table_1
table_2
table_4
table_5
table_6

Expected result: table_5 should not be returned, because it lies between /* and */

table_1
table_2
table_4
table_6

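Is there an elegant way to do this?

Thanks in advance.

Edit: Thanks for the solutions. Is it possible to do this with a pure regular expression, without removing lines from the original text?

I want to display the line number at which each table name is found in the original string.

The updated code is below: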

import re

lines = """begin insert into table_1 end
    begin insert table_2 end   
    select 1 --This is will not insert into table_3
    begin insert into
        table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

p = re.compile( r'^((?!--).)*\binsert\b\s+(?:into\s*)?.*', flags=re.IGNORECASE | re.MULTILINE)
for m in re.finditer( p, lines ):
    line = lines[m.start(): m.end()].strip()
    line_no = str(lines.count("\n", 0, m.end()) + 1).zfill(6)

    table_names = re.findall(r'(?:\binsert\s*(?:into\s*)?)(\S+)', line, flags=re.IGNORECASE|re.MULTILINE|re.DOTALL)
    print('[line number: ' + line_no + '] ' + '; '.join(table_names))

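I tried using lookahead / lookbehind to exclude the matches between /* and */, but that did not produce the result I expected.

I would appreciate your help. Thank you!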

Answer 1

Use the re.sub() and re.findall() functions:

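import re

# `lines` is the multi-line SQL string from the question.
# removing single line/multiline comments
# Note: flags must be passed by keyword, because the 4th positional argument of re.sub() is `count`.
# re.S is omitted so that `.` stays within a single line.
stripped_lines = re.sub(r'/\*[\s\S]+?\*/\s*|.*--.*(?=\binsert).*\n?', '', lines, flags=re.I)

# extracting table names preceded by `insert` statement
tbl_names = re.findall(r'(?:\binsert\s*(?:into\s*)?)(\S+)', stripped_lines, re.I)
print(tbl_names)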

Output:

['table_1', 'table_2', 'table_4', 'table_6']
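
Regarding the follow-up in the question (a single regular expression, with line numbers, without removing anything from the original text), one possible approach is to let the pattern match comments as well and simply skip those matches. The following is a minimal sketch, not part of the original answer: the names SQL, TOKEN and find_insert_tables are hypothetical, and it assumes that "--" and "/*" never occur inside SQL string literals.

```python
import re

SQL = """begin insert into table_1 end
    begin insert table_2 end
    select 1 --This is will not insert into table_3
    begin insert into
        table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

# Alternation order matters: comments are tried first and consumed whole,
# so an `insert` inside a comment is never reached by the last branch.
TOKEN = re.compile(
    r"""
      /\*.*?\*/                                 # block comment (may span lines)
    | --[^\n]*                                  # single-line comment
    | \binsert\s+(?:into\s+)?(?P<table>\S+)     # insert [into] <table>
    """,
    flags=re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def find_insert_tables(sql):
    """Return (line_number, table_name) pairs for inserts outside comments."""
    results = []
    for m in TOKEN.finditer(sql):
        table = m.group("table")
        if table is None:  # the match was a comment, so skip it
            continue
        # 1-based line number of the table name in the original text
        line_no = sql.count("\n", 0, m.start("table")) + 1
        results.append((line_no, table))
    return results


for line_no, table in find_insert_tables(SQL):
    print("[line number: %06d] %s" % (line_no, table))
```

For the sample text this reports table_1, table_2, table_4 and table_6 on lines 1, 2, 5 and 10. The line number is taken at the table name rather than at the insert keyword, which is why table_4 is reported on line 5.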

Answer 2

import re

lines = """begin insert into table_1 end
    begin insert table_2 end
    select 1 --This is will not insert into table_3
    begin insert into
        table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

# remove all /* */ and -- comments
comments = re.compile(r'/\*[\s\S]*?\*/|--.*')
lines = comments.sub('', lines)

# capture the word after `insert`, with an optional `into` in between
full_set = re.compile(r'insert\s+(?:into\s+)?(\S+)', flags=re.IGNORECASE)
print(full_set.findall(lines))

This gives:

['table_1', 'table_2', 'table_4', 'table_6']
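
Removing comments changes the number of lines, so line numbers computed afterwards no longer match the original text. If line numbers are needed, as in the question's edit, a variation is to overwrite each comment with spaces while keeping its newlines. This is only a sketch under the same assumption that comment markers never occur inside string literals; COMMENT, INSERT, blank_comments and tables_with_lines are hypothetical names, not part of the answer above.

```python
import re

COMMENT = re.compile(r"/\*.*?\*/|--[^\n]*", flags=re.DOTALL)
INSERT = re.compile(r"\binsert\s+(?:into\s+)?(\S+)", flags=re.IGNORECASE)


def blank_comments(sql):
    """Overwrite comments with spaces; newlines are kept, so offsets
    and line numbers stay the same as in the original text."""
    def _blank(match):
        return re.sub(r"[^\n]", " ", match.group(0))
    return COMMENT.sub(_blank, sql)


def tables_with_lines(sql):
    """Return (line_number, table_name) pairs found outside comments."""
    masked = blank_comments(sql)
    return [
        (masked.count("\n", 0, m.start(1)) + 1, m.group(1))
        for m in INSERT.finditer(masked)
    ]


sample = "insert into a\n/* insert into b\n*/ insert c -- insert d\ninsert\n  into e\n"
print(tables_with_lines(sample))
```

Because the masked string has the same length as the input, any match offset can also be used to slice the original text, for example to print the offending line.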
