
Python: How to find all matches in a multiline string but not preceded by a particular word?

I have SQL code and I would like to extract the table name after the "insert" keyword.

Basically, I would like to extract using the following rules:

  1. The statement contains the word "insert".
  2. It is optionally followed by the word "into".
  3. Exclude the match if "--" (a single-line comment in SQL) appears anywhere before the insert [into] keyword on the same line.
  4. Exclude the match if the insert [into] keyword is between "/*" and "*/" (a multiline comment in SQL).
  5. Get the next word (table_name) after the insert [into] keyword.

Example:

import re

lines = """begin insert into table_1 end
    begin insert table_2 end   
    select 1 --This is will not insert into table_3
    begin insert into
        table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

p = re.compile( r'^((?!--).)*\binsert\b\s+(?:into\s*)?.*', flags=re.IGNORECASE | re.MULTILINE)
for m in re.finditer( p, lines ):
    line = lines[m.start(): m.end()].strip()

    starts_with_insert = re.findall('insert.*', line, flags=re.IGNORECASE|re.MULTILINE|re.DOTALL)
    print(re.compile(r'insert\s+(?:into\s+)?', flags=re.IGNORECASE|re.MULTILINE|re.DOTALL).split(' '.join(starts_with_insert))[1].split()[0])

Actual Result:

table_1
table_2
table_4
table_5
table_6

Expected Result: table_5 should not be returned, since it is between /* and */

table_1
table_2
table_4
table_6

Is there an elegant way to do this?

Thanks in advance.

EDIT: Thanks for your solutions. Is it possible to use a pure regex, without stripping lines from the original text?

I would like to display the line number where each table name is found in the original string.

Updated code below:

import re

lines = """begin insert into table_1 end
    begin insert table_2 end   
    select 1 --This is will not insert into table_3
    begin insert into
        table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

p = re.compile( r'^((?!--).)*\binsert\b\s+(?:into\s*)?.*', flags=re.IGNORECASE | re.MULTILINE)
for m in re.finditer( p, lines ):
    line = lines[m.start(): m.end()].strip()
    line_no = str(lines.count("\n", 0, m.end()) + 1).zfill(6)

    table_names = re.findall(r'(?:\binsert\s*(?:into\s*)?)(\S+)', line, flags=re.IGNORECASE|re.MULTILINE|re.DOTALL)
    print('[line number: ' + line_no + '] ' + '; '.join(table_names))

I tried using lookahead/lookbehind to exclude the matches between /* and */, but that does not produce my expected result.

I would appreciate your help. Thanks!

In 2 steps, with the re.sub() and re.findall() functions:

# removing single-line and multiline comments (non-greedy, so that two
# separate /* ... */ blocks are not merged into one match)
stripped_lines = re.sub(r'/\*[\s\S]*?\*/|--.*', '', lines)

# extracting table names preceded by `insert` statement 
tbl_names = re.findall(r'(?:\binsert\s*(?:into\s*)?)(\S+)', stripped_lines, re.I)
print(tbl_names)

The output:

['table_1', 'table_2', 'table_4', 'table_6']

import re

lines = """begin insert into table_1 end
    begin insert table_2 end
    select 1 --This is will not insert into table_3
    begin insert into
        table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

# remove all /* */ and -- comments; DOTALL lets a block comment span lines,
# and [^\n]* keeps a -- comment from running past the end of its line
comments = re.compile(r'/\*.*?\*/|--[^\n]*', flags=re.DOTALL)
lines = comments.sub('', lines)

# "into" is optional, so it is matched at most once
full_set = re.compile(r'insert\s+(?:into\s+)?(\S+)', flags=re.IGNORECASE)
print(full_set.findall(lines))

gives

['table_1', 'table_2', 'table_4', 'table_6']
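
For the follow-up in the EDIT (a pure regex, with line numbers taken from the original string), one possible sketch is a single pass in which comments and insert statements are alternatives of the same pattern. Because the comment branches come first, a comment is consumed as a whole and the insert branch never gets to look inside it, so nothing has to be stripped and the match offsets still refer to the original text. The names SQL, TOKEN and find_insert_targets are made up for this illustration, and Python 3 is assumed.

```python
import re

SQL = """begin insert into table_1 end
    begin insert table_2 end
    select 1 --This is will not insert into table_3
    begin insert into
        table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

# Alternatives are tried left to right at each position, so a comment is
# matched in full before the insert branch can match anything inside it.
TOKEN = re.compile(
    r"""
      /\*.*?\*/                               # block comment, may span lines
    | --[^\n]*                                # single-line comment
    | \binsert\s+(?:into\s+)?(?P<table>\S+)   # insert [into] <table>
    """,
    flags=re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def find_insert_targets(sql):
    """Return (line_number, table_name) pairs for inserts outside comments."""
    targets = []
    for m in TOKEN.finditer(sql):
        if m.group("table") is None:
            continue  # this match was a comment, not an insert
        # 1-based line of the table name, counted in the untouched input
        line_no = sql.count("\n", 0, m.start("table")) + 1
        targets.append((line_no, m.group("table")))
    return targets


for line_no, table in find_insert_targets(SQL):
    print("[line number: %06d] %s" % (line_no, table))
```

The reported line is the one holding the table name, so table_4 is reported on line 5 rather than on the line of its insert keyword. The sketch does not understand SQL string literals: a "--" or "/*" inside a quoted string would still be treated as the start of a comment.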
