
Python regex search for lines starting with certain characters within substrings

I'd like to use regex to search for lines starting with certain characters within substrings. I have a SQL string -

qry = ''' 
with 
qry_1 as ( -- some text
   SELECT ID, 
          NAME
   FROM   ( ... other code...
),
qry_2 as ( 
    SELECT coalesce (table1.ID, table2.ID) as ID,
           NAME
   FROM (...other code...
),
qry_3 as (
-- some text
     SELECT id.WEATHER AS WEATHER_MORN,
            ROW_NUMBER() OVER(PARTITION BY id.SUN
                ORDER BY id.TIME) AS SUN_TIME,
            id.RAIN,
            id.MIST
   FROM (...other code..
-- some other text
)
'''

I'm able to extract subquery information through re.findall here -

sub = re.findall(r'(.+?) as \(', qry, flags=re.IGNORECASE)

Here sub is ['qry_1', 'qry_2', 'qry_3']. I'd like to extract any lines starting with the characters -- within each of the subqueries identified in sub. Something like this, which I got help with here, works for string values -

# search substring between strings 
params = [re.findall(r'^\w+|(?:--)|(?<=\.)(?:--)', i)
     for i in re.findall(r'\w+\sas\s\([\s\w\.,\n]+', qry, flags=re.IGNORECASE)]
dict_result = {a:None if not b else b for a, *b in params}

dict_result = dict([(k,dict_result[k]) for k in sub])
dict_result

But how do I incorporate the lines that start with the special characters -- ? The output should look like this -

{'qry_1' : 'some text', 'qry_2': 'None', 'qry_3': 'some text, some other text'}

Thank you for guidance here

For the example data, one option is to capture the part before as ( in group 1, and in group 2 capture all the lines that follow, as long as they do not themselves contain as ( .

^(.+?) as \((.*(?:\n(?!.* as \().*)*)\n\)
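  • ^ Start of a line (the pattern is used with re.MULTILINE)
  • (.+?) Capture group 1, match 1 or more characters, as few as possible
  • as \( Match a space followed by as (
  • ( Capture group 2
    • .* Match the rest of the line
    • (?:\n(?!.* as \().*)* Optionally repeat matching a newline and the rest of that line, as long as the line does not contain as (
  • ) Close group 2
  • \n\) Match a newline and )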
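To see what the two groups hold, here is a minimal sketch that runs the pattern on a shortened sample. The sample string and the variable names are made up for illustration and are not the query from the question.

```python
import re

# Pattern from the answer; re.MULTILINE lets ^ match at the start of every line.
pattern = re.compile(r"^(.+?) as \((.*(?:\n(?!.* as \().*)*)\n\)", re.MULTILINE)

# Hypothetical, shortened sample with two subqueries.
sample = (
    "with\n"
    "a as ( -- first\n"
    "   SELECT 1\n"
    "),\n"
    "b as (\n"
    "   SELECT 2\n"
    ")\n"
)

# Group 1 is the subquery name, group 2 is everything up to the closing ")" line.
groups = [(m.group(1), m.group(2)) for m in pattern.finditer(sample)]
for name, body in groups:
    print(repr(name), repr(body))
```

Group 2 stops before the line holding the closing parenthesis, because the pattern backtracks until \n\) can match.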

Then you could use group 1 as the key of the dict. Run re.findall on the value of group 2 to find the strings that start with -- , using a capture group for the text that follows; re.findall returns that captured text.

import re

regex = r"^(.+?) as \((.*(?:\n(?!.* as \().*)*)\n\)"
dict_result = {}
s = "the example string here"

for tup in re.findall(regex, s, re.MULTILINE):
    matches = re.findall(r"-- (.*)", tup[1])
    dict_result[tup[0]] = matches if len(matches) > 0 else None

print(dict_result)

Output

{'qry_1': ['some text'], 'qry_2': None, 'qry_3': ['some text', 'some other text']}
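The question asked for comma-separated strings rather than lists. A possible variation, as a sketch only: the helper name extract_comments and the two module-level pattern names are made up here, re.IGNORECASE is added on the assumption that AS ( may appear in upper case, and the query is the one from the question with trailing spaces removed. It returns None rather than the string 'None' for a subquery without comments.

```python
import re

# Same pattern as above; IGNORECASE is an added assumption so "AS (" also matches.
CTE_PATTERN = re.compile(
    r"^(.+?) as \((.*(?:\n(?!.* as \().*)*)\n\)",
    re.MULTILINE | re.IGNORECASE,
)
# Capture the text after "--" up to the end of the line.
COMMENT_PATTERN = re.compile(r"--\s*(.*)")


def extract_comments(sql):
    """Map each subquery name to its comments joined by ', ', or None."""
    result = {}
    for name, body in CTE_PATTERN.findall(sql):
        comments = [c.strip() for c in COMMENT_PATTERN.findall(body)]
        result[name] = ", ".join(comments) if comments else None
    return result


qry = '''
with
qry_1 as ( -- some text
   SELECT ID,
          NAME
   FROM   ( ... other code...
),
qry_2 as (
    SELECT coalesce (table1.ID, table2.ID) as ID,
           NAME
   FROM (...other code...
),
qry_3 as (
-- some text
     SELECT id.WEATHER AS WEATHER_MORN,
            ROW_NUMBER() OVER(PARTITION BY id.SUN
                ORDER BY id.TIME) AS SUN_TIME,
            id.RAIN,
            id.MIST
   FROM (...other code..
-- some other text
)
'''

print(extract_comments(qry))
# {'qry_1': 'some text', 'qry_2': None, 'qry_3': 'some text, some other text'}
```

Note that this treats every -- as the start of a comment, so a -- inside a SQL string literal would also be picked up.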


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please cite the site URL or the original address.

 