
Python re.findall regex and text processing

I'm looking to find and modify some SQL syntax around the convert function. Basically, I want every convert(A,B) or CONVERT(A,B) in all my files to be selected and converted to B::A.

So far I have tried selecting them with re.findall(r"\bconvert\b\(.*?,.*\)", l, re.IGNORECASE), but it only returns a small selection of what I want, and I also have trouble actually manipulating the A/B I mentioned.
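A short check of why that pattern finds so little, using the sample line from this question: the greedy `.*` before `\)` runs to the last closing parenthesis on the line, so the nested calls end up inside one outer match.

```python
import re

# Sample line taken from the question.
line = ("convert(varchar, '/' || convert(nvarchar, es.Item_ID) || ':' || "
        "convert(nvarchar, o.Option_Number) || '/') as LocPath")

pattern = r"\bconvert\b\(.*?,.*\)"
matches = re.findall(pattern, line, re.IGNORECASE)

# re.findall returns non-overlapping matches, and the single outer match
# already covers both inner convert(...) calls.
print(len(matches))
print(matches[0])
```

Here the list holds a single string that spans the whole outer call, which is why the inner calls never show up as separate results.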

For example, a sample line (note that the nested structure here is irrelevant; I'm only trying to get the outer layer working if possible):

convert(varchar, '/' || convert(nvarchar, es.Item_ID) || ':' || convert(nvarchar, o.Option_Number) || '/') as LocPath

...should become...

'/' || es.Item_ID::nvarchar || ':' || o.Option_Number::nvarchar || '/' :: varchar as LocPath

Example 2:

SELECT LocationID AS ItemId, convert(bigint, -1),

...should become...

SELECT LocationID AS ItemId, -1::bigint,

I think this should be possible with some kind of re.sub with groups. I currently have this code structure inside a for-each loop, where line is each line in the file:

matchConvert = ["convert(", "CONVERT("]
a = next((a for a in matchConvert if a in line), False)
if a:
    print("convert() line")
    #line = re.sub(re.escape(a) + r'', '', line)

Edit: In the end I went with a non-re solution and handled each line by identifying each block and manipulating it accordingly.

The task:

Swap the parameters of all the 'convert' functions in the given string. Parameters can contain any character, including nested 'convert' functions.

A solution:

def convert_py(s):
    #capturing start:
    left=s.index('convert')
    start=s[:left]
    #capturing part_1:
    c=0
    line=''
    for n1,i in enumerate(s[left+8:],start=len(start)+8):
        if i==',' and c==0:
            part_1=line
            break
        if i==')':
            c-=1
        if i=='(':
            c+=1
        line+=i
    #capturing part_2:
    c=0
    line=''
    for n2,i in enumerate(s[n1+1:],start=n1+1):
        if i==')':
            c-=1
        if i=='(':
            c+=1
        if c<0:
            part_2=line
            break
        line+=i
    #capturing end:
    end=s[n2+1:]
    #capturing result:
    result=start+part_2.lstrip()+' :: '+part_1+end
    return result

def multi_convert_py(s):
    converts=s.count('convert')
    for n in range(converts):
        s=convert_py(s)
    return s

Notes:

  • Unlike the solution based on the re module, which is presented in another answer, this version should not fail if there are more than two parameters in the 'convert' function in the given string. However, it swaps only around the first top-level comma, for example: convert(a,b, c) --> b, c :: a
  • I am afraid that unforeseen cases may arise that will lead to failure. Please tell me if you find any flaws.
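The core idea in the code above is the depth counter c, which lets a comma count only when it is outside any nested parentheses. A minimal sketch of that idea in isolation follows; the helper name split_top_level is hypothetical and not part of the answer, and it deliberately ignores commas inside quoted strings.

```python
def split_top_level(args):
    """Split an argument string on commas that are not inside parentheses.

    Hypothetical helper for illustration; string literals are not handled.
    """
    parts, current, depth = [], [], 0
    for ch in args:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            # A comma at depth 0 ends the current argument.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


print(split_top_level("a,b, c"))
print(split_top_level("varchar(10, 2), f(x, y)"))
```

Counting the returned parts is one way to detect calls with more than two parameters before deciding whether to swap anything.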

If I understood the task correctly, here is one solution:

import re

string="convert(varchar(max), "\
       "'/' || convert(nvarchar, es.Item_ID) "\
       "|| ':' || convert(nvarchar, o.Option_Number) || '/') "\
       "as LocPath"

start,mid_1,mid_2,end=re.search(r'''(\w+?\()
                                    (.+)(?<=\)),(.+)
                                    (\).*)''',string,re.X).groups()
result=start+mid_2.lstrip()+':: '+mid_1+end

start is the first group (\w+?\() with the name of the function and the opening '('

mid_1 is the second group (.+) , containing the first parameter

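Then follows (?<=\)), which marks the ',' where the groups are divided: the lookbehind (?<=\)) requires that this comma come directly after a ')', as it does after varchar(max) in this example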

mid_2 is the third group (.+) with the second parameter

end is the fourth group (\).*) with the ') as LocPath' part

Then I merge the string, but swap mid_1 and mid_2. I put ':: ' between them and remove spaces on the left from mid_2.

The code works in this example, but there may still be unforeseen flaws when it comes to other examples. Please tell me if you find any mistakes.
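As a variation on the group-swapping idea, here is a rough sketch that uses re.sub with backreferences and repeats it until nothing changes, so innermost calls are rewritten first. The names INNERMOST and swap_innermost are my own. It assumes each argument contains no parentheses or commas (so a type such as varchar(max) is not handled), and it adds no parentheses around the expression, so operator precedence in the result should be checked by hand.

```python
import re

# Matches a convert(...) call whose two arguments contain no parentheses
# or commas, i.e. an innermost call with simple arguments.
INNERMOST = re.compile(
    r"\bconvert\(\s*([^(),]+?)\s*,\s*([^(),]+?)\s*\)",
    re.IGNORECASE,
)


def swap_innermost(sql):
    """Rewrite convert(A, B) as B::A, working from the inside out."""
    previous = None
    while previous != sql:
        previous = sql
        # \2 is the expression B, \1 is the target type A.
        sql = INNERMOST.sub(r"\2::\1", sql)
    return sql


print(swap_innermost("SELECT LocationID AS ItemId, convert(bigint, -1),"))
print(swap_innermost(
    "convert(varchar, '/' || convert(nvarchar, es.Item_ID) || ':' || "
    "convert(nvarchar, o.Option_Number) || '/') as LocPath"
))
```

Each pass removes at least one convert( call, so the loop terminates once no innermost call is left.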

This may be an X/Y problem, meaning you're asking how to do something with Regex that may be better solved with parsing (meaning using/modifying/writing a SQL parser). An indication that this is the case is the fact that “convert” calls can be nested. I'm guessing Regex is going to be more of a headache than it's worth here in the long run if you're working with a lot of files and they're at all complicated.
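To illustrate what a parsing-style approach might look like, far short of a real SQL parser, here is a small recursive sketch. All names in it (CALL, SIMPLE, scan_call, rewrite_converts) are hypothetical. It assumes single-quoted string literals, leaves calls alone unless they have exactly two arguments, and does not guard against the text convert( appearing inside a string literal.

```python
import re

CALL = re.compile(r"\bconvert\s*\(", re.IGNORECASE)
# Expressions that are safe to cast without wrapping them in parentheses.
SIMPLE = re.compile(r"[\w.]+|'[^']*'|-?\d+(?:\.\d+)?")


def scan_call(sql, open_idx):
    """Return (top-level comma indexes, index of the matching ')')."""
    depth, in_quote, commas = 0, False, []
    for i in range(open_idx, len(sql)):
        ch = sql[i]
        if ch == "'":
            in_quote = not in_quote  # a doubled '' toggles twice
        elif in_quote:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return commas, i
        elif ch == "," and depth == 1:
            commas.append(i)
    raise ValueError("unbalanced parentheses")


def rewrite_converts(sql):
    """Rewrite every two-argument convert(A, B) as B::A, recursively."""
    match = CALL.search(sql)
    if match is None:
        return sql
    open_idx = match.end() - 1
    commas, close_idx = scan_call(sql, open_idx)
    if len(commas) != 1:
        # Not exactly two arguments: keep the call, process what follows.
        return sql[:match.end()] + rewrite_converts(sql[match.end():])
    target = sql[open_idx + 1:commas[0]].strip()
    expr = rewrite_converts(sql[commas[0] + 1:close_idx].strip())
    if not SIMPLE.fullmatch(expr):
        expr = "(" + expr + ")"  # keep the cast applied to the whole expression
    rest = rewrite_converts(sql[close_idx + 1:])
    return sql[:match.start()] + expr + "::" + target + rest


print(rewrite_converts("SELECT LocationID AS ItemId, convert(bigint, -1),"))
```

Because the scanner tracks depth and quotes itself, it copes with types such as varchar(max) and with commas inside string literals, which the regex-only approaches struggle with.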

Here's my solution based on @Иван-Балван's code. Breaking this structure into blocks makes further specification a lot easier than I previously thought, and I'll be using this method for a lot of other operations as well.

# Check for balanced brackets
def checkBracket(my_string):
    count = 0
    for c in my_string:
        if c == "(":
            count+=1
        elif c == ")":
            count-=1
    return count


# Modify the first convert in line
# Based on suggestions from stackoverflow.com/questions/73040953
def modifyConvert(l):
    # find the location of convert()
    count = l.index('convert(')

    # select the group before convert() call
    before = l[:count]

    group=""
    n1=0
    n2=0
    A=""
    B=""
    operate = False
    operators = ["|", "<", ">", "="]
    # look for A group before comma
    for n1, i in enumerate(l[count+8:], start=len(before)+8):
        # find current position in l
        checkIndex = checkBracket(l[count+8:][:n1-len(before)-8])
        if i == ',' and checkIndex == 0:
            A = group
            break
        group += i

    # look for B group after comma
    group = ""
    for n2, i in enumerate(l[n1+1:], start=n1+1):
        checkIndex = checkBracket(l[count+n1-len(before):][:n2-n1+1])
        if i == ',' and checkIndex == 0:
            return l
        elif checkIndex < 0:
            B = group
            break
        group += i
        
        # mark operators
        if i in operators:
            operate = True

    # select the group after convert() call
    after = l[n2+1:]

    # (B) if it contains operators
    if operate:
        return before + "(" + B.lstrip() + ') :: ' + A + after
    else:
        return before + B.lstrip() + '::' + A + after


# Modify cast syntax with convert(a,b). return line.
def convertCast(l):

    # Call helper for nested cases
    i = l.count('convert(')
    while i>0:
        i -= 1
        l = modifyConvert(l)

    return l

