
Python RegEx nested search and replace

I need to do a regex search and replace on all commas found inside quoted blocks.
For example,

"thing1,blah","thing2,blah","thing3,blah",thing4  

needs to become

"thing1\,blah","thing2\,blah","thing3\,blah",thing4  

my code:

import re

# inFileName is assumed to be defined elsewhere
inFile = open(inFileName, 'r')
inFileRl = inFile.readlines()
inFile.close()

p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found a quoted block
    if pg:
        q = re.compile(r'[^\\],')
        # found an unescaped comma within the quoted block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print re.sub(r'([^\\])\,',r'\1\,',pg.group(0))
            pass

I need to filter only the columns I want based on a RegEx, filter further,
then do the RegEx replace, then reconstitute the line back.

How can I do this in Python?

The csv module is well suited to parsing data like this: csv.reader in the default dialect treats commas inside quoted fields as part of the field rather than as delimiters, and csv.writer puts the quotes back around any field that contains a comma. I used StringIO to give a string a file-like interface.

import csv
import io

s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''
source = io.StringIO(s)
dest = io.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    # un-escape any already-escaped commas first so they are not escaped twice
    wtr.writerow([item.replace('\\,', ',').replace(',', '\\,') for item in row])
print(dest.getvalue())

result:

"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"

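One caveat about the csv round trip: with the writer's default quoting mode (csv.QUOTE_MINIMAL), only fields that need quoting are quoted on output, so a field that was quoted in the input but contains no comma comes back without its quotes. A minimal Python 3 sketch of this, where the helper name escape_commas_csv and the sample line are made up for illustration:

```python
import csv
import io


def escape_commas_csv(text):
    """Escape commas inside CSV fields by round-tripping through csv."""
    dest = io.StringIO()
    # lineterminator='\n' is chosen here so the output uses plain newlines;
    # the csv.writer default is '\r\n'
    writer = csv.writer(dest, lineterminator='\n')
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([f.replace('\\,', ',').replace(',', '\\,') for f in row])
    return dest.getvalue()


line = '"thing1,blah","plain",thing4'
# the second field loses its quotes because it no longer needs them
print(escape_commas_csv(line))
```

If the original quoting has to be preserved exactly, a regex-based approach that leaves everything outside the quoted blocks untouched may be a better fit; passing quoting=csv.QUOTE_ALL to csv.writer would instead quote every field.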
General Edit

The question originally contained

"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4

but that line is no longer there.

I also had not noticed the pattern r'[^\\],' in the question's code.

So I have completely rewritten my answer.

I assume that

"thing1,blah","thing2,blah","thing3,blah",thing4

and

"thing1\,blah","thing2\,blah","thing3\,blah",thing4

are the strings as they are displayed when printed, not their repr().

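import re

# '\\,' is an explicit backslash followed by a comma
ss = '"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

# a quoted block: opening quote, any non-quote characters, closing quote
regx = re.compile('"[^"]*"')

# ri matches a comma that is not already preceded by a backslash
def repl(mat, ri = re.compile('(?<!\\\\),') ):
    # the replacement template '\\\\,' produces a backslash followed by a comma
    return ri.sub('\\\\,',mat.group())

print(ss)
print(repr(ss))
print()
print(regx.sub(repl, ss))
print(repr(regx.sub(repl, ss)))

result

"thing1,blah","thing2,blah","thing3\,blah",thing4 
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

"thing1\,blah","thing2\,blah","thing3\,blah",thing4 
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '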
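Because of the negative lookbehind, a comma that is already escaped is left alone, so this kind of substitution can be run more than once on the same text without doubling the backslashes. A small Python 3 sketch of that property; the function name escape_quoted_commas and the module-level pattern names are assumptions for illustration:

```python
import re

QUOTED_BLOCK = re.compile(r'"[^"]*"')
UNESCAPED_COMMA = re.compile(r'(?<!\\),')


def escape_quoted_commas(line):
    """Escape unescaped commas, but only inside double-quoted blocks."""
    def repl(match):
        # r'\\,' is a replacement template that yields a backslash plus a comma
        return UNESCAPED_COMMA.sub(r'\\,', match.group())
    return QUOTED_BLOCK.sub(repl, line)


line = '"thing1,blah","thing2,blah","thing3\\,blah",thing4'
once = escape_quoted_commas(line)
twice = escape_quoted_commas(once)
print(once)
print(once == twice)  # a second pass changes nothing
```

The same function can be applied to each line read from a file, which is the loop the question starts from.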

You can try this regex.


>>> import re
>>> re.sub('(?<!"),(?!")', r"\\,",
...        '"thing1,blah","thing2,blah","thing3,blah",thing4')
# Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4

The logic is to replace a , with \, only when it is neither immediately preceded nor immediately followed by a ". Commas that separate quoted fields always touch a quote, so they are left alone.
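This pattern looks only at the characters next to each comma, not at whether the comma is really inside quotes. A short Python 3 sketch with two inputs, both made up for illustration, where that difference shows; the function name escape_by_neighbours is hypothetical:

```python
import re

PATTERN = re.compile(r'(?<!"),(?!")')


def escape_by_neighbours(line):
    # replace a comma that has no double quote directly before or after it
    return PATTERN.sub(r'\\,', line)


# a comma between two unquoted fields is escaped as well
print(escape_by_neighbours('"x,y",a,b'))
# a comma directly before the closing quote is not escaped
print(escape_by_neighbours('"a,",b'))
```

For data shaped like the sample in the question, where every unquoted field is next to a quoted one and no quoted field ends in a comma, neither case arises.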

I came up with an iterative solution using several regex functions and match-object methods:
finditer(), findall(), group(), start() and end().
There is probably a way to turn all this into a recursive function.
Any takers?

import re

# inFileName and outfileName are assumed to be defined elsewhere
with open(inFileName, 'r') as inFile:
    inFileRl = inFile.readlines()

outfile = open(outfileName, 'w')

p = re.compile(r'["]([^"]*)["]')
q = re.compile(r'([^\\])(,)')
for line in inFileRl:
    pg = p.finditer(line)
    pglen = len(p.findall(line))

    if pglen > 0:
        mpgend = 0

        for i, mpg in enumerate(pg):
            # text before the first quoted block, or between two blocks
            outfile.write(line[mpgend:mpg.start()])

            block = mpg.group(0)
            qg    = q.finditer(block)
            qglen = len(q.findall(block))

            if qglen > 0:
                mqgend = 0
                for j, mqg in enumerate(qg):
                    # text between the previous comma match and this one
                    outfile.write(block[mqgend:mqg.start()])
                    # insert a backslash in front of the comma
                    outfile.write(q.sub(r'\1\\\2', mqg.group(0)))
                    mqgend = mqg.end()
                # rest of the block after the last comma match
                outfile.write(block[mqgend:])
            else:
                outfile.write(block)

            mpgend = mpg.end()

        # rest of the line after the last quoted block
        outfile.write(line[mpgend:])
    else:
        outfile.write(line)

outfile.close()
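As one possible answer to the recursion question, here is a sketch in Python 3 that escapes the first quoted block it finds and then calls itself on the rest of the line. The function name escape_recursive is hypothetical, and the lookbehind pattern for unescaped commas is a substitution of mine rather than the pattern used in the iterative code:

```python
import re

QUOTED_BLOCK = re.compile(r'"[^"]*"')
UNESCAPED_COMMA = re.compile(r'(?<!\\),')


def escape_recursive(text):
    """Escape commas in the first quoted block, then recurse on the rest."""
    match = QUOTED_BLOCK.search(text)
    if match is None:
        # base case: no quoted block left, return the text unchanged
        return text
    head = text[:match.start()]
    block = UNESCAPED_COMMA.sub(r'\\,', match.group(0))
    return head + block + escape_recursive(text[match.end():])


print(escape_recursive('"thing1,blah","thing2,blah","thing3,blah",thing4'))
```

The recursion depth equals the number of quoted blocks on the line, so for very wide rows an iterative loop or a single re.sub() call with a callback is probably the safer choice.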

Have you looked into str.replace()?

str.replace(old, new[, count]): Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

Here is some documentation.

Hope this helps.
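Note that str.replace() on its own has no notion of quoting: applied to the whole line, it escapes the delimiter commas too. A sketch in Python 3 that restricts it to the quoted pieces by splitting on the quote character, assuming (as in the sample data) that fields never contain a double quote and that no comma is already escaped:

```python
line = '"thing1,blah","thing2,blah",thing4'

# applied to the whole line, every comma is escaped, delimiters included
print(line.replace(',', '\\,'))

# splitting on the quote character puts the quoted pieces at odd indexes
pieces = line.split('"')
for i in range(1, len(pieces), 2):
    pieces[i] = pieces[i].replace(',', '\\,')
fixed = '"'.join(pieces)
print(fixed)
```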


 