
Python RegEx nested search and replace

I need to do a RegEx search and replace of all commas found inside quoted blocks,
i.e.

"thing1,blah","thing2,blah","thing3,blah",thing4  

needs to become

"thing1\,blah","thing2\,blah","thing3\,blah",thing4  

My code:

inFile  = open(inFileName,'r')
inFileRl = inFile.readlines()
inFile.close()

p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found comment block
    if pg:
        q  = re.compile(r'[^\\],')
        # found comma within comment block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print re.sub(r'([^\\])\,',r'\1\,',pg.group(0))

I need to select only the columns I want based on a RegEx, filter further,
then do the RegEx replace, then reconstitute the line.

How can I do this in Python?

The csv module is well suited to parsing data like this: csv.reader in the default dialect treats commas inside quoted fields as data rather than as separators, and csv.writer re-inserts the quotes because the fields contain commas. I used StringIO to give a string a file-like interface.

import csv
import StringIO

s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''
source = StringIO.StringIO(s)
dest = StringIO.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    wtr.writerow([item.replace('\\,',',').replace(',','\\,') for item in row])
print dest.getvalue()

Result:

"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"

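The snippet above is written for Python 2 (the StringIO module and the print statement). A rough Python 3 equivalent might look like the sketch below; the function name escape_quoted_commas and the sample string are only illustrative, and the lineterminator is set to '\n' here purely to keep the output easy to read.

```python
import csv
import io

def escape_quoted_commas(text):
    """Escape commas inside CSV fields, leaving field separators untouched."""
    source = io.StringIO(text, newline='')
    dest = io.StringIO(newline='')
    # QUOTE_MINIMAL (the default) re-quotes any field that contains a comma.
    writer = csv.writer(dest, lineterminator='\n')
    for row in csv.reader(source):
        # Undo existing escapes first so already-escaped commas are not doubled.
        writer.writerow([f.replace('\\,', ',').replace(',', '\\,') for f in row])
    return dest.getvalue()

sample = '"thing1,blah","thing2,blah","thing3,blah",thing4\n'
print(escape_quoted_commas(sample), end='')
# prints: "thing1\,blah","thing2\,blah","thing3\,blah",thing4
```

One thing to keep in mind with this approach: the writer decides the quoting itself, so a field that was quoted in the input but contains no comma comes out unquoted.
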
General Edit

There was

"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4   

in the question, and now it is not there anymore.

Moreover, I hadn't noticed r'[^\\],' .

So I have completely rewritten my answer, with

"thing1,blah","thing2,blah","thing3,blah",thing4               

and

"thing1\,blah","thing2\,blah","thing3\,blah",thing4

being the printed forms of the strings (I suppose)

import re


ss = '"thing1,blah","thing2,blah","thing3\,blah",thing4 '

regx = re.compile('"[^"]*"')

def repl(mat, ri = re.compile('(?<!\\\\),') ):
    # escape each comma that is not already preceded by a backslash
    return ri.sub('\\\\,',mat.group())

print ss
print repr(ss)
print
print      regx.sub(repl, ss)
print repr(regx.sub(repl, ss))

Result:

"thing1,blah","thing2,blah","thing3\,blah",thing4 
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

"thing1\,blah","thing2\,blah","thing3\,blah",thing4 
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '

You can try this regex.


>>> re.sub('(?<!"),(?!")', r"\\,", 
                     '"thing1,blah","thing2,blah","thing3,blah",thing4')
#Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4

The logic behind this is to substitute a , with \, if it is neither immediately preceded nor immediately followed by a "
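
To see where this lookaround rule holds and where it does not, here is a small Python 3 sketch. The helper name escape and the second input are made up for illustration; the pattern is the one from the answer above.

```python
import re

# Commas that are neither directly preceded nor directly followed by a quote.
pattern = re.compile(r'(?<!"),(?!")')

def escape(line):
    return pattern.sub(r'\\,', line)

# Works when every separator comma touches a quote on at least one side.
print(escape('"thing1,blah","thing2,blah",thing4'))
# prints: "thing1\,blah","thing2\,blah",thing4

# A separator between two unquoted fields gets escaped as well.
print(escape('"a,b",c,d'))
# prints: "a\,b",c\,d
```

So the pattern never really checks whether a comma is inside a quoted block. It relies on the layout of the data, and it would also skip a comma sitting right next to a quote inside a field, such as the one in ",x".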

I came up with an iterative solution using several regex functions:
finditer(), findall(), group(), start() and end().
There's a way to turn all this into a recursive function that calls itself.
Any takers?

import re

outfile  = open(outfileName,'w')

p = re.compile(r'["]([^"]*)["]')   # a quoted block
q = re.compile(r'([^\\])(,)')      # a comma not preceded by a backslash
# outfileRl: the list of input lines (presumably from readlines(), as in the question)
for line in outfileRl:
    pg = p.finditer(line)
    pglen = len(p.findall(line))

    if pglen > 0:
        mpgend = 0

        for i,mpg in enumerate(pg):
            # text before the first quoted block, or between two blocks
            outfile.write(line[mpgend:mpg.start()])

            qg    = q.finditer(mpg.group(0))
            qglen = len(q.findall(mpg.group(0)))

            if qglen > 0:
                mqgend = 0
                for j,mqg in enumerate(qg):
                    # text before the first comma match, or between two matches
                    outfile.write( mpg.group(0)[mqgend:mqg.start()] )

                    outfile.write( re.sub(r'([^\\])(,)',r'\1\\\2',mqg.group(0)) )

                    if j == (qglen-1):
                        outfile.write( mpg.group(0)[mqg.end():] )

                    mqgend = mqg.end()
            else:
                outfile.write(mpg.group(0))

            if i == (pglen-1):
                outfile.write(line[mpg.end():])

            mpgend = mpg.end()
    else:
        outfile.write(line)

outfile.close()
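
The walk over quoted blocks and the text between them can probably be condensed with re.split, which returns the pieces between matches and, when the pattern has a capturing group, the matched blocks as well. The following is only a Python 3 sketch of that idea, not the code from the answer above; escape_line, QUOTED and UNESCAPED_COMMA are names invented here.

```python
import re

QUOTED = re.compile(r'("[^"]*")')          # capturing group keeps the blocks in the split result
UNESCAPED_COMMA = re.compile(r'(?<!\\),')  # a comma not already preceded by a backslash

def escape_line(line):
    parts = QUOTED.split(line)
    # Odd indices hold the quoted blocks; even indices hold the text between them.
    for i in range(1, len(parts), 2):
        parts[i] = UNESCAPED_COMMA.sub(r'\\,', parts[i])
    return ''.join(parts)

print(escape_line('"thing1,blah","thing2,blah","thing3\\,blah",thing4'))
# prints: "thing1\,blah","thing2\,blah","thing3\,blah",thing4
```

Joining the parts back together reconstitutes the line, so a file can be processed by writing escape_line(line) for each input line. Like the original, this assumes quotes are balanced and that a field never contains an escaped quote.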

Have you looked into str.replace()?

str.replace(old, new[, count]): Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
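
On its own, str.replace cannot tell a comma inside quotes from a field separator, so it would need some help to fit this question. A minimal Python 3 sketch, assuming balanced quotes and no escaped quotes inside fields (the variable names are illustrative):

```python
line = '"thing1,blah","thing2,blah",thing4'

# A plain replace escapes every comma, including the field separators.
print(line.replace(',', '\\,'))
# prints: "thing1\,blah"\,"thing2\,blah"\,thing4

# Splitting on the quote character alternates outside/inside segments,
# so only the odd-numbered segments (inside quotes) need replacing.
segments = line.split('"')
segments[1::2] = [s.replace(',', '\\,') for s in segments[1::2]]
print('"'.join(segments))
# prints: "thing1\,blah","thing2\,blah",thing4
```

Unlike the regex versions with a lookbehind, this would add a second backslash in front of a comma that is already escaped.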

Here is some documentation.

Hope this helps.
