Python RegEx nested search and replace
I need to do a RegEx search and replace all commas found inside quote blocks.
i.e.
"thing1,blah","thing2,blah","thing3,blah",thing4
needs to become
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
My code:
inFile = open(inFileName,'r')
inFileRl = inFile.readlines()
inFile.close()

p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found comment block
    if pg:
        q = re.compile(r'[^\\],')
        # found comma within comment block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print re.sub(r'([^\\])\,',r'\1\,',pg.group(0))
I merely need to filter the column I want based on a RegEx, filter further,
then do the RegEx replace, and then reconstitute the line.
How can I do this in Python?
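The csv module is ideal for parsing data like this, since csv.reader in the default dialect ignores quoted commas. csv.writer reinserts the quotes because the fields contain commas. I used StringIO to give a file-like interface to a string.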
import csv
import io

s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''

source = io.StringIO(s)
dest = io.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    # Undo any existing escape first so that no comma is escaped twice.
    wtr.writerow([item.replace('\\,', ',').replace(',', '\\,') for item in row])
print(dest.getvalue())
Result:
"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"
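General edit
There was
"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4
in the question, and now it is not there anymore.
Moreover, I hadn't noticed r'[^\\],'.
So I have completely rewritten my answer.
"thing1,blah","thing2,blah","thing3,blah",thing4
and
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
are how the strings are displayed when printed (I suppose).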
import re

ss = '"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

# Matches one double-quoted block.
regx = re.compile('"[^"]*"')

def repl(mat, ri=re.compile(r'(?<!\\),')):
    # Escape every comma in the block that is not already preceded by a backslash.
    return ri.sub(r'\\,', mat.group())

print(ss)
print(repr(ss))
print()
print(regx.sub(repl, ss))
print(repr(regx.sub(repl, ss)))
Result
"thing1,blah","thing2,blah","thing3\,blah",thing4
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

"thing1\,blah","thing2\,blah","thing3\,blah",thing4
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '
You can try this regex.
>>> re.sub('(?<!"),(?!")', r"\\,",
...        '"thing1,blah","thing2,blah","thing3,blah",thing4')
# Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4
The logic behind this is to substitute a , with \, if it is neither immediately preceded nor immediately followed by a "
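To make the lookaround logic concrete, here is a small sketch that compiles the same pattern and runs it on two inputs. The variable names are mine, and the second input is an assumption I added to show what happens when two neighbouring fields are both unquoted.

```python
import re

# Match a comma only when the character before it is not a double quote
# and the character after it is not a double quote either.
inner_comma = re.compile(r'(?<!"),(?!")')

line = '"thing1,blah","thing2,blah","thing3,blah",thing4'
escaped = inner_comma.sub(r'\\,', line)
print(escaped)

# Assumed extra case: a separator between two unquoted fields has no
# quote on either side, so the pattern treats it like an inner comma.
unquoted = 'thing4,thing5'
escaped_unquoted = inner_comma.sub(r'\\,', unquoted)
print(escaped_unquoted)
```

So the pattern seems to rely on every separating comma touching at least one quote, which holds for the sample line in the question.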
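I came up with an iterative solution using several regex functions:
finditer(), findall(), group(), start() and end()
There is a way to turn all of this into a recursive function that calls itself.
Any takers?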
import re

outfile = open(outfileName,'w')

p = re.compile(r'["]([^"]*)["]')
q = re.compile(r'([^\\])(,)')
for line in outfileRl:
    pg = p.finditer(line)
    pglen = len(p.findall(line))

    if pglen > 0:
        mpgend = 0

        for i,mpg in enumerate(pg):
            # text before the first quoted block
            if i == 0:
                outfile.write(line[:mpg.start()])

            qg = q.finditer(mpg.group(0))
            qglen = len(q.findall(mpg.group(0)))

            # text between the previous quoted block and this one
            if i > 0 and i < pglen:
                outfile.write(line[mpgend:mpg.start()])

            if qglen > 0:
                mqgend = 0
                for j,mqg in enumerate(qg):
                    if j == 0:
                        outfile.write( mpg.group(0)[:mqg.start()] )
                    else:
                        # text between the previous comma match and this one
                        outfile.write( mpg.group(0)[mqgend:mqg.start()] )

                    outfile.write( re.sub(r'([^\\])(,)',r'\1\\\2',mqg.group(0)) )

                    if j == (qglen-1):
                        outfile.write( mpg.group(0)[mqg.end():] )

                    mqgend = mqg.end()
            else:
                outfile.write(mpg.group(0))

            # text after the last quoted block
            if i == (pglen-1):
                outfile.write(line[mpg.end():])

            mpgend = mpg.end()
    else:
        outfile.write(line)

outfile.close()
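Have you looked into str.replace()?
str.replace(old, new[, count]) returns a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
Here is some documentation.
Hope this helps.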
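A quick sketch of what str.replace() does with the line from the question, with and without count. The variable names are only for illustration.

```python
line = '"thing1,blah","thing2,blah","thing3,blah",thing4'

# Without count, every comma is replaced, including the ones that
# separate the fields.
all_escaped = line.replace(',', '\\,')
print(all_escaped)

# With count=2, only the first two commas (left to right) are replaced.
first_two = line.replace(',', '\\,', 2)
print(first_two)
```

Since str.replace() does not know about the quotes, it would probably have to be applied to each quoted block separately rather than to the whole line.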