String searching with returning the matched line in Python
I am new to Python. I want to match strings in certain lines of a file. Suppose I have these strings:
british 7
German 8
France 90
and I have some lines in a file, like this:
<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>
I would like to get output like this:
<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.</s>
I tried the following code:
for i in file:
    if left in i and right in i:
        line = i.replace(left, '<w1>' + left + '</w1>')
        lineR = line.replace(right, '<w2>' + right + '</w2>')
        text = text + lineR + "\n"
        continue
return text
However, it also matches the string inside the id, e.g.:
<s id="69-<w2>7</w2>">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>
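So, is there a way to search for the string as a whole word rather than as a run of characters, so that I can avoid <s id="69-<w2>7</w2>">?
Thanks in advance for any help.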
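I have something rather complicated, but I wrote it in a hurry and for now it does the job.
Notes:
I added " in France" after "the studio 7 album by British pop band 10cc",
and only "British" is tagged.
In "by the german band Genesis 8 and was released in 1978", "1978" is left unmodified, while "8" is tagged.
That is why it is complicated.
But I fear that, despite this complexity, it does not work for all possible sentences.
It should be improved so that idi is always the name of the right music group, rather than always the first one found as in this solution. But that is hard work without knowing exactly what you want.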
# Note: this is Python 2 code (print statements, generator .next()).
import re

ss = '''british 7
German 8
France 90'''

text = '''<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc in France.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>
'''

# Map each lowercased keyword to its number, e.g. 'british' -> '7'
regx = re.compile(r'^(.+?)[ \t]+(\d+)', re.MULTILINE)
dico = dict((a.lower(), b) for (a, b) in regx.findall(ss))
print 'dico==', dico
print '\n\n'

# Split on the <s id="..."> and </s> tags; the capturing group keeps them in the list
rogx = re.compile(r'(<s id="[\d-]+">|</s>\r?\n)')
splitted = rogx.split(text)
print 'splitted==\n', splitted
print '=================\n'

def repl(mat):
    # idi: the first keyword found in the current segment (read from the global list `the`)
    idi = (b for (a, b) in the if b).next().lower()
    x, y = mat.groups()
    if x:
        # A number matched: tag it only if it is the number paired with idi
        if dico[idi] == x:
            return '<w2>%s</w2>' % x
        else:
            return x
    if y:
        # A keyword matched: tag it only if it is idi itself
        if y.lower() == idi:
            return '<w1>%s</w1>' % y
        else:
            return y

# Group 1: any run of digits; group 2: any of the keywords (case-insensitive)
rigx = re.compile('(\d+)|(' + '|'.join(dico.keys()) + ')', re.IGNORECASE)

# Even indexes of splitted hold the text between the tags
for i, el in enumerate(splitted[0::2]):
    if el:
        print '-----------------------------'
        print '* index in splitted==', 2*i
        print '\n* el==\n', repr(el)
        print '\n* rigx.findall(el)==\n', rigx.findall(el)
        the = rigx.findall(el)
        print '\n* modified el:\n', rigx.sub(repl, el)
        splitted[2*i] = rigx.sub(repl, el)

print '\n\n##################################\n\n'
print 'modified splitted==\n', splitted
print
print ''.join(splitted)
Result
dico== {'german': '8', 'british': '7', 'france': '90'}
splitted==
['', '<s id="69-7">', '...Meanwhile is the studio 7 album by British pop band 10cc in France.', '</s>\n', '', '<s id="15-8">', '...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.', '</s>\n', '', '<s id="1990-2">', 'Magnum Nitro Express is a France centerfire fire rifle cartridge 90.', '</s>\n', '']
=================
-----------------------------
* index in splitted== 2
* el==
'...Meanwhile is the studio 7 album by British pop band 10cc in France.'
* rigx.findall(el)==
[('7', ''), ('', 'British'), ('10', ''), ('', 'France')]
* modified el:
...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.
-----------------------------
* index in splitted== 6
* el==
'...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.'
* rigx.findall(el)==
[('', 'german'), ('8', ''), ('1978', '')]
* modified el:
...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.
-----------------------------
* index in splitted== 10
* el==
'Magnum Nitro Express is a France centerfire fire rifle cartridge 90.'
* rigx.findall(el)==
[('', 'France'), ('90', '')]
* modified el:
Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.
##################################
modified splitted==
['', '<s id="69-7">', '...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.', '</s>\n', '', '<s id="15-8">', '...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.', '</s>\n', '', '<s id="1990-2">', 'Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.', '</s>\n', '']
<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.</s>
I eliminated replmodel().
repl() now takes the value of rigx.findall(el).
I added the line the = rigx.findall(el) for this.
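For readers on Python 3, the same idea (split on the tags, then tag only the first keyword found in each text segment together with its paired number) can be sketched more compactly. This is a minimal sketch, not the answer's own code: the names `pairs_text`, `pairs`, `tag_re`, `token_re` and `tag_segment` are hypothetical, and it assumes the tags always look like `<s id="digits-digits">` and `</s>` followed by a newline.

```python
import re

pairs_text = """british 7
German 8
France 90"""

text = '''<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc in France.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>
'''

# Lowercased keyword -> paired number, e.g. 'british' -> '7'.
pairs = {word.lower(): num
         for word, num in re.findall(r'^(.+?)[ \t]+(\d+)', pairs_text, re.MULTILINE)}

# The capturing group makes re.split keep the tags as separate list items.
tag_re = re.compile(r'(<s id="[\d-]+">|</s>\r?\n)')
# Group 1: any run of digits; group 2: one of the keywords.
token_re = re.compile(r'(\d+)|(' + '|'.join(map(re.escape, pairs)) + ')',
                      re.IGNORECASE)


def tag_segment(segment):
    """Tag the first keyword found in segment and the number paired with it."""
    first = next((m.group(2).lower() for m in token_re.finditer(segment)
                  if m.group(2)), None)
    if first is None:
        return segment  # no keyword in this segment: leave it untouched

    def repl(match):
        number, word = match.groups()
        if number is not None:
            return '<w2>%s</w2>' % number if pairs[first] == number else number
        return '<w1>%s</w1>' % word if word.lower() == first else word

    return token_re.sub(repl, segment)


pieces = tag_re.split(text)
# Even indexes hold the text between tags; odd indexes hold the tags themselves.
pieces[0::2] = [tag_segment(p) for p in pieces[0::2]]
result = ''.join(pieces)
print(result)
```

Passing the first keyword through a closure instead of a global list keeps `tag_segment` self-contained. The limitation noted above still applies: only the first keyword in a segment is considered.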
You should use regular expressions to replace whole words specifically, not parts of words.
Something like
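import re

left = 'british'
right = '7'
# i is one line of the file, here the first sample line
i = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
# (?i) makes the match case-insensitive; (\s+) on both sides requires surrounding whitespace
i1 = re.sub(r'(?i)(\s+)(%s)(\s+)' % re.escape(left), r'\1<w1>\2</w1>\3', i)
i2 = re.sub(r'(?i)(\s+)(%s)(\s+)' % re.escape(right), r'\1<w2>\2</w2>\3', i1)
print(i2)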
This gives us '<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>'
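One thing to watch with a pattern that demands whitespace on both sides: a token followed by punctuation, such as the `90.` in the third sample line, is not matched. A possible variation, sketched here under assumptions, replaces the whitespace groups with lookarounds that reject a neighbouring word character or hyphen. The helper names `tag_word` and `tag_pair` are hypothetical, and the hyphen rule only protects ids of the form `digits-digits`, as in the sample data. An id consisting of the bare number would still be tagged.

```python
import re


def tag_word(word, tag, line):
    """Wrap case-insensitive whole-token matches of word in <tag>...</tag>."""
    # The lookarounds reject matches glued to a word character or a hyphen,
    # so the 7 in id="69-7" and the 90 inside 1990 are left alone.
    pattern = r'(?<![\w-])(%s)(?![\w-])' % re.escape(word)
    return re.sub(pattern, r'<%s>\1</%s>' % (tag, tag), line, flags=re.IGNORECASE)


def tag_pair(left, right, line):
    """Tag left with <w1> first, then right with <w2>."""
    return tag_word(right, 'w2', tag_word(left, 'w1', line))


line1 = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
line3 = '<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>'

print(tag_pair('british', '7', line1))
print(tag_pair('France', '90', line3))
```

Because lookarounds consume no characters, the replacement does not need to put the surrounding whitespace back.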
If that approach leads to errors, you can try more elaborate code, for example
import re

def do(left, right, line):
    # Split the line into tags and text; the capturing group keeps the tags
    parts = [x for x in re.split(r'(<[^>]+>)', line) if x]
    for idx, l in enumerate(parts):
        lu = l.upper()
        # Skip the <s ...> and </s> tags; only touch text containing both strings
        if (not ('<s' in l or 's>' in l) and
                (left.upper() in lu and right.upper() in lu)):
            l = re.sub(r'(?i)(\s+)(%s)(\s+)' % re.escape(left), r'\1<w1>\2</w1>\3', l)
            l = re.sub(r'(?i)(\s+)(%s)(\s+)' % re.escape(right), r'\1<w2>\2</w2>\3', l)
            parts[idx] = l
    return ''.join(parts)

line = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
print(do('british', '7', line))
# '-7' occurs only inside the id, so this line is returned unchanged
print(do('british', '-7', line))
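To apply this to every pair and every line, as the loop in the question does, the tag-splitting step can be combined with a loop over the pairs. The following is only a sketch under assumptions: `PAIRS`, `LINES`, `TAG_RE`, `whole_word` and `tag_line` are hypothetical names, the pairs are hard-coded rather than read from a file, and only the first pair whose word and number both occur in the text of a line is tagged.

```python
import re

PAIRS = [('british', '7'), ('German', '8'), ('France', '90')]

LINES = [
    '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>',
    '<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>',
    '<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>',
]

TAG_RE = re.compile(r'(<[^>]+>)')  # capturing group keeps the tags when splitting


def whole_word(word):
    """Compile a case-insensitive pattern matching word as a whole token."""
    return re.compile(r'(?<!\w)%s(?!\w)' % re.escape(word), re.IGNORECASE)


def tag_line(line, pairs):
    """Tag the first pair whose word and number both occur in the text of line."""
    parts = TAG_RE.split(line)  # even indexes: text, odd indexes: markup
    for left, right in pairs:
        left_re, right_re = whole_word(left), whole_word(right)
        text = ''.join(parts[0::2])
        if left_re.search(text) and right_re.search(text):
            for idx in range(0, len(parts), 2):
                part = left_re.sub(lambda m: '<w1>%s</w1>' % m.group(0), parts[idx])
                parts[idx] = right_re.sub(lambda m: '<w2>%s</w2>' % m.group(0), part)
            break
    return ''.join(parts)


tagged = [tag_line(line, PAIRS) for line in LINES]
print('\n'.join(tagged))
```

Since the markup is never passed to the substitution, the id attribute cannot be altered, whatever its format.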
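The best way is to use regular expressions. But if 'left' and 'right' always have at least one leading and one trailing space, you can use a simple trick (just add the leading and trailing spaces to your pattern):
# i is the current line, as in the loop from the question
line = i.replace(' ' + left + ' ', ' <w1>' + left + '</w1> ')
lineR = line.replace(' ' + right + ' ', ' <w2>' + right + '</w2> ')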
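A short sketch of how the space-padding trick behaves on the sample lines may help when deciding whether it is enough. The function name `tag_padded` is hypothetical. Note that `str.replace` is case-sensitive and that a token followed by punctuation has no trailing space, so both cases are missed.

```python
def tag_padded(line, left, right):
    """Space-padded str.replace: matches only exact-case tokens between single spaces."""
    line = line.replace(' ' + left + ' ', ' <w1>' + left + '</w1> ')
    return line.replace(' ' + right + ' ', ' <w2>' + right + '</w2> ')


line1 = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
line3 = '<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>'

# Both tokens tagged; the id is untouched because it contains no spaces.
print(tag_padded(line1, 'British', '7'))
# Only 7 is tagged: 'british' does not match 'British'.
print(tag_padded(line1, 'british', '7'))
# Only France is tagged: '90.' has no trailing space.
print(tag_padded(line3, 'France', '90'))
```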