How to replace a substring that's not enclosed in tags with regex in Python

I have a sentence:

text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"

I want to put a <PERSON></PERSON> tag around the untagged "Obama", so that the result looks like this:
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012

I want to find a substring (for example, Obama) that has no <PERSON> tag before it and no </PERSON> tag after it, but I don't know the right regex syntax in Python.
I'm new to Python.

A simple re.sub(namedEntity, "<PERSON>"+namedEntity+"</PERSON>", text) tags every occurrence, including the one that is already tagged, and gives this output:
The president of America is <PERSON>Barack <PERSON>Obama</PERSON></PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012

This is my code (using Python 2.7):

import re

result=re.sub(r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)

print "result: "+result

The output:
result: <PERSON>Obama</PERSON>
And I don't know whether that is the first "Obama" or the second one.

Thanks in advance for your help.

You are very close. In your new regex r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))", the .* before and after the name is greedy, so the match covers 'Obama' plus everything before and after it. The tags end up inside the matched text, so the lookarounds never exclude anything, and the whole string is replaced by <PERSON>Obama</PERSON>. If you remove both .* parts, you get the result you're after.

>>> import re
>>> text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
>>> namedEntity = 'Obama'
>>> result = re.sub(r"((?!<PERSON>)"+namedEntity+"(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)
>>> print result
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012

For future regex testing, regex101 works well for checking how a pattern behaves as you change it live. For your case this shows what's happening.
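To see the difference without an online tester, the match spans can be inspected directly. The following is a minimal sketch, assuming the sentence from the question; the names named_entity, greedy_span and fixed_spans are illustrative, and re.escape is an added precaution that the original code does not use.

```python
import re

text = ("The president of America is <PERSON>Barack Obama</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama was reelected president in November 2012")
named_entity = "Obama"

# Pattern from the question: the greedy .* on both sides swallows the
# whole sentence, so re.sub replaces everything with a single tag.
greedy = re.search(
    r"((?!<PERSON>).*" + named_entity + r".*(?!</PERSON>))", text)
greedy_span = greedy.span()

# Pattern without .*: only the name itself is matched.
# re.escape guards against names that contain regex metacharacters.
fixed = re.compile(
    r"(?!<PERSON>)" + re.escape(named_entity) + r"(?!</PERSON>)")
fixed_spans = [m.span() for m in fixed.finditer(text)]

print(greedy_span == (0, len(text)))  # True: the match is the full string
print(len(fixed_spans))               # 1: only the untagged occurrence
```

In the pattern without .*, it is the trailing (?!</PERSON>) that excludes the tagged name. The leading (?!<PERSON>) looks ahead at text that begins with the name itself, so it cannot fail at that position.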

Just remove the .* parts from your regex and keep only the lookarounds around the name.

>>> import re
>>> text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
>>> surname = re.search(r'<PERSON>(.*)</PERSON>', text).group(1).split()[1]
>>> print surname
Obama
>>> re.sub(r'(?<!<PERSON>)' + surname + '(?!</PERSON>)', '<PERSON>' + surname + '</PERSON>', text)
'The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012'

Note: you can also extract the person's surname with a regex capture group, as done above for the surname variable. Use (?<!regex) to assert a negative lookbehind and (?!regex) to assert a negative lookahead.
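Lookarounds only inspect the characters directly next to the name, so a name in the middle of a tagged span (a hypothetical case such as <PERSON>Barack Obama Jr</PERSON>) would still be wrapped a second time. One possible alternative, not taken from the answers above, is to match either a whole tagged span or the bare name, and leave tagged spans untouched in a replacement function. This is a sketch under those assumptions; tag_untagged and its parameters are made-up names.

```python
import re

def tag_untagged(text, name, tag="PERSON"):
    """Wrap `name` in <tag>...</tag> unless it already sits inside one."""
    # First alternative: an existing tagged span (non-greedy, single line).
    tagged = r"<{0}>.*?</{0}>".format(tag)
    # Second alternative: the bare name, escaped so it is matched literally.
    pattern = re.compile("(" + tagged + ")|" + re.escape(name))

    def repl(match):
        if match.group(1) is not None:
            return match.group(1)  # already tagged: keep unchanged
        return "<{0}>{1}</{0}>".format(tag, match.group(0))

    return pattern.sub(repl, text)

text = ("The president of America is <PERSON>Barack Obama</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama was reelected president in November 2012")
result = tag_untagged(text, "Obama")
print(result)
```

Because the tagged span is tried first at each position, any name inside it is consumed together with its tags and is never seen by the second alternative. The sketch assumes tags are not nested and do not span multiple lines.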
