Remove content between tags in Python using regex

I am trying to clean up wikitext. Specifically, I want to remove every {{.....}} and <..>...</..> from it. For example, given this wikitext:

"{{Infobox UK place\\n|country = England\\n|official_name = Morcombelake\\n|static_image_name = Morecombelake from Golden Cap - geograph.org.uk - 1184424.jpg\\n|static_image_caption = Morcombelake as seen from Golden Cap\\n|coordinates = {{coord|50.74361|-2.85153|display=inline,title}}\\n|map_type = Dorset\\n|population = \\n|population_ref = \\n|shire_district = [[West Dorset]]\\n|shire_county = [[Dorset]]\\n|region = South West England\\n|constituency_westminster = West Dorset\\n|post_town = \\n|postcode_district = \\n|postcode_area = DT\\n|os_grid_reference = SY405938\\n|website = \\n}}\\n'''Morcombelake''' (also spelled '''Morecombelake''') is a small village near [[Bridport]] in [[Dorset]], [[England]], within the ancient parish of [[Whitchurch Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World Heritage Site, is nearby.{{cite web|url= http://www.nationaltrust.org.uk/golden-cap/|title=Golden Cap|publisher=National Trust|accessdate=2014-05-04}}\\n\\n== References ==\\n{{reflist}}\\n\\n{{West Dorset}}\\n\\n\\n{{Dorset-geo-stub}}\\n[[Category:Villages in Dorset]]\\n\\n== External Links ==\\n\\n*[ http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html Parish Church of St Gabriel]\\n\\n"

How can I use regular expressions in Python to produce output like this:

\\n'''Morcombelake''' (also spelled '''Morecombelake''') is a small village near [[Bridport]] in [[Dorset]], [[England]], within the ancient parish of [[Whitchurch Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World Heritage Site, is nearby.\\n\\n== References ==\\n\\n\\n\\n\\n\\n\\n[[Category:Villages in Dorset]]\\n\\n== External Links ==\\n\\n*[ http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html Parish Church of St Gabriel]\\n\\n

Because the tags are nested inside one another, you can find and remove them in a loop, innermost first:

import re

n = 1
while n > 0:
    # Each pass removes the innermost {{...}} templates and any <...> tags.
    s, n = re.subn(r'{{(?!{)(?:(?!{{).)*?}}|<[^<]*?>', '', s, flags=re.DOTALL)

s is a string containing the wikitext. n is the number of substitutions made in the last pass, so the loop stops once nothing is left to remove.

Your example contains no <...> tags, but the pattern should remove them as well. Note that it strips only the tags themselves; any text between an opening and a closing tag is left in place.
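To see the loop at work, here is a minimal sketch that wraps it in a function and runs it on a shortened sample. The function name `strip_markup`, the constant `PATTERN`, and the sample string are made up for illustration; the pattern is the one from the loop above, with the braces escaped, which should behave the same way.

```python
import re

# Same pattern as in the loop above; braces are escaped only for readability.
PATTERN = re.compile(r"\{\{(?!\{)(?:(?!\{\{).)*?\}\}|<[^<]*?>", re.DOTALL)


def strip_markup(text):
    """Remove {{...}} templates (innermost first) and <...> tags."""
    count = 1
    while count > 0:
        # subn returns the new string and the number of replacements made.
        text, count = PATTERN.subn("", text)
    return text


# Hypothetical, shortened sample; not the full wikitext from the question.
sample = (
    "{{Infobox|coordinates = {{coord|50.7|-2.8}}\n}}\n"
    "'''Morcombelake''' is a village."
    "<ref>{{cite web|title=Golden Cap}}</ref>\n"
    "{{reflist}}\n"
)

print(repr(strip_markup(sample)))
```

The first pass removes the inner `{{coord...}}` template, the `<ref>` tags, and the flat templates; the second pass can then match the outer `{{Infobox ...}}`, which no longer contains a nested `{{`. A third pass finds nothing and ends the loop.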
