Remove content between tags in Python using regular expressions

Question

I am trying to clean up Wikitext. Specifically, I want to remove every {{.....}} and <..>...</..> from the Wikitext. For example, given this Wikitext:

"{{Infobox UK place\\n|country = England\\n|official_name = Morcombelake\\n|static_image_name = Morecombelake from Golden Cap-geograph.org.uk-1184424.jpg\\n|static_image_caption = Morcombelake as seen from Golden Cap\\n|coordinates = {{coord|50.74361|-2.85153|display=inline,title}}\\n|map_type = Dorset\\n|population = \\n|population_ref = \\n|shire_district = [[West Dorset]]\\n|shire_county = [[Dorset]]\\n|region = South West England\\n|constituency_westminster = West Dorset\\n|post_town = \\n|postcode_district = \\n|postcode_area = DT\\n|os_grid_reference = SY405938\\n|website = \\n}}\\n'''Morcombelake''' (also spelled '''Morecombelake''') is a small village near [[Bridport]] in [[Dorset]], [[England]], located in [[Whitchurch Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World Heritage Site, is nearby.{{cite web|url=http://www.nationaltrust.org.uk/golden-cap/|title=Golden Cap|publisher=National Trust|accessdate=2014-05-04}}\\n\\n== References ==\\n{{reflist}}\\n\\n{{West Dorset}}\\n\\n\\n{{Dorset-geo-stub}}\\n[[Category:Villages in Dorset]]\\n\\n== External links ==\\n\\n* [http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html Parish Church of St Gabriel]\\n\\n"

How can I use a regular expression in Python to produce the following output:

\\n'''Morcombelake''' (also spelled '''Morecombelake''') is a small village near [[Bridport]] in [[Dorset]], [[England]], located in [[Whitchurch Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World Heritage Site, is nearby.\\n\\n== References ==\\n\\n\\n\\n\\n\\n\\n[[Category:Villages in Dorset]]\\n\\n== External links ==\\n\\n* [http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html Parish Church of St Gabriel]\\n\\n

Answer 1

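Because the tags are nested inside one another, you can find and remove them in a loop, stripping the innermost ones on each pass: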

import re
# Repeat until a pass makes no substitutions (n == 0).
n = 1
while n > 0:
    s, n = re.subn(r'{{(?!{)(?:(?!{{).)*?}}|<[^<]*?>', '', s, flags=re.DOTALL)

Here s is the string containing the wikitext.

Your example contains no <...> tags, but the same pattern should remove them as well.
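As a rough illustration, the loop can be wrapped in a small function and tried on a shortened input. This is only a sketch: the name strip_templates and the sample string are made up for this example and are not part of the original question or answer. Note that the `<[^<]*?>` alternative removes the tags themselves, not the text between an opening and a closing tag.

```python
import re

# Matches an innermost {{...}} template (one with no "{{" inside it)
# or a single <...> tag. Same pattern as in the answer, precompiled.
TEMPLATE_OR_TAG = re.compile(r'{{(?!{)(?:(?!{{).)*?}}|<[^<]*?>', flags=re.DOTALL)


def strip_templates(wikitext):
    """Hypothetical helper: remove templates and tags, innermost first."""
    n = 1
    while n > 0:
        # subn returns the new string and the number of substitutions made;
        # once a pass removes nothing, everything nested has been peeled off.
        wikitext, n = TEMPLATE_OR_TAG.subn('', wikitext)
    return wikitext


# Shortened, made-up sample with one nested template and one tag.
sample = ("{{Infobox|coordinates = {{coord|50.7|-2.8}}}}"
          "'''Morcombelake''' is a village.<br/>{{reflist}}")
cleaned = strip_templates(sample)
print(cleaned)
```

The first pass removes {{coord|...}}, <br/> and {{reflist}}; the second removes the now-flat {{Infobox|...}}; the third finds nothing and ends the loop.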
