Removing an HTML tag and its contents from a string in Python

Question

I'm pretty new to regular expressions. Basically, I would like to use a regular expression to remove every <sup> ... </sup> element, including its contents, from a string.

Input:

<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>

Output:

<b>something here</b>, another here

Is there a short way to do this, with an explanation of how it works?

Note: this question might be a duplicate. I searched but couldn't find a solution.

Answer 1

You could do something like this:

import re

s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

# The non-greedy .*? stops at the first </sup> after each <sup>
s2 = re.sub(r'<sup>(.*?)</sup>', "", s)

print(s2)
# Prints: <b>something here</b>, another here

Remember to use (.*?) rather than (.*). The * quantifier is greedy by default, so (.*) matches as much as possible, running from the first <sup> to the last </sup>, and you would get a different result:

s2 = re.sub(r'<sup>(.*)</sup>', "", s)

print(s2)
# Prints: <b>something here</b>
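
To see why the two patterns behave differently, it can help to look at what each one actually matches before anything is replaced. The following is a minimal sketch using re.findall on the same input; the variable names lazy_matches and greedy_matches are only illustrative.

```python
import re

s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

# Lazy: each match stops at the nearest closing </sup>.
lazy_matches = re.findall(r"<sup>.*?</sup>", s)

# Greedy: a single match runs from the first <sup> to the last </sup>.
greedy_matches = re.findall(r"<sup>.*</sup>", s)

print(lazy_matches)
# ['<sup>1</sup>', '<sup>,3</sup>', '<sup>1</sup>']
print(greedy_matches)
# ['<sup>1</sup><sup>,3</sup>, another here<sup>1</sup>']
```

The greedy match swallows ", another here" because that text sits between the first opening tag and the last closing tag.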

Answer 2

The hard part is making the match between the tags minimal rather than maximal. This works:

import re
s0 = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
prog = re.compile(r'<sup>.*?</sup>')
s1 = prog.sub('', s0)
print(s1)
# <b>something here</b>, another here
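
If the real input is less tidy than the example (tags with attributes, mixed case, or contents that span several lines), the same non-greedy idea can be wrapped in a small helper. This is only a sketch under those assumptions: remove_tag is a hypothetical name, not something from the answers above, and it does not handle nested elements of the same tag.

```python
import re

def remove_tag(text, tag):
    """Remove every <tag ...>...</tag> element, including its contents.

    Hypothetical helper for illustration; nested elements of the same
    tag are not handled.
    """
    pattern = re.compile(
        # \b keeps "sup" from matching "super"; [^>]* allows attributes.
        r"<{0}\b[^>]*>.*?</{0}\s*>".format(re.escape(tag)),
        # DOTALL lets . cross newlines; IGNORECASE accepts <SUP> as well.
        flags=re.DOTALL | re.IGNORECASE,
    )
    return pattern.sub("", text)

s0 = '<b>something here</b><sup class="ref">1</sup><SUP>,3</SUP>, another\n<sup>\n1\n</sup>here'
cleaned = remove_tag(s0, "sup")
print(cleaned)
```

A regular expression is fine for a simple, known fragment like this one; for arbitrary HTML, a real parser is usually the more robust choice.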

Answer 1: posted 2016-08-19 19:48:43
Answer 2 (accepted): posted 2016-08-19 19:52:47
