Remove an HTML tag and the string in between in Python

I'm fairly new to regular expressions. I would like to use one to remove every <sup> ... </sup> element, including the text inside it, from a string.

Input:

<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>

Output:

<b>something here</b>, another here

Is there a short way to do this, with an explanation of how it works?

Note: this question might be a duplicate. I searched but couldn't find a solution.

You could do something like this:

import re
s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

s2 = re.sub(r'<sup>(.*?)</sup>',"", s)

print(s2)
# Prints: <b>something here</b>, another here

Remember to use (.*?). The plain (.*) is a greedy quantifier: it matches as much as possible, so it runs from the first <sup> to the last </sup> and gives a different result:

s2 = re.sub(r'<sup>(.*)</sup>',"", s)

print(s2)
# Prints: <b>something here</b>
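
To see why the two patterns differ, it can help to look at what each one actually matches before anything is removed. The snippet below is only an illustrative sketch on the same input string; the names lazy_matches and greedy_matches are introduced here for the example and are not part of the original answer.

```python
import re

s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

# Non-greedy: each match stops at the first closing </sup> it can reach.
lazy_matches = re.findall(r'<sup>.*?</sup>', s)

# Greedy: a single match stretches from the first <sup> to the last </sup>.
greedy_matches = re.findall(r'<sup>.*</sup>', s)

print(lazy_matches)
# ['<sup>1</sup>', '<sup>,3</sup>', '<sup>1</sup>']
print(greedy_matches)
# ['<sup>1</sup><sup>,3</sup>, another here<sup>1</sup>']
```

The greedy pattern swallows ", another here" along with the tags, which is why that text disappears from the result.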

The hard part is knowing how to make a minimal rather than maximal match of the content between the tags. The following works:

import re
s0 = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
prog = re.compile(r'<sup>.*?</sup>')
s1 = prog.sub('', s0)
print(s1)
# <b>something here</b>, another here
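
If the input may contain tags with attributes, uppercase tag names, or content that spans several lines, the same non-greedy idea can be wrapped in a small helper. This is a minimal sketch under those assumptions, not something from the answers above: strip_tag and its tag parameter are hypothetical names. It does not handle nested tags of the same name, and for arbitrary HTML a real HTML parser is usually the safer choice.

```python
import re

def strip_tag(text, tag="sup"):
    """Remove every <tag ...>...</tag> element, including its content.

    Hypothetical helper for illustration; assumes the tags are not nested.
    """
    pattern = re.compile(
        # \b keeps "sup" from matching "super"; [^>]* allows attributes.
        r'<{0}\b[^>]*>.*?</{0}>'.format(re.escape(tag)),
        # DOTALL lets . match newlines; IGNORECASE accepts <SUP> as well.
        flags=re.DOTALL | re.IGNORECASE,
    )
    return pattern.sub('', text)

s0 = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
print(strip_tag(s0))
# <b>something here</b>, another here
```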
