Stripping tags using regex in Python
How can I go about stripping the tags off this list:
['</span>A walk in the park<span class="html-tag"']
I managed to use (r'(?<=</span>)[^>]+')
to remove the first tag but can't figure out how to remove the second. I know regular expressions ain't the way to go for dealing with tags, but I just want to figure this out.
You can use:
(?:>)(.*)(?:<)
In regex, every pair of opening and closing round brackets defines a group. Here we have three pairs of round brackets, but the first and the last one have a ?: inside. That means the group being defined is a non-capturing group: it is needed to match the pattern, but it will not be returned by the parser. What you want is in group #1 instead.
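As a minimal sketch of how group #1 can be read in Python, using the sample string from the question (the variable names `match` and `text` are only illustrative):

```python
import re

# Sample string from the question.
s = '</span>A walk in the park<span class="html-tag"'

# (?:>) and (?:<) are non-capturing, so only (.*) is exposed as group 1.
match = re.search(r'(?:>)(.*)(?:<)', s)

# re.search returns None when nothing matches, so guard before reading the group.
text = match.group(1) if match else None
print(text)
```

Note that .* is greedy: on a string containing several tags it would run up to the last <, whereas .*? would stop at the first one.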
You were quite close with your regex. After the position found by the lookbehind, you just want to read up to the next <:
(?<=</span>)[^<]+
Check it out on regex101
$ cat test.py
import re
s='</span>A walk in the park<span class="html-tag"'
print(re.findall(r'(?<=</span>)[^<]+', s))
$ python test.py
['A walk in the park']
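To see why the original attempt kept the second tag, here is a small side-by-side sketch on the same sample string (the names `original` and `fixed` are just for illustration):

```python
import re

# Sample string from the question.
s = '</span>A walk in the park<span class="html-tag"'

# Original attempt: [^>]+ only stops at '>', and there is none after </span>,
# so the match runs through the second tag to the end of the string.
original = re.findall(r'(?<=</span>)[^>]+', s)

# Fix: [^<]+ stops at the next '<', which is where the second tag begins.
fixed = re.findall(r'(?<=</span>)[^<]+', s)

print(original)
print(fixed)
```

The only difference between the two patterns is the character excluded by the negated class.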