Stripping tags using regex in python

How can I go about stripping the tags off this list:

['</span>A walk in the park<span class="html-tag"']

I managed to use (r'(?<=</span>)[^>]+') to remove the first tag but can't figure out how to remove the second. I know regular expressions aren't the way to go for dealing with tags, but I just want to figure this out.

You can use:

(?:>)(.*)(?:<)

In regex, each pair of parentheses defines a group. Here we have three pairs of parentheses, but the first and the last start with ?: inside. That means the group is non-capturing: it has to match for the pattern to match, but it is not returned as a captured group. What you want is in group 1.
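As a minimal sketch of how that could look in Python (assuming the sample string from the question; the variable names s, match and text are just for illustration), re.search returns a match object and group(1) holds the captured text:

```python
import re

# Sample string from the question (the name `s` is only illustrative).
s = '</span>A walk in the park<span class="html-tag"'

# (?:>) and (?:<) are non-capturing; only the middle (.*) becomes group 1.
match = re.search(r'(?:>)(.*)(?:<)', s)
text = match.group(1) if match else None
print(text)  # A walk in the park
```

Note that .* is greedy, so on a string with several tags it would run from the first > to the last <.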

You were quite close with your regex. After the position found by the lookbehind, you just want to read up to the next < :

(?<=</span>)[^<]+

Check it out on regex101.

$ cat test.py
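import re
# Sample string from the question.
s = '</span>A walk in the park<span class="html-tag"'
# The lookbehind anchors the match right after </span>; [^<]+ then reads up to the next <.
print(re.findall(r'(?<=</span>)[^<]+', s))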

$ python test.py
['A walk in the park']
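To apply the same pattern to the list from the question rather than a single string, something along these lines should work. This is only a sketch: the names items, pattern, stripped and original are assumptions, and it keeps just the first match per item. It also shows why the original [^>]+ version did not stop at the second tag.

```python
import re

# The list from the question (the name `items` is an assumption for this sketch).
items = ['</span>A walk in the park<span class="html-tag"']

pattern = re.compile(r'(?<=</span>)[^<]+')

# Keep the first match per item; items with no match are skipped.
stripped = [m.group(0) for m in map(pattern.search, items) if m]
print(stripped)  # ['A walk in the park']

# For comparison, the original [^>]+ runs to the end of the string here,
# because there is no '>' after '</span>' to stop it.
original = re.findall(r'(?<=</span>)[^>]+', items[0])
print(original)  # ['A walk in the park<span class="html-tag"']
```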
