Capture nested tags with regex?
s = '''<p>Plain text, <i>italicized phrase,
<i>italicized subphrase</i>, <b>bold
subphrase</b></i>, <i>other italic
phrase</i></p>'''
r1 = r'''(?sx)(
<i>(
(?!</?i>).
|
<i> ( (?!</?i>). )* </i>
)*</i>
)'''
I use the r1 pattern to capture every <i>...</i> in the string s, but <i>italicized subphrase</i> isn't captured. Why?
I'm not really dealing with HTML, but with something that has a nested structure similar to HTML's; I'm just using this code as an example. My problem is how to capture both the nested and the enclosing tags when the structure has only one level of nesting.
You are using a regular expression, and matching XML with such expressions gets too complicated, too fast.
Please don't make it hard on yourself; use an HTML parser instead. Python has several to choose from.
ElementTree example:
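from xml.etree import ElementTree

# ElementTree is an XML parser, so the file must be well-formed XML.
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree, so nested <i> elements are visited too;
# findall('i') would only return direct children of the root element.
for elem in tree.iter('i'):
    print(ElementTree.tostring(elem, encoding='unicode'))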
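Applied to the string s from the question, the same idea might look like the sketch below. It assumes the input is well-formed XML, which ElementTree requires; the helper outer_markup is a hypothetical name introduced here, not part of the library. It is needed because tostring() also emits the text that follows an element's closing tag (its tail), which is not part of the <i>...</i> span itself.

```python
from xml.etree import ElementTree

s = '''<p>Plain text, <i>italicized phrase,
<i>italicized subphrase</i>, <b>bold
subphrase</b></i>, <i>other italic
phrase</i></p>'''

root = ElementTree.fromstring(s)


def outer_markup(elem):
    """Serialize elem without the text trailing its closing tag.

    Hypothetical helper: tostring() includes elem.tail, so the tail is
    cleared temporarily and restored afterwards.
    """
    tail, elem.tail = elem.tail, None
    try:
        return ElementTree.tostring(elem, encoding='unicode')
    finally:
        elem.tail = tail


# iter('i') yields every <i> element in document order,
# including the one nested inside another <i>.
italics = [outer_markup(e) for e in root.iter('i')]

for markup in italics:
    print(markup)
    print('---')
```

This should yield three entries: the enclosing <i> element, the nested <i>italicized subphrase</i>, and the final <i>other italic phrase</i>. Real-world HTML is often not well-formed XML, in which case a more tolerant HTML parser is likely the better choice.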