Capturing nested tags with a regular expression?

Question

s = '''<p>Plain text, <i>italicized phrase,
 <i>italicized subphrase</i>, <b>bold
 subphrase</b></i>, <i>other italic
 phrase</i></p>'''

r1 = r'''(?sx)(
<i>(
(?!</?i>).
|
<i> ( (?!</?i>). )* </i>
)*</i>
)'''

I use the r1 pattern to capture ... in string s, but the italicized subphrase isn't captured. Why?

I'm not really dealing with HTML, but with something that has a similar nested structure; I'm only using this code as an example. My problem is how to capture both the nested and the enclosing tags when the structure is nested only one level deep.

Answer 1

You are using a regular expression, and matching XML with such expressions gets too complicated, too fast.
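As an illustration of how quickly this goes wrong, the sketch below runs the r1 pattern from the question and lists what it actually matches. It assumes Python 3's re module; the variable name matches is introduced here for illustration only. The regex engine reports non-overlapping matches, so once the outer <i>...</i> has been consumed, the text inside it is not scanned again.

```python
import re

s = '''<p>Plain text, <i>italicized phrase,
 <i>italicized subphrase</i>, <b>bold
 subphrase</b></i>, <i>other italic
 phrase</i></p>'''

r1 = r'''(?sx)(
<i>(
(?!</?i>).
|
<i> ( (?!</?i>). )* </i>
)*</i>
)'''

# finditer scans left to right and yields non-overlapping matches,
# so a nested <i>...</i> is swallowed by the match that encloses it.
matches = [m.group(1) for m in re.finditer(r1, s)]
for text in matches:
    print(repr(text))
```

The inner phrase shows up only as part of the first match, never as a match of its own. Getting it separately with a regex would mean applying the pattern again to the inside of each match, which is the kind of bookkeeping a parser does for you.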

Please don't make it hard on yourself; use an HTML parser instead. Python has several to choose from:

ElementTree is part of the standard library.
BeautifulSoup is a popular third-party library.
lxml is a fast and feature-rich C-based library.

ElementTree example:

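from xml.etree import ElementTree

# The file must be well-formed XML for ElementTree to parse it.
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree, so nested <i> elements are found too;
# findall('i') would only return direct children of the root element.
for elem in tree.iter('i'):
    print(ElementTree.tostring(elem, encoding='unicode'))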
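Applied to the string s from the question rather than to a file, the same idea might look like the sketch below. It assumes Python 3 and that the input is well-formed XML, as s is; the names root and phrases are chosen here for illustration.

```python
from xml.etree import ElementTree

s = '''<p>Plain text, <i>italicized phrase,
 <i>italicized subphrase</i>, <b>bold
 subphrase</b></i>, <i>other italic
 phrase</i></p>'''

# fromstring() parses a string and returns the root element (<p> here).
root = ElementTree.fromstring(s)

phrases = []
# iter('i') visits every <i> element in document order, nested ones included.
for elem in root.iter('i'):
    # itertext() yields the element's own text and that of its descendants.
    phrases.append(''.join(elem.itertext()))

for phrase in phrases:
    print(repr(phrase))
```

Both the enclosing phrase and the nested subphrase come out as separate entries, with no special handling for the nesting depth.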

Answer 1
2 2013-01-04 07:47:53
