
Capture nested tags with regex?

s = '''<p>Plain text, <i>italicized phrase,
 <i>italicized subphrase</i>, <b>bold
 subphrase</b></i>, <i>other italic
 phrase</i></p>'''

r1 = r'''(?sx)(
<i>(
(?!</?i>).
|
<i> ( (?!</?i>). )* </i>
)*</i>
)'''

I use the r1 pattern to capture <i>...</i> in the string s, but <i>italicized subphrase</i> is not captured. Why?

I'm not really dealing with HTML, but with something that has a similar nested structure; this code is just an example. My problem is how to capture both the nested tags and the tags that enclose them, in a structure with only one level of nesting.
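
The behaviour described above can be reproduced with a short sketch. It assumes Python 3 and `re.findall` (the question does not say which `re` function was used), and the names `plain` and `overlapping` are made up for illustration. A likely reason the inner <i>...</i> never shows up as a match of its own is that `re` reports non-overlapping matches: the outer match already consumes the nested tag. Wrapping the pattern in a lookahead is one way to let every start position report its own match.

```python
import re

s = '''<p>Plain text, <i>italicized phrase,
 <i>italicized subphrase</i>, <b>bold
 subphrase</b></i>, <i>other italic
 phrase</i></p>'''

# Same idea as r1, but with non-capturing groups so findall()
# returns whole matches instead of tuples of groups.
plain = re.compile(r'''(?sx)
<i>(?:
    (?!</?i>).                 # any character that does not start <i> or </i>
  |
    <i>(?:(?!</?i>).)*</i>     # or one complete nested <i>...</i>
)*</i>
''')

# The same pattern inside a lookahead: each match is zero-width, so the
# scan resumes one character later and can also find the nested tag.
overlapping = re.compile(r'''(?sx)(?=(
<i>(?:
    (?!</?i>).
  |
    <i>(?:(?!</?i>).)*</i>
)*</i>
))''')

outer_only = plain.findall(s)        # nested tag is swallowed by the outer match
all_tags = overlapping.findall(s)    # outer, nested, and trailing tag

for fragment in all_tags:
    print(repr(fragment))
```

This only covers the single level of nesting that the pattern spells out; deeper nesting would need a longer pattern or a parser.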

You are using a regular expression, and matching XML with such expressions gets too complicated, too fast.

Please don't make it hard on yourself; use an HTML parser instead. Python has several to choose from:

  • ElementTree is part of the standard library.
  • BeautifulSoup is a popular third-party library.
  • lxml is a fast and feature-rich C-based library.

ElementTree example:

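from xml.etree import ElementTree

# ElementTree only parses well-formed XML, so the file must be valid XHTML.
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree, so nested <i> elements are found too;
# findall('i') would only return direct children of the root element.
for elem in tree.iter('i'):
    print(ElementTree.tostring(elem, encoding='unicode'))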
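
To try the ElementTree approach on the string s from the question without a file, something like the following should work. It is only a sketch: it assumes Python 3 and input that is well-formed XML, and `markup_without_tail` is a hypothetical helper, needed because `tostring` also serializes the text that follows an element's closing tag.

```python
import copy
from xml.etree import ElementTree

s = '''<p>Plain text, <i>italicized phrase,
 <i>italicized subphrase</i>, <b>bold
 subphrase</b></i>, <i>other italic
 phrase</i></p>'''


def markup_without_tail(elem):
    """Serialize elem without the text that trails its closing tag."""
    clone = copy.copy(elem)   # shallow copy; the original tree is untouched
    clone.tail = None
    return ElementTree.tostring(clone, encoding='unicode')


root = ElementTree.fromstring(s)

# iter('i') yields every <i> element in document order, nested ones included.
fragments = [markup_without_tail(elem) for elem in root.iter('i')]

for fragment in fragments:
    print(repr(fragment))
```

Because the parser tracks nesting itself, this keeps working when tags are nested more than one level deep.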

