Python regular expression to parse div tags

A question about Python regular expressions.

I would like to match a div block like this:

<div class="leftTail"><ul class="hotnews">any news stuff</ul></div>

I was thinking of a pattern like this:

p = re.compile(r'<div\s+class=\"leftTail\">[^(div)]+</div>')

but it does not seem to work properly.

Another pattern I tried:

p = re.compile(r'<div\s+class=\"leftTail\">[\W|\w]+</div>')

With this one I get much more than I expected: it matches everything up to the last closing tag in the file.

Thanks for any help.

You might want to consider graduating to an actual HTML parser. I suggest you give Beautiful Soup a try. There are many crazy ways for HTML to be formatted, and regular expressions may not work correctly all the time, even if you write them correctly.
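To illustrate the parser approach without any third-party install, here is a minimal sketch using the standard library's `html.parser` in place of Beautiful Soup. The class name `DivTextExtractor` and the helper `extract_div_text` are made up for this example, and it only collects the text inside the matching div, not the markup. With Beautiful Soup installed, the equivalent lookup would be roughly `BeautifulSoup(html, "html.parser").find("div", class_="leftTail")`.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside every <div> that carries a given class.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # div nesting depth inside a matched block (0 = outside)
        self.chunks = []    # text pieces of the block currently being read
        self.blocks = []    # finished blocks, one string per matched div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # A nested div inside the block we are already collecting.
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # The matching </div> of the outer block was reached.
                self.blocks.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_div_text(html, target_class):
    """Return the text content of each div whose class list contains target_class."""
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = '<div class="leftTail"><ul class="hotnews">any news stuff</ul></div>'
print(extract_div_text(sample, "leftTail"))
```

Because the parser counts opening and closing div tags, a nested div inside the block does not end the match early, which is the case a regular expression cannot handle.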

Don't use regular expressions to parse XML or HTML. You'll never be able to get it to work correctly for nested divs.
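A small demonstration of the nested-div problem, using made-up sample markup (the `inner` and `footer` classes are assumptions for the example). A lazy quantifier stops at the first `</div>` and a greedy one runs to the last, so neither finds the closing tag that actually belongs to the outer block.

```python
import re

# Hypothetical markup: the target div contains another div.
html = (
    '<div class="leftTail">'
    '<div class="inner">first</div>'
    '<p>second</p>'
    '</div>'
    '<div class="footer">unrelated</div>'
)

lazy = re.compile(r'<div\s+class="leftTail">.*?</div>', re.DOTALL)
greedy = re.compile(r'<div\s+class="leftTail">.*</div>', re.DOTALL)

lazy_match = lazy.search(html).group(0)
greedy_match = greedy.search(html).group(0)

# Lazy: ends at the inner </div>, so "<p>second</p>" is cut off.
print(lazy_match)
# Greedy: runs to the final </div>, swallowing the unrelated footer div.
print(greedy_match)
```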

Try this:

p = re.compile(r'<div\s+class=\"leftTail\">.*?</div>')
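A quick sketch of why this differs from the patterns in the question, on made-up sample text. `[^(div)]` is a character class, so it rejects any single `(`, `d`, `i`, `v` or `)` rather than the word "div", while `[\W|\w]+` matches every character greedily. The `re.DOTALL` flag is an addition not in the answer above; it is assumed here so that `.` also matches newlines inside a block.

```python
import re

# Hypothetical page: two target blocks (one spanning two lines) and one other div.
page = (
    '<div class="leftTail"><ul class="hotnews">video news</ul></div>\n'
    '<div class="leftTail"><ul class="hotnews">daily\ndigest</ul></div>\n'
    '<div class="other">something else</div>'
)

# Character class: fails as soon as the content contains d, i, v, ( or ).
char_class = re.compile(r'<div\s+class="leftTail">[^(div)]+</div>')

# Lazy quantifier: stops at the first </div> after each opening tag.
lazy = re.compile(r'<div\s+class="leftTail">.*?</div>', re.DOTALL)

char_class_hits = char_class.findall(page)
lazy_hits = lazy.findall(page)

print(char_class_hits)  # no matches: "video" and "daily" contain excluded letters
for block in lazy_hits:
    print(repr(block))
```

This only holds while the target div contains no nested div; the first `</div>` encountered ends the match regardless of which tag it closes.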
