
Python: Need to extract tag content from html page using regex, but not BeautifulSoup

I have a requirement wherein I have to extract the content inside <raw> tags. For example, I need to extract abcd and efgh from this html snippet:
<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>

I used this code in Python:
re.match(r'.*raw.*(.*)/raw.*', DATA)

But this does not return any substring. I'm not good at regex, so a correction to this or a new solution would help me a great deal. I am not supposed to use external libs (due to a restriction at my company).

Your company really needs to rethink its policy. Rewriting an XML parser is a complete waste of time; there are already several for Python. Some are included in the stdlib, so if you can import re you should also be allowed to import xml.etree.ElementTree or anything else listed at http://docs.python.org/library/markup.html .

You really should be using one of those. There is no sense duplicating all of that work.
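As a rough illustration of the stdlib route, here is a minimal sketch that assumes Python 3 and uses html.parser.HTMLParser. That module is chosen here because the sample contains valueless attributes such as somestuff, which a strict XML parser like xml.etree.ElementTree would likely reject as not well-formed. The class name RawExtractor and its attributes are hypothetical, made up for this example, and nested <raw> tags are not handled.

```python
from html.parser import HTMLParser


class RawExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside each <raw> element."""

    def __init__(self):
        super().__init__()
        self.in_raw = False   # True while we are between <raw ...> and </raw>
        self.results = []     # one string per <raw> element seen

    def handle_starttag(self, tag, attrs):
        if tag == 'raw':
            self.in_raw = True
            self.results.append('')

    def handle_endtag(self, tag):
        if tag == 'raw':
            self.in_raw = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self.in_raw:
            self.results[-1] += data


DATA = '<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>'

parser = RawExtractor()
parser.feed(DATA)
parser.close()
print(parser.results)  # ['abcd', 'efgh']
```

This uses nothing outside the standard library, so it should fit the same restriction that allows import re.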

Non-greedy matching (*?) can do this easily, at least for your example:

re.findall(r'<raw[^>]*?>(.*?)</raw>', DATA)
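For completeness, a self-contained sketch of the regex approach applied to the snippet from the question. The variable names are illustrative, and the re.DOTALL flag and the \b word boundary are additions not in the one-liner above: the flag lets tag content span several lines, and the boundary avoids matching tags that merely start with "raw". The sketch also shows what the original attempt does: re.match does find a match, but the greedy .* in front of the group consumes the text, so the captured group is empty.

```python
import re

DATA = '<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>'

# Original attempt: the greedy ".*" before the group swallows the content,
# leaving the capture group with nothing to match.
original = re.match(r'.*raw.*(.*)/raw.*', DATA)
print(repr(original.group(1)))  # ''

# [^>]* skips the attributes inside the opening tag; the non-greedy (.*?)
# stops at the first closing </raw> instead of the last one.
RAW_TAG = re.compile(r'<raw\b[^>]*>(.*?)</raw>', re.DOTALL)

contents = RAW_TAG.findall(DATA)
print(contents)  # ['abcd', 'efgh']
```

This works for flat input like the example. It will not cope with nested <raw> elements or with a ">" character inside an attribute value.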
