
Python: Need to extract tag content from html page using regex, but not BeautifulSoup

Original: 2011-04-28 06:25:14 | 9 2 | Tags: python, html, regex, tags, substring

I have a requirement to extract the content inside <raw> tags. For example, I need to extract abcd and efgh from this HTML snippet:
<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>

I used this code in my Python script:
re.match(r'.*raw.*(.*)/raw.*', DATA)

But this is not returning any substring. I'm not good at regex, so a correction to this or a new solution would help me a great deal. I am not supposed to use external libs (due to a restriction at my company).

2 answers

Your company really needs to rethink their policy. Rewriting an XML parser is a complete waste of time; there are already several for Python. Some are included in the stdlib, so if you can import re you should also be allowed to import xml.etree.ElementTree or anything else listed at http://docs.python.org/library/markup.html .

You really should be using one of those. There is no sense in duplicating all of that work.
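As a rough illustration of the stdlib route, here is a minimal sketch using html.parser.HTMLParser. Note the swap: the answer names xml.etree.ElementTree, but the sample snippet has a valueless attribute (<raw somestuff>), which is not well-formed XML, so this sketch assumes the more lenient html.parser module instead. The class name RawTagExtractor and its attributes are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class RawTagExtractor(HTMLParser):
    """Collect the text inside each <raw> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.contents = []    # one string per completed <raw> element
        self._buffer = None   # list of text chunks while inside <raw>, else None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == "raw":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears while a <raw> element is open.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "raw" and self._buffer is not None:
            self.contents.append("".join(self._buffer))
            self._buffer = None


DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"

parser = RawTagExtractor()
parser.feed(DATA)
parser.close()
print(parser.contents)  # ['abcd', 'efgh']
```

This sketch does not handle nested <raw> elements; it is only meant to show that no external library is needed.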

Using non-greedy matching (*?) can do this easily, at least for your example.

re.findall(r'<raw[^>]*?>(.*?)</raw>', DATA)
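To make the difference concrete, the sketch below runs both the pattern from the question and the non-greedy one against the sample snippet. The original re.match call does match, but the greedy .* before the group swallows the content, so the captured group is empty. The re.DOTALL flag and the RAW_TAG name are assumptions added here so that content spanning several lines is also captured; they are not part of the answer above.

```python
import re

DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"

# Pattern from the question: the match succeeds, but the greedy ".*"
# in front of the group consumes the text, leaving the group empty.
greedy = re.match(r'.*raw.*(.*)/raw.*', DATA)
print(repr(greedy.group(1)))  # ''

# Non-greedy pattern from the answer. re.DOTALL is an added assumption:
# it lets "." match newlines in case a <raw> body spans several lines.
RAW_TAG = re.compile(r'<raw[^>]*?>(.*?)</raw>', re.DOTALL)
print(RAW_TAG.findall(DATA))  # ['abcd', 'efgh']
```

This works for flat, well-behaved input like the example. It will not cope with nested <raw> tags or a ">" inside an attribute value, which is where a real parser becomes the safer choice.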

Related questions

Extract a certain content from html using python BeautifulSoup

Python: How to extract URL from HTML Page using BeautifulSoup?

How to extract Table contents from an HTML page using BeautifulSoup in Python?

Python Regex to extract content of src of an html tag?

How to extract Facebook page URL from HTML <a> tag using Regex in Python?

Extract links from html page using BeautifulSoup

Extract data from html page using Beautifulsoup

Extract text only except the content of script tag from html with BeautifulSoup

Extract Columns from html using Python (Beautifulsoup)

Extract JSON from HTML Script tag with BeautifulSoup in Python


