
Python: Need to extract tag content from html page using regex, but not BeautifulSoup

Original post: 2011-04-28 06:25:14 0 2 python/ html/ regex/ tags/ substring

I have a requirement to extract the content inside <raw> tags. For example, I need to extract abcd and efgh from this html snippet:
<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>

I used this code in my Python script:
re.match(r'.*raw.*(.*)/raw.*', DATA)

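But this does not return any substring. I'm not good at regex, so a correction to this or a new solution would help me a lot. I'm not supposed to use external libraries (due to some restrictions at my company).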

2 Answers

Your company really needs to rethink its policy. Rewriting an XML parser is a complete waste of time; Python already has several. Some are included in the stdlib, so if you can import re you should also be allowed to import xml.etree.ElementTree or anything else listed at http://docs.python.org/library/markup.html .

You really should use one of those. There is no point in duplicating all that work.
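As a rough illustration of the stdlib route, here is a minimal sketch using html.parser.HTMLParser (Python 3 naming). It is used here instead of xml.etree.ElementTree because the sample snippet has attributes without values (`<raw somestuff>`), which a strict XML parser will reject. The class name RawTextExtractor and the helper extract_raw are made up for this example, and the sketch assumes <raw> elements are not nested.

```python
from html.parser import HTMLParser


class RawTextExtractor(HTMLParser):
    """Collect the text inside each <raw>...</raw> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inside_raw = False   # True while we are between <raw> and </raw>
        self.current = []         # text pieces of the element being read
        self.results = []         # one string per completed <raw> element

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == "raw":
            self.inside_raw = True
            self.current = []

    def handle_data(self, data):
        if self.inside_raw:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "raw" and self.inside_raw:
            self.results.append("".join(self.current))
            self.inside_raw = False


def extract_raw(html):
    """Return the text content of every <raw> element in html."""
    parser = RawTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"
print(extract_raw(DATA))  # ['abcd', 'efgh']
```

Only the standard library is imported, so this should stay within a "no external libraries" rule.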

At least for your example, this is easy to do with non-greedy matching (*?):

re.findall(r'<raw[^>]*?>(.*?)</raw>', DATA)
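To make the difference concrete, here is a small sketch that runs both patterns against the snippet from the question. In the original attempt the match itself succeeds, but the greedy `.*` before the group consumes everything up to `/raw`, so the group captures an empty string. The name RAW_CONTENT and the re.DOTALL flag are additions for this example (the flag is only an assumption that tag content may span several lines); they are not part of the answer above.

```python
import re

DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"

# Original attempt: re.match finds a match, but the greedy ".*" in front of
# the group swallows the text, leaving the capture group empty.
m = re.match(r'.*raw.*(.*)/raw.*', DATA)
print(repr(m.group(1)))  # ''

# Non-greedy version: "[^>]*?" skips the tag's attributes and "(.*?)" stops
# at the first closing </raw>, so each element is captured separately.
RAW_CONTENT = re.compile(r'<raw[^>]*?>(.*?)</raw>', re.DOTALL)
print(RAW_CONTENT.findall(DATA))  # ['abcd', 'efgh']
```

Note that re.findall scans the whole string and returns every captured group, whereas re.match only tries to match once, at the start of the string. This pattern will not cope with nested <raw> elements.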


