Python regular expression to extract HTML paragraphs

Question

I'm trying to extract paragraphs from HTML using the following line of code:

paragraphs = re.match(r'<p>.{1,}</p>', html)

but it returns None even though I know the HTML contains paragraphs. Why?

Answer 1

Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
...     <div>
...         <p>text1</p>
...         <p></p>
...         <p>text2</p>
...     </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']

Note that text=True helps to filter out empty paragraphs. In Beautiful Soup 4.4.0 and later, the same argument is also available under the name string.

Answer 2

Make sure you use re.search (or re.findall) instead of re.match, which only attempts to match at the beginning of the string (your HTML almost certainly does not begin with a <p> tag).

Also note that your search is currently greedy, meaning it will return everything between the first <p> tag and the last </p>, which is definitely not what you want. Try

re.findall(r'<p(\s.*?)?>(.*?)</p>', response.text, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)

instead. The question mark makes the quantifier lazy, so the regex stops matching at the first closing </p> tag, and findall returns every match, whereas search returns only the first one.
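The difference between the greedy and the lazy pattern can be sketched as follows. The sample string and the name `html` are assumptions for illustration (standing in for `response.text` above); the question does not show its actual input. Because the suggested pattern has two groups, findall returns a list of tuples, one per match.

```python
import re

# Hypothetical sample input; the original question does not show its HTML.
html = '<div><p>text1</p><p class="note">text2</p></div>'

# Greedy: .{1,} consumes as much as possible, so a single match runs
# from the first <p> to the last </p>.
greedy = re.findall(r'<p>.{1,}</p>', html)

# Lazy: .*? stops at the first </p>. With two groups in the pattern,
# findall yields (attributes, content) tuples.
lazy = re.findall(r'<p(\s.*?)?>(.*?)</p>', html,
                  flags=re.IGNORECASE | re.DOTALL)

# Keep only the paragraph content from each tuple.
paragraphs = [content for _attrs, content in lazy]

print(greedy)      # ['<p>text1</p><p class="note">text2</p>']
print(paragraphs)  # ['text1', 'text2']
```

The first element of each tuple holds the tag's attributes, or an empty string when the tag has none.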

Answer 3

You should be using re.search instead of re.match. The former scans the entire string for a match, whereas the latter only matches if the pattern is at the beginning of the string.
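A minimal sketch of that difference, using the pattern from the question. The sample string is an assumption, since the question's actual HTML is not shown.

```python
import re

# Hypothetical sample input; it starts with <div>, not <p>.
html = "<div>\n<p>text1</p>\n<p>text2</p>\n</div>"
pattern = r'<p>.{1,}</p>'

# re.match only tries the pattern at position 0, where the string has "<div>".
match_result = re.match(pattern, html)

# re.search scans forward and returns the first position where the pattern fits.
search_result = re.search(pattern, html)

print(match_result)            # None
print(search_result.group(0))  # <p>text1</p>
```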

That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
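For the HTMLParser route, here is a rough sketch using the standard library (the module is html.parser in Python 3 and was HTMLParser in Python 2). The class name ParagraphCollector and the sample data are made up for illustration, and the sketch assumes well-formed, non-nested <p> elements.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text of each non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # list of text chunks while inside a <p>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Text outside of a <p> element is ignored.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None


# Assumed sample input, mirroring a small <div> with one empty paragraph.
data = """
<div>
    <p>text1</p>
    <p></p>
    <p>text2</p>
</div>
"""
collector = ParagraphCollector()
collector.feed(data)
collector.close()
print(collector.paragraphs)  # ['text1', 'text2']
```

Text inside nested inline tags (for example <b> within a <p>) is still collected, because handle_data fires for it while the buffer is active.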


Answer 1
11, accepted, 2015-12-29 01:44:43

Answer 2
6, 2015-12-29 01:40:57

Answer 3
2, 2015-12-29 01:40:33
