
Python regex to extract HTML paragraph

I'm trying to extract paragraphs from HTML using the following line of code:

paragraphs = re.match(r'<p>.{1,}</p>', html)

but it returns None even though I know the HTML contains paragraphs. Why?

Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
...     <div>
...         <p>text1</p>
...         <p></p>
...         <p>text2</p>
...     </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']

Note that text=True helps to filter out empty paragraphs (newer Beautiful Soup releases call this argument string).

Make sure you use re.search (or re.findall) instead of re.match, which only matches at the very beginning of the string (your HTML almost certainly does not begin with a <p> tag).

Note also that your pattern is currently greedy, meaning it will match everything between the first <p> tag and the last </p>, which is almost certainly not what you want. Try

re.findall(r'<p(\s.*?)?>(.*?)</p>', response.text, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)

instead. The question mark makes the regex stop at the first closing </p> tag, and findall returns every match, whereas search returns only the first. Because the pattern contains two groups, each result is a tuple of (attributes, text).
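To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample string is made up for illustration and is not from the original question; the first pattern is the one from the question and the second is the one suggested above.

```python
import re

# Hypothetical sample input, made up for this illustration.
html = '<div><p>text1</p><p class="x">text2</p></div>'

# Greedy: .{1,} runs on to the last </p>, so both paragraphs merge into one match.
greedy = re.findall(r'<p>.{1,}</p>', html)

# Non-greedy: .*? stops at the first </p>. With two groups, findall
# returns (attributes, body) tuples, so keep only the body.
pattern = re.compile(r'<p(\s.*?)?>(.*?)</p>', re.IGNORECASE | re.DOTALL)
lazy = [body for _attrs, body in pattern.findall(html)]

print(greedy)  # ['<p>text1</p><p class="x">text2</p>']
print(lazy)    # ['text1', 'text2']
```

This still breaks on nested or unclosed tags, so treat it as a quick fix rather than a robust extractor.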

You should be using re.search instead of re.match. The former scans the whole string for a match, whereas the latter only matches if the pattern is at the beginning of the string.
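A small sketch of that difference, using a made-up sample string (an assumption, not the asker's actual HTML) and the pattern from the question:

```python
import re

html = '<div><p>text1</p></div>'  # hypothetical sample input
pattern = r'<p>.{1,}</p>'

at_start = re.match(pattern, html)   # None: the string starts with <div>, not <p>
anywhere = re.search(pattern, html)  # scans forward and finds the paragraph

print(at_start)           # None
print(anywhere.group(0))  # <p>text1</p>
```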

That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
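For the standard-library route, a rough sketch with html.parser.HTMLParser (the Python 3 location of HTMLParser) might look like the following. The ParagraphCollector class and the sample input are hypothetical names made up for this illustration, and the sketch assumes every <p> is explicitly closed and not nested.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: gathers the text of each non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # list of text chunks while inside a <p>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Collect text only while a paragraph is open.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None


collector = ParagraphCollector()
collector.feed("<div><p>text1</p><p></p><p>text <b>2</b></p></div>")
collector.close()
print(collector.paragraphs)  # ['text1', 'text 2']
```

Unlike the regex, this keeps the text of inline tags such as <b> inside a paragraph without any extra pattern work.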
